Data Center Design Case Studies


DATA CENTER DESIGN CASE STUDIES FROM DMVPN AND WAN EDGE TO SERVER CONNECTIVITY AND VIRTUAL APPLIANCES


DATA CENTER DESIGN CASE STUDIES
Ivan Pepelnjak, CCIE #1354 Emeritus

Copyright © 2014 ipSpace.net AG
Fifth revision, November 2014

- Added Replacing the Central Firewall chapter
- Added Combine Physical and Virtual Appliances in a Private Cloud chapter
- Added Scale-Out Private Cloud Infrastructure chapter
- Added Simplify Workload Migration with Virtual Appliances chapter

WARNING AND DISCLAIMER

This book is designed to provide information about real-life data center design scenarios. Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information is provided on an “as is” basis. The authors and ipSpace.net shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.


CONTENT AT A GLANCE

FOREWORD
INTRODUCTION
1 BGP CONVERGENCE OPTIMIZATION
2 INTEGRATING INTERNET VPN WITH MPLS/VPN WAN
3 BGP ROUTING IN DMVPN ACCESS NETWORK
4 REDUNDANT DATA CENTER INTERNET CONNECTIVITY
5 EXTERNAL ROUTING WITH LAYER-2 DATA CENTER INTERCONNECT
6 DESIGNING A PRIVATE CLOUD NETWORK INFRASTRUCTURE
7 REDUNDANT SERVER-TO-NETWORK CONNECTIVITY
8 REPLACING THE CENTRAL FIREWALL
9 COMBINE PHYSICAL AND VIRTUAL APPLIANCES IN A PRIVATE CLOUD
10 HIGH-SPEED MULTI-TENANT ISOLATION
11 SCALE-OUT PRIVATE CLOUD INFRASTRUCTURE
12 SIMPLIFY WORKLOAD MIGRATION WITH VIRTUAL APPLIANCES


CONTENTS

FOREWORD
INTRODUCTION

1 BGP CONVERGENCE OPTIMIZATION
  BRIEF NETWORK DESCRIPTION
  SOLUTION – EXECUTIVE OVERVIEW
  DETAILED SOLUTION
  CONCLUSIONS

2 INTEGRATING INTERNET VPN WITH MPLS/VPN WAN
  IP ROUTING OVERVIEW
  DESIGN REQUIREMENTS
  SOLUTION OVERVIEW
  OSPF AS THE INTERNET VPN ROUTING PROTOCOL
  BGP AS THE INTERNET VPN ROUTING PROTOCOL
  BGP-BASED WAN NETWORK DESIGN AND IMPLEMENTATION GUIDANCE
  CONCLUSIONS

3 BGP ROUTING IN DMVPN ACCESS NETWORK
  EXISTING IP ROUTING OVERVIEW
  IBGP VERSUS EBGP
  USING EBGP IN A DMVPN NETWORK
  USING IBGP IN A DMVPN NETWORK
  DESIGN RECOMMENDATIONS

4 REDUNDANT DATA CENTER INTERNET CONNECTIVITY
  SIMPLIFIED TOPOLOGY
  IP ADDRESSING AND ROUTING
  DESIGN REQUIREMENTS
  SOLUTION OVERVIEW
  LAYER-2 WAN BACKBONE
  LAYER-3 WAN BACKBONE
  CONCLUSIONS

5 EXTERNAL ROUTING WITH LAYER-2 DATA CENTER INTERCONNECT
  DESIGN REQUIREMENTS
  SOLUTION OVERVIEW
  DETAILED SOLUTION – OSPF
  DETAILED SOLUTION – INTERNET ROUTING WITH BGP
  CONCLUSIONS

6 DESIGNING A PRIVATE CLOUD NETWORK INFRASTRUCTURE
  COLLECT THE REQUIREMENTS
  PRIVATE CLOUD PLANNING AND DESIGN PROCESS
  DESIGNING THE NETWORK INFRASTRUCTURE
  CONCLUSIONS

7 REDUNDANT SERVER-TO-NETWORK CONNECTIVITY
  DESIGN REQUIREMENTS
  VLAN-BASED VIRTUAL NETWORKS
  OVERLAY VIRTUAL NETWORKS
  CONCLUSIONS

8 REPLACING THE CENTRAL FIREWALL
  FROM PACKET FILTERS TO STATEFUL FIREWALLS
  DESIGN ELEMENTS
  DESIGN OPTIONS
  BEYOND THE TECHNOLOGY CHANGES

9 COMBINE PHYSICAL AND VIRTUAL APPLIANCES IN A PRIVATE CLOUD
  EXISTING NETWORK SERVICES DESIGN
  SECURITY REQUIREMENTS
  PRIVATE CLOUD INFRASTRUCTURE
  NETWORK SERVICES IMPLEMENTATION OPTIONS
  THE REALITY INTERVENES

10 HIGH-SPEED MULTI-TENANT ISOLATION
  INTERACTION WITH THE PROVISIONING SYSTEM
  COMMUNICATION PATTERNS
  STATELESS OR STATEFUL TRAFFIC FILTERS?
  PACKET FILTERS ON LAYER-3 SWITCHES
  PACKET FILTERS ON X86-BASED APPLIANCES
  CONCLUSIONS

11 SCALE-OUT PRIVATE CLOUD INFRASTRUCTURE
  CLOUD INFRASTRUCTURE FAILURE DOMAINS
  WORKLOAD MOBILITY CONSIDERATIONS
  CONCLUSIONS

12 SIMPLIFY WORKLOAD MIGRATION WITH VIRTUAL APPLIANCES
  EXISTING APPLICATION WORKLOADS
  INFRASTRUCTURE CHALLENGES
  INCREASE WORKLOAD MOBILITY WITH VIRTUAL APPLIANCES
  BUILDING A NEXT GENERATION INFRASTRUCTURE
  ORCHESTRATION CHALLENGES
  CONCLUSIONS


TABLE OF FIGURES

Figure 1-1: Network core and Internet edge
Figure 2-1: Existing MPLS VPN WAN network topology
Figure 2-2: Proposed new network topology
Figure 2-3: OSPF areas
Figure 2-4: OSPF-to-BGP route redistribution
Figure 2-5: Inter-site OSPF route advertisements
Figure 2-6: DMVPN topology
Figure 2-7: OSPF areas in the Internet VPN
Figure 2-8: OSPF external route origination
Figure 2-9: Multiple OSPF processes with two-way redistribution
Figure 2-10: BGP sessions in the WAN infrastructure
Figure 2-11: Single AS number used on all remote sites
Figure 2-12: BGP enabled on every layer-3 device between two BGP routers
Figure 2-13: BGP routing information redistributed into OSPF
Figure 2-14: Dedicated VLAN between BGP edge routers
Figure 2-15: Remote site logical network topology and routing
Figure 2-16: Central site logical network topology and BGP+OSPF routing
Figure 3-1: Planned DMVPN network
Figure 3-2: BGP routing in existing WAN backbone
Figure 4-1: Redundant data centers and their internet connectivity
Figure 4-2: Simplified topology with non-redundant internal components
Figure 4-3: BGP sessions between Internet edge routers and the ISPs
Figure 4-4: Outside WAN backbone in the redesigned network
Figure 4-5: Point-to-point Ethernet links implemented with EoMPLS on DCI routers
Figure 4-6: Single stretched VLAN implemented with VPLS across L3 DCI
Figure 4-7: Two non-redundant stretched VLANs provide sufficient end-to-end redundancy
Figure 4-8: Virtual topology using point-to-point links
Figure 4-9: Virtual topology using stretched VLANs
Figure 4-10: Full mesh of IBGP sessions between Internet edge routers
Figure 4-11: Virtual Device Contexts: dedicated management planes and physical interfaces
Figure 4-12: Virtual Routing and Forwarding tables: shared management, shared physical interfaces
Figure 4-13: BGP core in WAN backbone
Figure 4-14: MPLS core in WAN backbone
Figure 4-15: Default routing in WAN backbone
Figure 5-1: Redundant data centers and their internet connectivity
Figure 5-2: IP addressing and routing with external networks
Figure 5-3: Simplified topology with non-redundant components
Figure 5-4: Primary/backup external routing
Figure 5-5: OSPF routing used in enterprise WAN network
Figure 5-6: EBGP and IBGP sessions on data center edge routers
Figure 5-7: BGP local preference in prefix origination and propagation
Figure 5-8: BGP next hop processing
Figure 7-1: Redundant server-to-network connectivity
Figure 7-2: Layer-2 fabric with two spine nodes
Figure 7-3: Layer-2 leaf-and-spine fabric using layer-2 ECMP technology
Figure 7-4: VMs pinned to a hypervisor uplink
Figure 7-5: Server-to-network links bundled in a single LAG
Figure 7-6: VM-to-uplink pinning with two hypervisor hosts connected to the same pair of ToR switches
Figure 7-7: Suboptimal traffic flow with VM-to-uplink pinning
Figure 7-8: Traffic flow between orphan ports
Figure 7-9: LACP between a server and ToR switches
Figure 7-10: Optimal traffic flow with MLAG
Figure 7-11: Redundant server connectivity requires the same IP subnet on adjacent ToR switches
Figure 7-12: A single uplink is used without server-to-ToR LAG
Figure 7-13: All uplinks are used by a Linux host using balance-tlb bonding mode
Figure 7-14: All ToR switches advertise IP subnets with the same cost
Figure 7-15: IP routing with stackable switches
Figure 7-16: Layer-2 fabric between hypervisor hosts
Figure 7-17: Optimal flow of balance-tlb traffic across a layer-2 fabric
Figure 7-18: LAG between a server and adjacent ToR switches
Figure 8-19: Packet filters protecting individual servers
Figure 8-20: VM NIC firewalls
Figure 8-21: Per-application firewalls
Figure 8-22: High-performance WAN edge packet filters combined with a proxy server
Figure 9-1: Centralized network services implemented with physical appliances
Figure 9-2: Centralized network services implemented with physical appliances
Figure 9-3: Applications accessing external resources
Figure 9-4: Hybrid architecture combining physical and virtual appliances
Figure 10-1: Containers and data center backbone
Figure 10-2: Interaction with the provisioning/orchestration system
Figure 10-3: Traffic control appliances
Figure 10-4: Layer-3 traffic control devices
Figure 10-5: Bump-in-the-wire traffic control devices
Figure 10-6: Routing protocol adjacencies across traffic control appliances
Figure 11-1: Standard cloud infrastructure rack
Figure 11-2: Planned WAN connectivity
Figure 11-3: Cloud infrastructure components
Figure 11-4: Single orchestration system used to manage multiple racks
Figure 11-5: VLAN transport across IP infrastructure
Figure 12-1: Some applications use application-level load balancing solutions
Figure 12-2: Typical workload architecture with network services embedded in the application stack
Figure 12-3: Most applications use external services
Figure 12-4: Application tiers are connected through central physical appliances
Figure 12-5: Virtual appliance NIC connected to overlay virtual network
Figure 12-6: Virtual router advertises application-specific IP prefix via BGP


FOREWORD

Ivan Pepelnjak first came onto my radar in 2001, when I was tasked with migrating a large multinational network from IGRP to EIGRP. As a CCIE I was (over)confident in my EIGRP abilities. I had already deployed EIGRP for a smaller organization; how different could this new challenge be? A few months into the project, I realized that designing a large-scale EIGRP network was quite different from configuring a small one. Fortunately I stumbled across Ivan’s EIGRP Network Design Solutions book. So began a cycle which continues to this day – I take on a new project, look for a definitive resource to understand the technologies, and discover that Ivan is the authoritative source. MPLS, L3VPN, IS-IS… Ivan has covered it all!

Several years ago I was lucky enough to meet Ivan in person through my affiliation with Gestalt IT’s Tech Field Day program. We also ‘shared the mic’ on the Packet Pushers Podcast on several occasions. Through these opportunities I discovered Ivan to be a remarkably thoughtful collaborator. He has a knack for asking exactly the right question to direct your focus to the specific information you need. Some of my favorite interactions with Ivan center on his answering my ‘could I do this?’ inquiry with a ‘yes, it is possible, but you don’t want to do that because…’ response. For a great example of this, take a look at the “OSPF as the Internet VPN Routing Protocol” section in chapter 2 of this book.

I have found during my career as a network technology instructor that case studies are the best method for teaching network design. Presenting an actual network challenge and explaining the thought process (including rejected solutions) greatly assists students in building the required skill base to create their own scalable designs. This book uses this structure to explain diverse enterprise design challenges, from DMVPN to data centers to Internet routing. Over the next few hours of reading you will accompany Ivan on many real-world consulting assignments. You have the option of implementing the designs as presented (I can assure you they work ‘out of the box’!), or you can use the rich collection of footnotes and references to customize the solution to your exact needs. In either event, I am confident that you will find these case studies as useful as I have found them to be.

Jeremy Filliben
Network Architect / Trainer
CCDE #20090003, CCIE #3851


INTRODUCTION

I started the ExpertExpress experiment a few years ago and it was unexpectedly successful; I was amazed at how many people asked me to help design or troubleshoot their network. Most of the engagements touched at least one data center element, be it server virtualization, data center network core, WAN edge, or connectivity between data centers and customer sites or the public Internet. I also noticed the same challenges appearing over and over, and decided to document them in a series of ExpertExpress case studies, which eventually resulted in this book.

The book has two major parts: data center WAN edge and WAN connectivity, and internal data center infrastructure. In the first part, I’ll walk you through common data center WAN edge challenges:

- Optimizing BGP routing on data center WAN edge routers to reduce the downtime and brownouts following link or node failures (chapter 1);
- Integrating an MPLS/VPN network provided by one or more service providers with a DMVPN-over-Internet backup network (chapter 2);
- Building a large-scale DMVPN network connecting one or more data centers with thousands of remote sites (chapter 3);
- Implementing redundant data center connectivity and routing between active/active data centers and the outside world (chapter 4);
- External routing combined with layer-2 data center interconnect (chapter 5).


The data center infrastructure part of the book covers these topics:

- Designing a private cloud network infrastructure (chapter 6);
- Redundant server-to-network connectivity (chapter 7);
- Replacing the central firewall with a scale-out architecture combining packet filters, virtual inter-subnet firewalls and VM NIC firewalls (chapter 8);
- Combining physical and virtual appliances in a private cloud (chapter 9);
- High-speed multi-tenant isolation (chapter 10).

The final part of the book covers scale-out architectures, multiple data centers and disaster recovery:

- Scale-out private cloud infrastructure using standardized building blocks (chapter 11);
- Simplified workload migration and disaster recovery with virtual appliances (chapter 12);
- Active-active data centers and scale-out application architectures (chapter 13 – coming in late 2014).

I hope you’ll find the selected case studies useful. Should you have any follow-up questions, please feel free to send me an email (or use the contact form @ ipSpace.net/Contact); I’m also available for short online consulting engagements.

Happy reading!

Ivan Pepelnjak
September 2014


1 BGP CONVERGENCE OPTIMIZATION

IN THIS CHAPTER:
BRIEF NETWORK DESCRIPTION
SOLUTION – EXECUTIVE OVERVIEW
DETAILED SOLUTION
  ENABLE BFD
  ENABLE BGP NEXT HOP TRACKING
  REDUCE THE BGP UPDATE TIMERS
  REDUCE THE NUMBER OF BGP PREFIXES
  BGP PREFIX INDEPENDENT CONVERGENCE
CONCLUSIONS


A large multi-homed content provider has experienced a number of outages and brownouts in the Internet edge of their data center network. The brownouts were caused by high CPU load on the Internet edge routers, leading to unstable forwarding tables and packet loss after EBGP peering session loss. This document describes the steps the customer could take to improve the BGP convergence and reduce the duration of Internet connectivity brownouts.

Figure 1-1: Network core and Internet edge


BRIEF NETWORK DESCRIPTION

The customer’s data center has two Internet-facing edge routers, each of them connected to a different ISP through a 1GE uplink. Both routers are dual-attached to core switches (see Figure 1-1). ISP-A is the primary ISP; the connection to ISP-B is used only when the uplink to ISP-A fails.

The edge routers (GW-A and GW-B) have EBGP sessions with the ISPs and receive full Internet routing (approximately 450,000 BGP prefixes [1]). GW-A and GW-B exchange BGP routes over an IBGP session to ensure consistent forwarding behavior. GW-A has a higher default local preference; GW-B thus always prefers IBGP routes received from GW-A over its own EBGP routes.

The core routers (Core-1 and Core-2) don’t run BGP; they run OSPF with GW-A and GW-B, and receive a default route from both Internet edge routers (the details of default route origination are out of scope).
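The higher default local preference on GW-A is typically implemented with a single BGP knob; a minimal sketch follows (the AS number and local-preference value are illustrative assumptions, not taken from the customer’s configuration):

! GW-A: EBGP routes received from ISP-A get local preference 200 and are
! advertised over IBGP to GW-B, beating GW-B's own EBGP routes (default 100)
router bgp 65000
 bgp default local-preference 200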

SOLUTION – EXECUTIVE OVERVIEW

The temporary blackouts and prolonged brownouts following an Internet uplink loss are caused by BGP convergence issues. As with any other routing protocol, a router running BGP has to take the following steps to adapt its forwarding tables to a link or neighbor loss:

1. The BGP routing process detects a link or neighbor loss.
2. Invalid routes are removed from the local BGP, routing and forwarding tables. Alternate routes already present in the BGP table could be installed at this point.
3. Updates are sent to other BGP neighbors withdrawing the lost routes.
4. BGP neighbors process the withdrawal updates, select alternate BGP best routes, and install them in their routing and forwarding tables.
5. BGP neighbors advertise their new best routes.
6. The router processes incoming BGP updates, selects new best routes, and installs them in its routing and forwarding tables.

Neighbor loss detection can be improved with Bidirectional Forwarding Detection (BFD) [2], fast neighbor failover [3] or BGP next-hop tracking. BGP update propagation can be fine-tuned with BGP update timers. The other elements of the BGP convergence process are harder to tune; they depend primarily on the processing power of the routers’ CPUs and the underlying packet forwarding hardware.

Some router vendors offer functionality that can be used to pre-install backup paths in BGP tables (BGP best external paths) and forwarding tables (BGP Prefix Independent Convergence [4]). These features can be used to redirect the traffic to the backup Internet connection even before the BGP convergence process is complete.

Alternatively, you can significantly reduce the CPU load of the Internet edge routers, and improve the BGP convergence time, by reducing the number of BGP prefixes accepted from the upstream ISPs.

[1] BGP Routing Table Analysis Reports – http://bgp.potaroo.net/
[2] Bidirectional Forwarding Detection – http://wiki.nil.com/Bidirectional_Forwarding_Detection_(BFD)
[3] Fast BGP Neighbor Loss Detection – http://wiki.nil.com/Fast_BGP_neighbor_loss_detection
[4] Prefix Independent Convergence – Fixing the FIB Bottleneck – http://blog.ipspace.net/2012/01/prefix-independent-convergence-pic.html


Finally, you might need to replace your Internet edge routers with devices that have processing power matching today’s Internet routing table sizes.

DETAILED SOLUTION

The following design or configuration changes can be made to improve the BGP convergence process:

- Enable BFD on EBGP sessions;
- Enable BFD on IBGP sessions;
- Enable BGP next-hop tracking;
- Reduce the BGP update timers;
- Reduce the number of EBGP prefixes;
- Enable BGP Prefix Independent Convergence (if available).

Design and configuration changes described in this document might be disruptive and might result in temporary or long-term outages. Always prepare a deployment and rollback plan, and change your network configuration during a maintenance window. You can use the ExpertExpress service for a design/deployment check, design review, or a second opinion.

ENABLE BFD

Bidirectional Forwarding Detection (BFD) has been available in major Cisco IOS and Junos software releases for several years. Service providers prefer BFD over BGP hold time adjustments because high-end routers process BFD on the linecard, whereas the BGP hold timer relies on the BGP process (running on the main CPU) sending keepalive packets over the BGP TCP session.


BFD has to be supported and configured on both ends of a BGP session; check with your ISP before configuring BFD on your Internet-facing routers. To configure BFD with BGP, use the following configuration commands on Cisco IOS (placeholders shown in angle brackets):

interface <interface>
 bfd interval <ms> min_rx <ms> multiplier <count>
!
router bgp 65000
 neighbor <ip-address> remote-as <as-number>
 neighbor <ip-address> fall-over bfd

Although you can configure BFD timers in the milliseconds range, don’t set them too low. BFD should detect a BGP neighbor loss in a few seconds; you wouldn’t want a short-term link glitch to start a CPU-intensive BGP convergence process.

Cisco IOS and Junos support BFD on EBGP sessions. BFD on IBGP sessions has been available since Junos release 8.3. Multihop BFD is available in Cisco IOS, but there’s still no support for BFD on IBGP sessions.
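Purely as an illustration of the template above (the interface name, timer values, neighbor address and remote AS number are assumptions, not values taken from this network), a filled-in configuration could look like this:

! 500 ms x 5 = 2.5-second detection time - within the "few seconds" guideline above
interface GigabitEthernet0/0
 bfd interval 500 min_rx 500 multiplier 5
!
router bgp 65000
 neighbor 192.0.2.1 remote-as 64496
 neighbor 192.0.2.1 fall-over bfd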

ENABLE BGP NEXT HOP TRACKING

BGP next hop tracking removes routes from the BGP table (and subsequently from the IP routing and forwarding tables) a few seconds after the BGP next hop becomes unreachable. BGP next hop tracking deployed on GW-B could trigger BGP best path selection even before GW-B starts receiving BGP route withdrawal messages from GW-A.


BGP next-hop tracking is enabled by default on Cisco IOS; you can adjust the tracking interval with the bgp nexthop trigger delay router configuration command. In environments using default routing, you should limit the valid prefixes that can be used for BGP next hop tracking with the bgp nexthop route-map router configuration command.

If you want to use BGP next hop tracking in the primary/backup Internet access scenario described in this document:

- Do not change the BGP next hop on IBGP updates with the neighbor next-hop-self router configuration command. Example: routes advertised from GW-A to GW-B must have the original next hop from the ISP-A router.
- Advertise the IP subnets of the ISP uplinks into the IGP (example: OSPF) from GW-A and GW-B.
- Use a route-map with BGP next hop tracking to prevent the default route advertised by GW-A and GW-B from being used as a valid path toward the external BGP next hop.

When the link between GW-A and ISP-A fails, GW-A revokes the directly connected IP subnet from its OSPF LSA, enabling GW-B to start the BGP best path selection process before it receives BGP updates from GW-A.

BGP next-hop tracking detects link failures that result in the loss of an IP subnet. It cannot detect an EBGP neighbor failure unless you combine it with BFD-based static routes.
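A Cisco IOS sketch of the approach described in this section; the prefix-list and route-map names are illustrative assumptions, and the AS number follows the earlier configuration example:

! Don't let the default route resolve external BGP next hops
ip prefix-list NHT-FILTER seq 5 deny 0.0.0.0/0
ip prefix-list NHT-FILTER seq 10 permit 0.0.0.0/0 le 32
!
route-map NEXTHOP-TRACKING permit 10
 match ip address prefix-list NHT-FILTER
!
router bgp 65000
 ! react to next-hop reachability changes after 5 seconds (adjust to taste)
 bgp nexthop trigger delay 5
 bgp nexthop route-map NEXTHOP-TRACKING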


REDUCE THE BGP UPDATE TIMERS

BGP update timers (the interval between consecutive BGP updates) are configured for individual neighbors, peer groups, or peer templates. The default IBGP value used by Cisco IOS was 5 seconds (updates were sent to IBGP neighbors every 5 seconds). This value was reduced to zero (updates are sent immediately) in Cisco IOS releases 12.2SR, 12.4T and 15.0.

BGP update timer adjustment should be one of the last steps in the convergence tuning process; in most scenarios you’ll gain more by reducing the number of BGP prefixes accepted by the Internet edge routers.
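On Cisco IOS the per-neighbor update timer is the advertisement interval; a minimal sketch that sends updates immediately (the neighbor address is an illustrative assumption for the IBGP peer):

router bgp 65000
 ! send updates to this neighbor without any batching delay
 neighbor 192.0.2.2 advertisement-interval 0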

REDUCE THE NUMBER OF BGP PREFIXES

Global Internet routing tables contain almost 450,000 prefixes, most of them irrelevant to content providers with localized content. Reducing the number of BGP prefixes in the BGP table can significantly reduce the CPU load after a link or neighbor loss, and thus drastically improve BGP convergence time.

This solution is ideal if you can guarantee that all upstream providers always have visibility of all Internet destinations. In case of a peering dispute that might not be true, and your network might lose connectivity to some far-away destinations.

It’s impossible to document a generic BGP prefix filtering policy. You should always accept prefixes originated by upstream ISPs, their customers, and their peering partners. In most cases, filters based on AS-path lengths work well (example: accept prefixes that have no more than three distinct AS numbers in the AS path). Some ISPs attach BGP communities to the BGP prefixes they advertise to their customers to help the customers implement well-tuned filters [5].

When building an AS-path filter, consider the impact of AS-path prepending on your filter and use regular expressions that can match the same AS number multiple times [6]. Example: matching up to three AS numbers in the AS path might not be good enough, as another AS might use AS-path prepending to enforce primary/backup path selection [7].

After deploying inbound BGP update filters, your autonomous system no longer belongs to the default-free zone [8] – your Internet edge routers need default routes from the upstream ISPs to reach destinations that are no longer present in their BGP tables. BGP default routes could be advertised by upstream ISPs, requiring no further configuration on the Internet edge routers.
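A sketch of the AS-path-length filter discussed above, written for Cisco IOS; the access-list number, neighbor address and local AS number are illustrative assumptions, and the regular expressions assume the BGP implementation supports parenthesized groups with back-references (\1, \3, \5) so that a prepended AS number is counted only once:

! Accept routes with at most three distinct AS numbers in the AS path,
! regardless of how many times each of them has been prepended
ip as-path access-list 100 permit ^(_[0-9]+)(\1)*$
ip as-path access-list 100 permit ^(_[0-9]+)(\1)*(_[0-9]+)(\3)*$
ip as-path access-list 100 permit ^(_[0-9]+)(\1)*(_[0-9]+)(\3)*(_[0-9]+)(\5)*$
!
router bgp 65000
 neighbor 192.0.2.1 filter-list 100 in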

[5] BGP Community Guides – http://onesc.net/communities/
[6] Filter Excessively-Prepended AS Paths – http://wiki.nil.com/Filter_excessively_prepended_BGP_paths
[7] BGP Essentials: AS Path Prepending – http://blog.ipspace.net/2008/02/bgp-essentials-as-path-prepending.html
[8] Default-free zone – http://en.wikipedia.org/wiki/Default-free_zone


If the upstream ISPs don’t advertise BGP default routes, or if you can’t trust the ISPs to perform responsible default route origination [9], use local static default routes pointing to far-away next hops. Root name servers are usually a suitable choice. The default routes on the Internet edge routers should use next hops that are far away, so that next-hop reachability reflects the health of the upstream ISP’s network.

The use of root DNS servers as next hops of static routes does not mean that the traffic will be sent to the root DNS servers, just toward them.
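A sketch of such a far-away static default route; the address shown is a.root-servers.net, used purely as an illustration of a distant destination that every healthy upstream should be able to reach:

! Recursive static default - usable only while a BGP route toward
! the far-away next hop is present in the routing table
ip route 0.0.0.0 0.0.0.0 198.41.0.4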

BGP PREFIX INDEPENDENT CONVERGENCE

BGP PIC is a feature that allows a router to pre-install alternate routes to BGP destinations in its forwarding table. The drastic changes caused by an external link failure or EBGP session failure are thus easier to implement in the forwarding table. Furthermore, the forwarding tables can be changed even before the final route selection is performed in the BGP table.

BGP PIC is a recently introduced feature that does not necessarily interoperate with all other BGP features one might want to use. Its deployment and applicability are left for further study.

[9] Responsible Generation of BGP Default Route – http://blog.ipspace.net/2011/09/responsible-generation-of-bgp-default.html


CONCLUSIONS

BGP neighbor loss detection can be significantly improved by deploying Bidirectional Forwarding Detection (BFD). The backup Internet edge router can use BGP next-hop tracking to detect primary uplink loss and adjust its forwarding tables before receiving BGP updates from the primary Internet edge router.

To reduce the CPU overload and slow convergence caused by massive changes in the BGP, routing and forwarding tables following a link or EBGP session failure:

- Reduce the number of BGP prefixes accepted by the Internet edge routers;
- Upgrade the Internet edge routers to a more CPU-capable platform.


2 INTEGRATING INTERNET VPN WITH MPLS/VPN WAN

IN THIS CHAPTER:
IP ROUTING OVERVIEW
DESIGN REQUIREMENTS
SOLUTION OVERVIEW
OSPF AS THE INTERNET VPN ROUTING PROTOCOL
  BENEFITS AND DRAWBACKS OF OSPF IN INTERNET VPN
BGP AS THE INTERNET VPN ROUTING PROTOCOL
  IBGP OR EBGP?
  AUTONOMOUS SYSTEM NUMBERS
  INTEGRATION WITH LAYER-3 SWITCHES
  SUMMARY OF DESIGN CHOICES
BGP-BASED WAN NETWORK DESIGN AND IMPLEMENTATION GUIDANCE
  REMOTE SITES
  CENTRAL SITE
  INTERNET VPN ROUTING POLICY ADJUSTMENTS
CONCLUSIONS


A large enterprise (the Customer) has a WAN backbone based on MPLS/VPN service offered by a regional Service Provider (SP). The service provider has deployed Customer Premises Equipment (CPE) routers at remote sites. Customer routers at the central site are connected directly to the SP Provider Edge (PE) routers with 10GE uplinks as shown in Figure 2-1.

Figure 2-1: Existing MPLS VPN WAN network topology

The traffic in the Customer’s WAN network has been increasing steadily, prompting the Customer to either increase the MPLS/VPN bandwidth or deploy an alternate VPN solution. The Customer decided to trial IPsec VPN over the public Internet, initially as a backup, and potentially as the primary WAN connectivity solution.

The Customer will deploy new central site routers to support the IPsec VPN service. These routers will terminate the IPsec VPN tunnels and provide whatever other services are needed (example: QoS, routing protocols) to the IPsec VPNs.


New low-end routers connected to the existing layer-3 switches will be deployed at the remote sites to run the IPsec VPN (Figure 2-2 shows the proposed new network topology).

Figure 2-2: Proposed new network topology


IP ROUTING OVERVIEW

The Customer is using OSPF as the sole routing protocol and would prefer using OSPF in the new IPsec VPN. OSPF routes are exchanged between the Customer’s core routers and the SP’s PE routers, and between the Customer’s layer-3 switches and the SP’s CPE routers at remote sites. The Customer’s central site is in OSPF area 0; all remote sites belong to OSPF area 51.

Figure 2-3: OSPF areas

The only external connectivity remote customer sites have is through the MPLS/VPN SP backbone – the OSPF area number used at those sites is thus irrelevant and the SP chose to use the same OSPF area on all sites to simplify the CPE router provisioning and maintenance.


CPE routers deployed at the Customer’s remote sites act as Customer Edge (CE) routers from the MPLS/VPN perspective. The Service Provider uses BGP as the routing protocol between its PE and CE routers, redistributing BGP routes into OSPF at the CPE routers for further propagation into the Customer’s remote sites. OSPF routes received from the customer equipment (central site routers and remote site layer-3 switches) are redistributed into the BGP used by the SP’s MPLS/VPN service, as shown in Figure 2-4.

Figure 2-4: OSPF-to-BGP route redistribution

The CPE routers redistributing remote site OSPF routes into the SP’s BGP are not PE routers. The OSPF routes that get redistributed into BGP thus do not have OSPF-specific extended BGP communities, lacking any indication that they came from an OSPF routing process. These routes are therefore redistributed as external OSPF routes into the central site’s OSPF routing process by the SP’s PE routers.

The OSPF routes advertised to the PE routers from the central site get the extended BGP communities when they’re redistributed into MP-BGP, but since the extended VPNv4 BGP communities don’t propagate to CE routers running BGP with the PE routers, the CPE routers don’t receive the extended communities indicating the central site routes originated as OSPF routes. The CPE routers thus redistribute routes received from other Customer sites as external OSPF routes into the OSPF protocol running at remote sites.

Summary: All customer routes appear as external OSPF routes at all other customer sites (see Figure 2-5 for details).

Figure 2-5: Inter-site OSPF route advertisements


DESIGN REQUIREMENTS

The VPN-over-Internet solution must satisfy the following requirements:

- Dynamic routing: the solution must support dynamic routing over the new VPN infrastructure to ensure fast failover on MPLS/VPN or Internet VPN failures;
- Flexible primary/backup configuration: Internet VPN will be used as a backup path until it has been thoroughly tested. It might become the primary connectivity option in the future;
- Optimal traffic flow: Traffic to/from sites reachable only over the Internet VPN (due to local MPLS/VPN failures) should not traverse the MPLS/VPN infrastructure. Traffic between an MPLS/VPN-only site and an Internet VPN-only site should traverse the central site;
- Hub-and-spoke or peer-to-peer topology: Internet VPN will be used in a hub-and-spoke topology (hub = central site). The topology will be migrated to a peer-to-peer (any-to-any) overlay network when the Internet VPN becomes the primary WAN connectivity solution;
- Minimal configuration changes: Deployment of Internet VPN connectivity should not require major configuration changes in the existing remote site equipment. Central site routers will probably have to be reconfigured to take advantage of the new infrastructure;
- Minimal disruption: The introduction of Internet VPN connectivity must not disrupt the existing WAN network connectivity;
- Minimal dependence on MPLS/VPN provider: After the Internet VPN infrastructure has been established and integrated with the existing MPLS/VPN infrastructure (which might require configuration changes on the SP-managed CPE routers), the changes in the traffic flow must not require any intervention on the SP-managed CPE routers.


SOLUTION OVERVIEW

Internet VPN will be implemented with the DMVPN technology to meet the future requirement of a peer-to-peer topology. Each central site router will be a hub router in its own DMVPN subnet (one hub router per DMVPN subnet), with the remote site routers having two DMVPN tunnels (one for each central site hub router) as shown in Figure 2-6.

Figure 2-6: DMVPN topology

Please refer to the DMVPN: From Basics to Scalable Networks and DMVPN Designs webinars for more DMVPN details. This case study focuses on the routing protocol design considerations.


The new VPN infrastructure could use OSPF or BGP as its routing protocol. The Customer would prefer to use OSPF, but the design requirements and the specifics of the existing MPLS/VPN WAN infrastructure make OSPF deployment exceedingly complex.

Using BGP as the Internet VPN routing protocol would introduce a new routing protocol in the Customer’s network. While the network designers and operations engineers would have to master a new technology (on top of DMVPN) before production deployment of the Internet VPN, the reduced complexity of a BGP-only WAN design more than offsets that investment.

OSPF AS THE INTERNET VPN ROUTING PROTOCOL

A network designer would encounter major challenges when trying to use OSPF as the Internet VPN routing protocol:

1. Routes received through the MPLS/VPN infrastructure are inserted as external OSPF routes into the intra-site OSPF routing protocol. Routes received through the Internet VPN infrastructure must be worse than the MPLS/VPN-derived OSPF routes, requiring them to be external routes as well.
2. MPLS/VPN and Internet VPN routers must use the same OSPF external route type to enable easy migration of the Internet VPN from backup to primary connectivity solution. The only difference between the two sets of routes should be their OSPF metric.
3. Multiple sites must not be in the same area. The OSPF routing process would prefer intra-area routes (over the Internet VPN infrastructure) to MPLS/VPN routes in a design with multiple sites in the same area.
4. Even though each site must be at least an independent OSPF area, every site must use the same OSPF area number to preserve the existing intra-site routing protocol configuration.


Challenges #3 and #4 significantly limit the OSPF area design options. Remote site OSPF areas cannot extend to the Internet VPN hub router – the hub router would automatically merge multiple remote sites into the same OSPF area. Every remote site router must therefore be an Area Border Router (ABR) or Autonomous System Border Router (ASBR). The only design left is an OSPF backbone area spanning the whole Internet VPN.

Figure 2-7: OSPF areas in the Internet VPN

The requirement to advertise site routes as external OSPF routes further limits the design options. While the requirements could be met by remote site and core site layer-3 switches advertising directly connected subnets (server and client subnets) as external OSPF routes (as shown in Figure 2-8), such a design requires configuration changes on the subnet-originating switch whenever you want to adjust the WAN traffic flow (which can only be triggered by changes in OSPF metrics).


Figure 2-8: OSPF external route origination

The only OSPF design that would meet the OSPF constraints listed above and the design requirements (particularly the minimal configuration changes and minimal disruption requirements) is the design displayed in Figure 2-9, where:

- Every site runs an independent copy of the OSPF routing protocol;
- The Internet VPN WAN network runs a separate OSPF process;
- Internet VPN edge routers perform two-way redistribution between the intra-site OSPF process and the Internet VPN OSPF process.


Figure 2-9: Multiple OSPF processes with two-way redistribution

BENEFITS AND DRAWBACKS OF OSPF IN INTERNET VPN

There’s a single benefit of running OSPF over the Internet VPN: familiarity with an existing routing protocol and mastery of its configuration and troubleshooting procedures.

The drawbacks are also exceedingly clear: the only design that meets all the requirements is complex, as it requires multiple OSPF routing processes and parallel two-way redistribution (site-to-MPLS/VPN and site-to-Internet VPN) between multiple routing domains.


It's possible to implement such a design with safety measures that prevent redistribution (and traffic forwarding) loops, but the result is not an error-resilient design – minor configuration changes or omissions could result in network-wide failures.

BGP AS THE INTERNET VPN ROUTING PROTOCOL

A BGP-only WAN network design extends the existing BGP routing protocol running within the Service Provider's MPLS/VPN network and between the PE- and CPE routers to all WAN routers. As shown in Figure 2-10, BGP sessions would be established between:

  Remote site CPE routers and adjacent Internet VPN routers;

  Central site WAN edge routers (MPLS/VPN CE routers and Internet VPN routers);

  Central site CE routers and SP's PE routers;

  Remote site Internet VPN routers and central site Internet VPN routers.


Figure 2-10: BGP sessions in the WAN infrastructure

BGP local preference (within a single autonomous system) or Multi-Exit Discriminator (across autonomous systems) would be used to select the optimum paths, and BGP communities would be used to influence local preference between autonomous systems.

The BGP-only design seems exceedingly simple, but there are still a number of significant design choices to make:

  IBGP or EBGP sessions: Which routers would belong to the same autonomous system (AS)? Would the network use one AS per site or would a single AS span multiple sites?

  Autonomous system numbers: There are only 1024 private AS numbers. Would the design reuse a single AS number on multiple sites or would each site have a unique AS number?

  Integration with CPE routers: Would the Internet VPN routers use the same AS number as the CPE routers on the same site?

  Integration with layer-3 switches: Would the central site and remote site layer-3 switches participate in BGP or would they interact with the WAN edge routers through OSPF?

IBGP OR EBGP?

There are numerous differences between EBGP and IBGP, and their nuances sometimes make it hard to decide whether to use EBGP or IBGP in a specific scenario. However, the following guidelines usually result in simple and stable designs:

  If you plan to use BGP as the sole routing protocol in (a part of) your network, use EBGP.

  If you're using BGP in combination with another routing protocol that will advertise reachability of BGP next hops, use IBGP. You can also use IBGP between routers residing in a single subnet.

  It's easier to implement routing policies with EBGP. Large IBGP deployments need route reflectors for scalability, and some BGP implementations don't apply BGP routing policies on reflected routes.

  All routers in the same AS should have the same view of the network and the same routing policies.

  EBGP should be used between routers in different administrative (or trust) domains.

Applying these guidelines to our WAN network gives the following results:

  EBGP will be used across the DMVPN network. A second routing protocol running over DMVPN would be needed to support IBGP across DMVPN, resulting in an overly complex network design.

  IBGP will be used between central site WAN edge routers. The existing central site routing protocol can be used to propagate BGP next hop information between WAN edge routers (or they could belong to the same layer-2 subnet).

  EBGP will be used between central site MPLS/VPN CE routers and the Service Provider's PE routers (incidentally, most MPLS/VPN implementations don't support IBGP as the PE-CE routing protocol).

  EBGP or IBGP could be used between remote site Internet VPN routers and CPE routers. While IBGP between these routers reduces the overall number of autonomous systems needed, the MPLS/VPN service provider might insist on using EBGP. Throughout the rest of this document we'll assume the Service Provider agreed to use IBGP between CPE routers and Internet VPN routers on the same remote site.

AUTONOMOUS SYSTEM NUMBERS

The decision to use IBGP between CPE routers and Internet VPN routers simplifies the AS number decision: remote sites will use the existing AS numbers assigned to CPE routers. The Customer has to get an extra private AS number (coordinated with the MPLS/VPN SP) for the central site, or use a public AS number for that site.

In a scenario where the SP insists on using EBGP between CPE routers and Internet VPN routers the Customer has three options:

  Reuse a single AS number for all remote sites even though each site has to be an individual AS;

  Use a set of private AS numbers that the MPLS/VPN provider isn't using on its CPE routers and number the remote sites individually;

  Use 4-octet AS numbers reserved for private use by RFC 6996.

Unless you're ready to deploy 4-octet AS numbers, the first option is the only viable one for networks with more than a few hundred remote sites (because there are only 1024 private AS numbers). The second option is feasible for smaller networks with up to a few hundred remote sites. The last option is clearly the best one, but requires router software with 4-octet AS number support (4-octet AS numbers are supported by all recent Cisco and Juniper routers).

Routers using 4-octet AS numbers (defined in RFC 4893) can interoperate with legacy routers that don't support this BGP extension; the Service Provider's CPE routers thus don't have to support 4-byte AS numbers (customer routers would appear to belong to AS 23456).

Default loop prevention filters built into BGP reject EBGP updates with the local AS number in the AS path, making it impossible to pass routes between two remote sites when they use the same AS number. If you have to reuse the same AS number on multiple remote sites, disable the BGP loop prevention filters as shown in Figure 2-11 (using the neighbor allowas-in command on Cisco IOS). While you could use default routing from the central site to solve this problem, the default routing solution cannot be used when you have to implement the any-to-any traffic flow requirement.


Figure 2-11: Single AS number used on all remote sites

Some BGP implementations might filter outbound BGP updates, omitting BGP prefixes with the AS number of the BGP neighbor in the AS path from the updates sent to that neighbor. Cisco IOS does not contain outbound filters based on the neighbor AS number; if you use routers from other vendors, check the documentation.


INTEGRATION WITH LAYER-3 SWITCHES

In a typical BGP-based network all core routers (and layer-3 switches) run BGP to get a consistent view of the forwarding information. At the very minimum, all layer-3 elements in every possible path between two BGP routers have to run BGP to be able to forward IP datagrams between the BGP routers, as illustrated in Figure 2-12. There are several workarounds you can use when dealing with non-BGP devices in the forwarding path:

  Redistribute BGP routes into IGP (example: OSPF). Non-BGP devices in the forwarding path thus receive BGP information through their regular IGP (see Figure 2-13).


Figure 2-12: BGP enabled on every layer-3 device between two BGP routers


Figure 2-13: BGP routing information redistributed into OSPF

  Enable MPLS forwarding. Ingress network edge devices running BGP label IP datagrams with MPLS labels assigned to BGP next hops to ensure the datagrams get delivered to the proper egress device; intermediate nodes perform label lookup, not IP lookup, and thus don't need the full IP forwarding information.

  Create a dedicated layer-2 subnet (VLAN) between BGP edge routers and advertise default route to other layer-3 devices as shown in Figure 2-14. This design might result in suboptimal routing, as other layer-3 devices forward IP datagrams to the nearest BGP router, which might not be the optimal exit point.

Figure 2-14: Dedicated VLAN between BGP edge routers

We'll extend BGP to core layer-3 switches on the central site (these switches will also act as BGP route reflectors) and use a VLAN between the Service Provider's CPE router and the Internet VPN router on remote sites.


SUMMARY OF DESIGN CHOICES

The following parameters will be used in the BGP-based WAN network design:

  Each site is an independent autonomous system;

  Each site uses a unique AS number assigned to it by the MPLS/VPN SP;

  IBGP will be used between routers within the same site;

  EBGP will be used between sites;

  Remote site layer-3 switches will continue to use OSPF as the sole routing protocol;

  Core central site layer-3 switches will participate in BGP routing and will become BGP route reflectors.

BGP-BASED WAN NETWORK DESIGN AND IMPLEMENTATION GUIDANCE

The following sections describe individual components of the BGP-based WAN network design.

REMOTE SITES

An Internet VPN router will be added to each remote site. It will be in the same subnet as the existing CPE router. The remote site layer-3 switch might have to be reconfigured if it used a layer-3 physical interface on the port to which the CPE router was connected; the layer-3 switch should use a VLAN (or SVI) interface to connect to the new router subnet.


An IBGP session will be established between the CPE router and the adjacent Internet VPN router. This is the only modification that has to be performed on the CPE router.

The Internet VPN router will redistribute internal OSPF routes received from the layer-3 switch into BGP. External OSPF routes will not be redistributed, preventing routing loops between BGP and OSPF. The OSPF-to-BGP route redistribution does not impact existing routing, as the CPE router already does it; it's configured on the Internet VPN router solely to protect the site against a CPE router failure.

The Internet VPN router will redistribute EBGP routes into OSPF (redistribution of IBGP routes is disabled by default on most router platforms). The OSPF external route metric will be used to influence the forwarding decision of the adjacent layer-3 switch. The OSPF metric of redistributed BGP routes could be hard-coded into the Internet VPN router configuration or based on BGP communities attached to EBGP routes. The BGP community-based approach is obviously more flexible and will be used in this design. The following routing policies will be configured on the Internet VPN routers (a configuration sketch follows the list):

  EBGP routes with BGP community 65000:1 (Backup route) will get local preference 50. These routes will be redistributed into OSPF as external type 2 routes with metric 10000.

  EBGP routes with BGP community 65000:2 (Primary route) will get local preference 150. These routes will be redistributed into OSPF as external type 1 routes with metric 1.
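A minimal Cisco IOS sketch of these routing policies is shown below. The neighbor address, BGP AS number and OSPF process number are assumptions used only for illustration; the community values, local preferences and OSPF metrics are the ones listed above:

ip community-list standard Backup permit 65000:1
ip community-list standard Primary permit 65000:2
!
route-map DMVPN-in permit 10
 ! backup routes get a lower local preference
 match community Backup
 set local-preference 50
route-map DMVPN-in permit 20
 ! primary routes get a higher local preference
 match community Primary
 set local-preference 150
!
route-map BGP-into-OSPF permit 10
 match community Backup
 set metric 10000
 set metric-type type-2
route-map BGP-into-OSPF permit 20
 match community Primary
 set metric 1
 set metric-type type-1
!
router bgp 65001
 neighbor 192.0.2.1 route-map DMVPN-in in
!
router ospf 1
 redistribute bgp 65001 subnets route-map BGP-into-OSPF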

Furthermore, the remote site Internet VPN router has to prevent potential route leakage between MPLS/VPN and Internet VPN WAN networks. A route leakage between the two WAN networks might turn one or more remote sites into transit sites forwarding traffic between the two WAN networks.


NO-EXPORT BGP community will be used on the Internet VPN router to prevent the route leakage (a sketch follows the list):

  NO-EXPORT community will be set on updates sent over the IBGP session to the CPE router, preventing the CPE router from advertising routes received from the Internet VPN router into the MPLS/VPN WAN network.

  NO-EXPORT community will be set on updates received over the IBGP session from the CPE router, preventing leakage of these updates into the Internet VPN WAN network.
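A minimal sketch of the NO-EXPORT handling on the remote site Internet VPN router might look like this; the CPE neighbor address and AS number are illustrative assumptions, and BGP community propagation has to be enabled on the session:

route-map Set-No-Export permit 10
 ! mark everything exchanged with the CPE router as NO-EXPORT
 set community no-export additive
!
router bgp 65001
 neighbor 10.1.1.1 remote-as 65001
 neighbor 10.1.1.1 send-community
 neighbor 10.1.1.1 route-map Set-No-Export out
 neighbor 10.1.1.1 route-map Set-No-Export in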

Figure 2-15 summarizes the remote site design.


Figure 2-15: Remote site logical network topology and routing

CENTRAL SITE

The following steps will be used to deploy BGP on the central site:

1. BGP will be configured on existing MPLS/VPN edge routers, on the new Internet VPN edge routers, and on the core layer-3 switches.

2. IBGP sessions will be established between all loopback interfaces of WAN edge switches and both core layer-3 switches10. Core layer-3 switches will be BGP route reflectors.

3. EBGP sessions will be established between MPLS/VPN edge routers and adjacent PE routers.

4. BGP community propagation11 will be configured on all IBGP and EBGP sessions. After this step, the central site BGP infrastructure is ready for routing protocol migration.

5. Internal OSPF routes will be redistributed into BGP on both core layer-3 switches. No other central site router will perform route redistribution. At this point, the PE routers start receiving central site routes through PE-CE EBGP sessions and prefer EBGP routes received from MPLS/VPN edge routers over OSPF routes received from the same routers.

6. Default route will be advertised from layer-3 switches into OSPF routing protocol (a configuration sketch of steps 5 and 6 follows the footnotes). Access-layer switches at the core site will have two sets of external OSPF routes: specific routes originated by the PE routers and default route originated by core layer-3 switches. They will still prefer the specific routes originated by the PE routers.

7. OSPF will be disabled on PE-CE links.

10 BGP Essentials: Configuring Internal BGP Sessions – http://blog.ioshints.info/2008/01/bgp-essentials-configuring-internal-bgp.html

11 BGP Essentials: BGP Communities – http://blog.ioshints.info/2008/02/bgp-essentials-bgp-communities.html
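On a core layer-3 switch, the route redistribution and default route origination from steps 5 and 6 could look roughly like this minimal sketch; the BGP AS number and OSPF process number are assumptions used only for illustration:

router bgp 65000
 ! step 5: inject internal OSPF routes into BGP
 redistribute ospf 1 match internal
!
router ospf 1
 ! step 6: advertise a default route into the intra-site OSPF
 default-information originate always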


At this point, the PE routers stop receiving OSPF routes from the CE routers. The only central site routing information they have is the set of EBGP routes received over the PE-CE EBGP sessions. Likewise, the core site access-layer switches stop receiving the specific remote site prefixes that were redistributed into OSPF on PE routers and rely exclusively on the default route advertised by the core layer-3 switches.

Figure 2-16 summarizes the central site IP routing design.

Figure 2-16: Central site logical network topology and BGP+OSPF routing


INTERNET VPN

Two sets of EBGP sessions are established across DMVPN subnets. Each central site Internet VPN router (DMVPN hub router) has EBGP sessions with remote site Internet VPN routers in the same DMVPN subnet (DMVPN spoke routers). BGP community propagation will be configured on all EBGP sessions.

ROUTING POLICY ADJUSTMENTS

The following changes will be made on central site Internet VPN routers to adjust the WAN network routing policies (a configuration sketch follows the list):

  VPN traffic flow through the central site: configure neighbor next-hop-self on DMVPN EBGP sessions. Central site Internet VPN routers start advertising their IP addresses as EBGP next hops for all EBGP prefixes, forcing the site-to-site traffic to flow through the central site.

  Any-to-any VPN traffic flow: configure no neighbor next-hop-self on DMVPN EBGP sessions. Default EBGP next hop processing will ensure that the EBGP routes advertised through the central site routers retain the optimal BGP next hop – IP address of the remote site if the two remote sites connect to the same DMVPN subnet, or IP address of the central site router in any other case.

  Internet VPN as the backup connectivity: Set BGP community 65000:1 (Backup route) on all EBGP updates sent from the central site routers. Remote site Internet VPN routers will lower the local preference of routes received over DMVPN EBGP sessions and thus prefer IBGP routes received from the CPE router (which got the routes over the MPLS/VPN WAN network).

  Internet VPN as the primary connectivity: Set BGP community 65000:2 (Primary route) on all EBGP updates sent from the central site routers. Remote site Internet VPN routers will increase the local preference of routes received over DMVPN EBGP sessions and thus prefer those routes to IBGP routes received from the CPE router.
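A sketch of the community-based policy on a central site Internet VPN (DMVPN hub) router is shown below for the backup-connectivity case; the peer-group name and neighbor statements are illustrative assumptions, and switching the community to 65000:2 would turn the Internet VPN into the primary path:

route-map DMVPN-out permit 10
 ! mark all routes sent to the spokes as backup routes
 set community 65000:1
!
router bgp 65000
 neighbor spokes peer-group
 neighbor spokes send-community
 neighbor spokes next-hop-self
 neighbor spokes route-map DMVPN-out out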

CONCLUSIONS

A design with a single routing protocol running in one part of the network (example: WAN network or within a site) is usually less complex than a design that involves multiple routing protocols and route redistribution.

When you have to combine MPLS/VPN WAN connectivity with any other WAN connectivity, you're forced to incorporate BGP used within the MPLS/VPN network into your network design. Even though MPLS/VPN technology supports multiple PE-CE routing protocols, the service providers rarely implement IGP PE-CE routing protocols with all the features you might need for successful enterprise WAN integration. Provider-operated CE routers are even worse, as they cannot propagate MPLS/VPN-specific information (extended BGP communities) into the enterprise IGP in which they participate.

A WAN network based on BGP is thus the only logical choice, resulting in a single protocol (BGP) being used in the WAN network. Incidentally, BGP provides a rich set of routing policy features, making your WAN network more flexible than it could have been were you using OSPF or EIGRP.


3  BGP ROUTING IN DMVPN ACCESS NETWORK

IN THIS CHAPTER:

EXISTING IP ROUTING OVERVIEW
IBGP VERSUS EBGP
IBGP AND EBGP BASICS
ROUTE PROPAGATION
BGP NEXT HOP PROCESSING
USING EBGP IN A DMVPN NETWORK
SPOKE SITES HAVE UNIQUE AS NUMBERS
USING EBGP WITH PHASE 1 DMVPN NETWORKS
REDUCING THE SIZE OF THE SPOKE ROUTERS’ BGP TABLE
AS NUMBER REUSE ACROSS SPOKE SITES
USING IBGP IN A DMVPN NETWORK
DESIGN RECOMMENDATIONS


A large enterprise (the Customer) has an existing international WAN backbone using BGP as the routing protocol. They plan to replace a regional access network with a DMVPN-based solution and want to extend the existing BGP routing protocol into the access network to be able to scale the access network to several thousand sites. The initial DMVPN access network should offer hub-and-spoke connectivity, with any-to-any traffic implemented at a later stage.

Figure 3-1: Planned DMVPN network


The Customer's design team is trying to answer these questions:

  Should they use Internal BGP (IBGP) or External BGP (EBGP) in the DMVPN access network?

  What autonomous system (AS) numbers should they use on remote (spoke) sites if they decide to use EBGP in the DMVPN access network?

EXISTING IP ROUTING OVERVIEW

The existing WAN network is already using BGP routing protocol to improve the overall scalability of the network. The WAN backbone is implemented as a single autonomous system using the Customer's public AS number.

IBGP sessions within the WAN backbone are established between loopback interfaces, and the Customer is using OSPF to exchange reachability information within the WAN backbone (non-backbone routes are transported in BGP).


The WAN backbone AS is using BGP route reflectors; new DMVPN hub routers will be added as route reflector clients to the existing BGP topology.

Figure 3-2: BGP routing in existing WAN backbone


IBGP VERSUS EBGP

The following characteristics of IBGP and EBGP have to be considered when deciding whether to use a single-AS or a multiple-AS design12:

  Route propagation in IBGP and EBGP;

  BGP next hop processing;

  Route reflector behavior and limitations (IBGP only);

  Typical IBGP and EBGP use cases.

IBGP AND EBGP BASICS

An autonomous system is defined as a set of routers under a common administration and using common routing policies. IBGP is used to exchange routing information between all BGP routers within an autonomous system. IBGP sessions are usually established between non-adjacent routers (commonly using loopback interfaces); routers rely on an IGP routing protocol (example: OSPF) to exchange intra-AS reachability information.

EBGP is used to exchange routing information between autonomous systems. EBGP sessions are usually established between directly connected IP addresses of adjacent routers. EBGP was designed to work without an IGP.

12 IBGP or EBGP in an Enterprise Network – http://blog.ioshints.info/2011/08/ibgp-or-ebgp-in-enterprise-network.html


ROUTE PROPAGATION

BGP loop prevention logic enforces an AS-level split horizon rule:

  Routes received from an EBGP peer are further advertised to all other EBGP and IBGP peers (unless an inbound or outbound filter drops the route);

  Routes received from an IBGP peer are advertised to EBGP peers but not to other IBGP peers.

BGP route reflectors (RR) use slightly modified IBGP route propagation rules:

  Routes received from an RR client are advertised to all other IBGP and EBGP peers. RR-specific BGP attributes are added to the routes advertised to IBGP peers to detect IBGP loops.

  Routes received from other IBGP peers are advertised to RR clients and EBGP peers.

The route propagation rules influence the setup of BGP sessions in a BGP network:

  EBGP sessions are established based on physical network topology;

  IBGP networks usually use a set of route reflectors (or a hierarchy of route reflectors); IBGP sessions are established between all BGP-speaking routers in the AS and the route reflectors.


BGP NEXT HOP PROCESSING

The BGP next hop processing rules13 heavily influence the BGP network design and dictate the need for an IGP in IBGP networks:

  A BGP router advertising a BGP route without a NEXT HOP attribute (locally originated BGP route) sets the BGP next hop to the source IP address of the BGP session over which the BGP route is advertised;

  A BGP router advertising a BGP route to an IBGP peer does not change the value of the BGP NEXT HOP attribute;

  A BGP router advertising a BGP route to an EBGP peer sets the value of the BGP NEXT HOP attribute to the source IP address of the EBGP session unless the existing BGP NEXT HOP value belongs to the same IP subnet as the source IP address of the EBGP session.

You can modify the default BGP next hop processing rules with the following Cisco IOS configuration options:

  The neighbor next-hop-self router configuration command sets the BGP NEXT HOP attribute to the source IP address of the BGP session regardless of the default BGP next hop processing rules.

13 BGP Next Hop Processing – http://blog.ioshints.info/2011/08/bgp-next-hop-processing.html


A BGP route reflector cannot change the BGP attributes of reflected routes14; neighbor next-hop-self is thus not effective on routes reflected by a route reflector.

Recent Cisco IOS releases support an extension to the neighbor next-hop-self command: the neighbor address next-hop-self all configuration command causes a router to change BGP next hops on all IBGP and EBGP routes sent to the specified neighbor.

  Inbound or outbound route maps can set the BGP NEXT HOP to any value with the set ip next-hop command (the outbound route maps are not applied to reflected routes). The most useful variant of this command is set ip next-hop peer-address used in an inbound route map; it sets the BGP next hop to the IP address of the BGP neighbor when used in an inbound route map, or to the source IP address of the BGP session when used in an outbound route map.
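As an illustration, a minimal route map using this option could look like the sketch below; the neighbor address is an assumption used only for this example:

route-map Set-NH-Peer permit 10
 ! rewrite the BGP next hop to the address of the advertising neighbor
 set ip next-hop peer-address
!
router bgp 65001
 neighbor 10.0.0.3 route-map Set-NH-Peer in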

14 BGP Route Reflectors – http://wiki.nil.com/BGP_route_reflectors


USING EBGP IN A DMVPN NETWORK

There are two major reasons to use BGP in DMVPN networks:

  The number of spoke sites connected to a single hub site is large enough to cause scalability issues in other routing protocols (example: OSPF);

  The customer wants to run a single routing protocol across multiple access networks (MPLS/VPN and DMVPN) to eliminate route redistribution and simplify overall routing design15.

In both cases, routing in the DMVPN network relies exclusively on BGP. BGP sessions are established between directly connected interfaces (across the DMVPN tunnel) and there's no IGP to resolve BGP next hops, making EBGP a better fit (at least based on standard BGP use cases).

The customer has two choices when numbering the spoke DMVPN sites:

  Each spoke DMVPN site could become an independent autonomous system with a unique AS number;

  All spoke DMVPN sites use the same autonomous system number.

SPOKE SITES HAVE UNIQUE AS NUMBERS

The customer could allocate a unique AS number to each DMVPN spoke site, resulting in a BGP design closely aligned with the originally intended use of BGP. The hub router would have numerous BGP neighbors that would have to be configured individually (one neighbor per spoke site). Typical hub router configuration is displayed in Printout 3-1.

15 See Integrating Internet VPN with MPLS/VPN WAN case study for more details.


router bgp 65000
 bgp log-neighbor-changes
 neighbor 10.0.0.1 remote-as 65001    ! Spoke A
 neighbor 10.0.0.2 remote-as 65002    ! Spoke B
 !
 ! more spoke sites
 !
 neighbor 10.0.2.2 remote-as 65000    ! Core route reflector

Printout 3-1: BGP configuration on the DMVPN hub router

The BGP configuration of the spoke routers would be even simpler: one or more BGP neighbors (DMVPN hub routers) and the list of prefixes advertised by the DMVPN spoke site (see Printout 3-2).

router bgp 65001
 bgp log-neighbor-changes
 network 10.0.101.0 mask 255.255.255.0
 network 192.168.1.0
 neighbor 10.0.0.3 remote-as 65000
 !
 ! add more DMVPN hub routers if needed
 !

Printout 3-2: BGP configuration on a DMVPN spoke site

For historic reasons the network BGP router configuration command requires the mask option unless the advertised prefix falls on major network boundary (class-A, B or C network).


Once the BGP sessions are established, the DMVPN hub and spoke routers start exchanging BGP prefixes. Prefixes advertised by DMVPN spoke sites retain their BGP next hop when the hub router propagates the prefix to other DMVPN spoke sites; for IBGP prefixes advertised by other BGP routers behind the hub router, the hub router sets the next hop to the IP address of its DMVPN interface (see Printout 3-3 for a sample BGP table on a DMVPN spoke router).

All printouts were generated in a test network connecting a DMVPN hub router (Hub) to two DMVPN spoke routers (RA and RB) with IP prefixes 192.168.1.0/24 and 192.168.2.0/24, and a core router with IP prefix 192.168.10.0/24.

RA#show ip bgp
BGP table version is 6, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  192.168.1.0      0.0.0.0                  0         32768 i
 *>  192.168.2.0      10.0.0.2                               0 65000 65002 i
 *>  192.168.10.0     10.0.0.3                               0 65000 i

Printout 3-3: BGP table on a DMVPN spoke router


The AS path of the BGP routes indicates the sequence of AS numbers a BGP update had to traverse before being received by the hub site; it’s thus easy to figure out which DMVPN spoke site advertises a specific prefix or which prefixes a DMVPN spoke site advertises.

USING EBGP WITH PHASE 1 DMVPN NETWORKS

The default EBGP next hop processing on the hub router works well with Phase 2 or Phase 3 DMVPN networks, but not for Phase 1 DMVPN networks – in those networks, the spoke sites cannot use any other next hop but the hub router's IP address.

You can adjust the EBGP next hop processing to the routing needs of Phase 1 DMVPN networks with the neighbor next-hop-self router configuration command configured on the hub router. After applying this command to our sample network (Printout 3-4) the hub router becomes the BGP next hop of all BGP prefixes received by DMVPN spoke sites (Printout 3-5).

router bgp 65000
 bgp log-neighbor-changes
 neighbor 10.0.0.1 remote-as 65001
 neighbor 10.0.0.1 next-hop-self
 neighbor 10.0.0.2 remote-as 65002
 neighbor 10.0.0.2 next-hop-self
 !
 ! more spoke sites
 !

Printout 3-4: BGP configuration of a Phase 1 DMVPN hub router


RA#show ip bgp
BGP table version is 7, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  192.168.1.0      0.0.0.0                  0         32768 i
 *>  192.168.2.0      10.0.0.3                               0 65000 65002 i
 *>  192.168.10.0     10.0.0.3                               0 65000 i

Printout 3-5: BGP routing table on a Phase 1 DMVPN spoke router

REDUCING THE SIZE OF THE SPOKE ROUTERS’ BGP TABLE

Spoke routers might not need to learn all BGP prefixes known to the hub router. In most cases, it's enough for the hub router to advertise the default route to the spoke routers (using the neighbor default-originate BGP router configuration command) and filter all other routes (for example, using an outbound prefix list that permits just the default route, as shown in Printout 3-6). See Printout 3-7 for the resulting BGP table on a spoke router.


router bgp 65000
 bgp log-neighbor-changes
 neighbor 10.0.0.1 remote-as 65001
 neighbor 10.0.0.1 default-originate
 neighbor 10.0.0.1 prefix-list permitDefault out
!
ip prefix-list permitDefault seq 5 permit 0.0.0.0/0

Printout 3-6: DMVPN hub router advertising just the default route to DMVPN spoke routers

RA#show ip bgp
BGP table version is 10, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  0.0.0.0          10.0.0.3                               0 65000 i
 *>  192.168.1.0      0.0.0.0                  0         32768 i

Printout 3-7: BGP routing table on a Phase 1 DMVPN spoke router

The default-route-only design works well for Phase 1 or Phase 3 DMVPN networks. Phase 2 DMVPN networks require a slightly more complex approach: the hub router has to send all BGP prefixes advertised by other spoke sites (to retain proper BGP next hop value), and the default route that replaces all other BGP prefixes.

DMVPN spoke sites might have to use IPsec frontdoor VRF if they rely on default routing within the enterprise network and toward the global Internet16.

You could use an outbound route map that matches on the BGP next hop value on the BGP hub router to achieve this goal (see Printout 3-8 for details).

router bgp 65000
 bgp log-neighbor-changes
 neighbor 10.0.0.1 remote-as 65001
 neighbor 10.0.0.1 default-originate
 neighbor 10.0.0.1 route-map hub-to-spoke out
!
ip access-list standard DMVPNsubnet
 permit 10.0.0.0 0.0.0.255
!
route-map hub-to-spoke permit 10
 match ip next-hop DMVPNsubnet

Printout 3-8: Phase 2 DMVPN hub router filtering non-DMVPN prefixes

16 See DMVPN: From Basics to Scalable Networks webinar for more details – http://www.ipspace.net/DMVPN


AS NUMBER REUSE ACROSS SPOKE SITES

Keeping track of spoke site AS numbers gets exceedingly complex in very large deployments; in those cases it's simpler to reuse the same AS number on all spoke sites.

Even though all spoke sites use the same AS number, there's no need for a full mesh of IBGP sessions (or route reflectors) between spoke routers. All BGP updates are propagated through the hub router.

Since all EBGP neighbors (spoke sites) belong to the same autonomous system, it's possible to use dynamic BGP neighbors configured with the bgp listen BGP router configuration command, significantly reducing the size of BGP configuration on the hub router (see Printout 3-9 for details).

router bgp 65000
 bgp log-neighbor-changes
 bgp listen range 10.0.0.0/24 peer-group spokes
 neighbor spokes peer-group
 neighbor spokes remote-as 65001
 neighbor spokes default-originate

Printout 3-9: BGP configuration on the DMVPN hub router

BGP loop detection drops all EBGP updates that contain the local AS number in the AS path; spoke sites thus discard all inbound updates originated by other spoke sites. A sample spoke router BGP table is shown in Printout 3-10 – note the lack of spoke-originated prefixes.


RA>show ip bgp
BGP table version is 15, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  0.0.0.0          10.0.0.3                               0 65000 i
 *>  192.168.1.0      0.0.0.0                  0         32768 i
 *>  192.168.10.0     10.0.0.3                               0 65000 i

Printout 3-10: BGP table on DMVPN spoke router (prefixes originated by other spokes are missing)

The default BGP loop prevention behavior might be ideal in DMVPN Phase 1 or Phase 3 networks (see Using EBGP with Phase 1 DMVPN Networks and Reducing the Size of the Spoke Routers’ BGP Table for more details), but is not appropriate for DMVPN Phase 2 networks. In DMVPN Phase 2 networks we have to disable the BGP loop prevention on spoke site routers with the neighbor allowas-in command17 (sample spoke router configuration is in Printout 3-11).

17 The use of this command in similarly-designed MPLS/VPN networks is described in detail in the MPLS and VPN Architectures book.


router bgp 65001
 bgp log-neighbor-changes
 network 10.0.101.0 mask 255.255.255.0
 network 192.168.1.0
 neighbor 10.0.0.3 remote-as 65000
 neighbor 10.0.0.3 allowas-in 1

Printout 3-11: Disabling BGP loop prevention logic on a DMVPN spoke router

Disabling BGP loop prevention logic is dangerous – prefixes originated by a DMVPN spoke router are sent back to (and accepted by) the same spoke router (example: prefix 192.168.1.0 in Printout 3-12), and it's possible to get temporary forwarding loops or long-term instabilities in designs with multiple BGP-speaking hub routers. The maximum number of occurrences of the local AS number in the AS path specified in the neighbor allowas-in command should thus be kept as low as possible (the ideal value is one).


RA#show ip bgp
BGP table version is 16, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  0.0.0.0          10.0.0.3                               0 65000 i
 *   192.168.1.0      10.0.0.3                               0 65000 65001 i
 *>                   0.0.0.0                  0         32768 i
 *>  192.168.2.0      10.0.0.2                               0 65000 65001 i
 *>  192.168.10.0     10.0.0.3                               0 65000 i

Printout 3-12: Duplicate prefix on a DMVPN spoke router caused by a BGP update loop

Alternatively, one could adjust the AS path on updates sent by the DMVPN hub router with the neighbor as-override router configuration command18 (see Printout 3-13), which replaces all instances of neighbor AS number with local AS number. The resulting BGP table on a DMVPN spoke router is shown in Printout 3-14.

18 The neighbor as-override command is extensively described in MPLS and VPN Architectures book.


router bgp 65000
 bgp log-neighbor-changes
 bgp listen range 10.0.0.0/24 peer-group spokes
 neighbor spokes peer-group
 neighbor spokes remote-as 65001
 neighbor spokes as-override

Printout 3-13: Deploying AS-override on DMVPN hub router

RA>show ip bgp
BGP table version is 17, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  0.0.0.0          10.0.0.3                               0 65000 i
 *   192.168.1.0      10.0.0.3                               0 65000 65000 i
 *>                   0.0.0.0                  0         32768 i
 *>  192.168.2.0      10.0.0.2                               0 65000 65000 i
 *>  192.168.10.0     10.0.0.3                               0 65000 i

Printout 3-14: BGP table with modified AS paths on DMVPN spoke router


AS paths generated with the neighbor as-override router configuration command are indistinguishable from paths generated with AS-path prepending19, resulting in more complex troubleshooting. The neighbor as-override command should thus be used only when there's no viable alternative.

USING IBGP IN A DMVPN NETWORK

Unless the DMVPN spoke sites have to follow a third-party AS numbering convention (example: running DMVPN in parallel with the provider's MPLS/VPN), it seems easier to use IBGP than EBGP:

  Make the DMVPN hub router a route reflector;

  Establish IBGP sessions between hub- and spoke sites using directly-connected IP addresses belonging to the DMVPN tunnel interface.

IBGP hub router configuration using dynamic BGP neighbors is extremely simple, as evidenced by the sample configuration in Printout 3-15.

19 BGP Essentials: AS path prepending – http://blog.ipspace.net/2008/02/bgp-essentials-as-path-prepending.html


router bgp 65000
 bgp log-neighbor-changes
 bgp listen range 10.0.0.0/24 peer-group spokes
 neighbor spokes peer-group
 neighbor spokes remote-as 65000
 neighbor spokes route-reflector-client

Printout 3-15: IBGP configuration on DMVPN hub router

This approach works well in networks that use IBGP exclusively within the DMVPN network, as all IBGP next hops belong to the DMVPN network (and are thus reachable by all spoke routers). Designs satisfying this requirement include:

  Networks that don't use BGP beyond the boundaries of DMVPN access network (core WAN network might use an IGP like OSPF or EIGRP);

  Networks that run DMVPN BGP routing in a dedicated autonomous system.

In all other cases, lack of BGP next hop processing across IBGP sessions (explained in the BGP Next Hop Processing section) causes connectivity problems. For example, in our sample network the spoke routers cannot reach destinations beyond the DMVPN hub router – BGP refuses to use those prefixes because the DMVPN spoke router cannot reach the BGP next hop (Printout 3-16).


RA#show ip bgp
BGP table version is 9, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  192.168.1.0      0.0.0.0                  0         32768 i
 *>i 192.168.2.0      10.0.0.2                 0    100      0 i
 * i 192.168.10.0     10.0.2.2                 0    100      0 i

RA#show ip bgp 192.168.10.0
BGP routing table entry for 192.168.10.0/24, version 9
Paths: (1 available, no best path)
  Not advertised to any peer
  Refresh Epoch 2
  Local
    10.0.2.2 (inaccessible) from 10.0.0.3 (192.168.3.1)
      Origin IGP, metric 0, localpref 100, valid, internal
      Originator: 10.11.12.7, Cluster list: 192.168.3.1

Printout 3-16: DMVPN spoke routers cannot reach prefixes behind the DMVPN hub router


There are at least four approaches one can use to fix the IBGP next hop problem (a configuration sketch of the last one follows the list):

  Use default routing in the DMVPN network (see section Reducing the Size of the Spoke Routers’ BGP Table for more details) unless you're using Phase 2 DMVPN.

  Advertise the default route from the DMVPN hub router with the neighbor default-originate router configuration command. DMVPN spokes will use the default route to reach the IBGP next hop. Some versions of Cisco IOS might not use an IBGP route to resolve a BGP next hop – check the behavior of your target Cisco IOS version before deciding to use this approach.

  Change the IBGP next hop on all spoke routers with an inbound route map using the set ip next-hop peer-address route map configuration command. This approach increases the complexity of spoke site routers' configuration and is thus best avoided.

  Change the IBGP next hop on the DMVPN hub router with the neighbor next-hop-self all router configuration command. This feature was introduced recently and might not be available on the target DMVPN hub router.
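A minimal sketch of the last option, building on the IBGP hub configuration from Printout 3-15 and assuming a Cisco IOS release that supports the all keyword:

router bgp 65000
 bgp listen range 10.0.0.0/24 peer-group spokes
 neighbor spokes peer-group
 neighbor spokes remote-as 65000
 neighbor spokes route-reflector-client
 ! rewrite the next hop even on reflected IBGP routes
 neighbor spokes next-hop-self all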


DESIGN RECOMMENDATIONS

BGP is the only routing protocol that scales to several thousand DMVPN nodes (the target size of the DMVPN access network). The Customer's DMVPN access network should thus rely on BGP without an underlying IGP. Default routing fits the customer's current requirements (hub-and-spoke traffic); potential future direct spoke-to-spoke connectivity can be implemented with default routing and Phase 3 DMVPN.

Conclusion #1: Customer will use default routing over BGP. Hub router will advertise the default route (and no other BGP prefix) to the spokes.

Spoke routers could use static host routes to send IPsec traffic to the hub router in the initial deployment. Implementing spoke-to-spoke connectivity with static routes is time-consuming and error-prone, particularly in environments with dynamic spoke transport addresses. The customer would thus like to use default routing toward the Internet.

Conclusion #2: Customer will use IPsec frontdoor VRF with default routing toward the Internet.

The customer does not plan to connect spoke DMVPN sites to any other access network (example: MPLS/VPN), so they're free to choose any AS numbering scheme they wish. Any EBGP or IBGP design described in this document would meet the customer's routing requirements; IBGP is the easiest one to implement and modify should the future needs change (assuming the DMVPN hub router supports the neighbor next-hop-self all functionality).

Conclusion #3: Customer will use IBGP in the DMVPN access network.


4  REDUNDANT DATA CENTER INTERNET CONNECTIVITY

IN THIS CHAPTER:

SIMPLIFIED TOPOLOGY
IP ADDRESSING AND ROUTING
DESIGN REQUIREMENTS
FAILURE SCENARIOS
SOLUTION OVERVIEW
LAYER-2 WAN BACKBONE
BENEFITS AND DRAWBACKS OF PROPOSED TECHNOLOGIES
IP ROUTING ACROSS LAYER-2 WAN BACKBONE
BGP ROUTING
OUTBOUND BGP PATH SELECTION
LAYER-3 WAN BACKBONE
BGP-BASED WAN BACKBONE
DEFAULT ROUTING IN LAYER-3 WAN BACKBONE
CONCLUSIONS


A large enterprise (the Customer) has two data centers linked with a fully redundant layer-3 Data Center Interconnect (DCI) using an unspecified transport technology. Each data center has two redundant Internet connections (see Figure 4-1 for details).

Figure 4-1: Redundant data centers and their internet connectivity

The customer would like to make the Internet connectivity totally redundant. For example: if both Internet connections from DC1 fail, the public IP prefix of DC1 should remain accessible through Internet connections of DC2 and the DCI link.


SIMPLIFIED TOPOLOGY

All critical components of a redundant data center design should be redundant, but it's sometimes easier to disregard the redundancy of the components not relevant to a particular portion of the overall design (our scenario: firewalls and DCI routers) to simplify the design discussions (see Figure 4-2).

Figure 4-2: Simplified topology with non-redundant internal components


Redundant firewalls are usually implemented as active/standby or active/active pairs with stateful failover and appear as a single logical device to the adjacent hosts or network devices. Redundant DCI switches could also be merged into a single logical device using technologies like VSS (Cisco), IRF (HP) or Virtual Chassis (Juniper). The simplified topology thus accurately represents many real-life deployment scenarios.

We'll further assume that the two sites do not have significant campus networks attached to them. The outbound traffic traversing the Internet links is thus generated solely by the servers (example: web hosting) and not by end-users surfing the Internet.

You can easily adapt the design to a mixed campus/data center design by modeling the campus networks as separate sites attached to the same firewalls or Internet edge LAN.

IP ADDRESSING AND ROUTING

Each data center has a distinct public IPv4 prefix configured on the outside (Internet-facing) LAN. Firewalls protecting the servers within the data center are connected to the Internet-facing LAN and perform NAT between public outside addresses and RFC 1918 inside addresses.

Internet edge routers connect the public LAN segment to the Internet, run BGP with the ISP's edge routers, and provide a single virtual exit point to the firewalls through a first-hop redundancy protocol (example: HSRP). At the moment there's no dynamic routing between the Internet edge routers and any other network devices.

Customer's DCI routers connect the internal data center networks and currently don't provide transit services. Static default routes pointing to the local firewall inside IP address are used on the data center core switches.
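As an illustration, the first-hop redundancy setup on one of the Internet edge routers might look roughly like the sketch below; the interface, addresses and group number are assumptions, not values taken from the actual design:

interface GigabitEthernet0/1
 description Outside (Internet-facing) LAN
 ip address 192.0.2.3 255.255.255.0
 standby 1 ip 192.0.2.1
 standby 1 priority 110
 standby 1 preempt

The firewalls would then use the HSRP virtual address (192.0.2.1 in this sketch) as their next hop toward the Internet.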

Figure 4-3: BGP sessions between Internet edge routers and the ISPs.


DESIGN REQUIREMENTS

A redundant Internet access solution must satisfy the following requirements:

  Resilient inbound traffic flow: both sites must advertise IP prefixes assigned to DC1 and DC2 to the Internet;

  No session loss: Failure of one or more Internet-facing links must not result in application session loss;

  Optimal inbound traffic flow: Traffic for IP addresses in one of the data centers should arrive over uplinks connected to the same data center; DCI link should be used only when absolutely necessary.

  Optimal outbound traffic flow: Outbound traffic must take the shortest path to the Internet; as above, DCI link should be used only when absolutely necessary.

  No blackholing: A single path failure (one or both Internet links on a single site, or one or more DCI links) should not cause traffic blackholing.

FAILURE SCENARIOS

This document describes a network that is designed to survive the following failures:

  Single device or link failure anywhere in the network;

  Total Internet connectivity failure in one data center;

  Total DCI link failure.


The design described in this document does not address a total data center failure; you’d need a manual or automatic failover mechanism addressing network, compute and storage components to achieve that goal.

SOLUTION OVERVIEW

We can meet all the design requirements by redesigning the Internet Edge layer of the corporate network to resemble a traditional Internet Service Provider design20. The Internet Edge layer of the new network should have:

  WAN backbone providing internal connectivity (see Figure 4-4);

  Edge or peering routers connecting the WAN backbone to Internet peers or upstream providers;

  Edge routers connecting sites to the WAN backbone. In our network, upstream links and site subnets connect to the same edge routers.

The missing component in the current Internet Edge layer is the WAN backbone. Assuming we have to rely on the existing WAN connectivity between DC1 and DC2, the DCI routers (D11 through D22) have to become part of the Internet Edge layer (outside) WAN backbone as shown in Figure 4-4.

20 For more details, watch the Redundant Data Center Internet Connectivity video – http://demo.ipspace.net/get/X1%20Redundant%20Data%20Center%20Internet%20Connectivity.mp4


Figure 4-4: Outside WAN backbone in the redesigned network

The outside WAN backbone can be built with any one of these technologies:

- Point-to-point Ethernet links or stretched VLANs between Internet edge routers. This solution requires layer-2 connectivity between the sites and is thus the least desirable option;
- GRE tunnels between Internet edge routers;
- Virtual device contexts on DCI routers that split them into multiple independent devices (example: Nexus 7000). A WAN backbone implemented in a virtual device context on a Nexus 7000 would require dedicated physical interfaces (additional inter-DC WAN links);
- VRFs on the DCI routers that implement another forwarding context for the outside WAN backbone.

Regardless of the technology used to implement the WAN backbone, all the proposed solutions fall into two major categories:

- Layer-2 solutions, where the DCI routers provide layer-2 connectivity between Internet edge routers, either in the form of point-to-point links between Internet edge routers or site-to-site VLAN extension. GRE tunnels between Internet edge routers are just a special case of a layer-2 solution that does not involve the DCI routers at all;
- Layer-3 solutions, where the DCI routers participate in the WAN backbone IP forwarding.

LAYER-2 WAN BACKBONE

Numerous technologies could be used to implement point-to-point links between Internet edge routers or site-to-site VLAN extensions across the layer-3 DCI link:

- Point-to-point links can be implemented with Ethernet-over-MPLS (EoMPLS) or L2TPv3 on the DCI routers. EoMPLS might have to be combined with MPLS-over-IP in scenarios where the DCI link cannot provide MPLS connectivity;
- Virtual Private LAN Service (VPLS) could be configured on the DCI routers (combined with MPLS-over-IP if needed) to provide site-to-site VLAN extension between Internet edge routers;
- Overlay Transport Virtualization (OTV) could be configured on the DCI routers to provide site-to-site VLAN extensions;
- GRE tunnels configured on Internet edge routers provide point-to-point links without involvement of the DCI routers.

All layer-2 tunneling technologies introduce additional encapsulation overhead and thus require an increased MTU on the path between the Internet edge routers (GRE tunnels) or the DCI routers (all other technologies), as one cannot rely on proper operation of Path MTU Discovery (PMTUD) across the public Internet21.

BENEFITS AND DRAWBACKS OF PROPOSED TECHNOLOGIES

Point-to-point links between Internet edge routers implemented with L2TPv3 or EoMPLS on the DCI routers introduce additional unnecessary interdependencies between the Internet edge routers, the DCI routers and the DCI links.

21 The never-ending story of IP fragmentation http://stack.nil.com/ipcorner/IP_Fragmentation/

Figure 4-5: Point-to-point Ethernet links implemented with EoMPLS on DCI routers

Consider the potential failure scenarios in the simple topology from Figure 4-5, where the fully redundant DCI backbone implements EoMPLS point-to-point links between Internet edge routers:

- Failure of DCI link #1 (or DCI router D11 or D21) causes the E1-to-E3 virtual link to fail;
- A subsequent failure of E2 or E4 results in a total failure of the WAN backbone, although alternate paths would still exist if the point-to-point links between Internet edge routers weren't so tightly coupled with the physical DCI components.

Site-to-site VLAN extensions are slightly better in that respect; well-designed fully redundant stretched VLANs (Figure 4-6) can decouple DCI failures from Internet edge failures.


Figure 4-6: Single stretched VLAN implemented with VPLS across L3 DCI

You could achieve the proper decoupling with a single WAN backbone VLAN that follows these rules:

- The VLAN connecting the Internet edge routers MUST be connected to all physical DCI devices (preventing a single DCI device failure from impacting the inter-site VLAN connectivity);
- Redundant independent DCI devices MUST use a rapidly converging protocol (example: rapid spanning tree) to elect the primary forwarding port connected to the WAN backbone VLAN. You could use multi-chassis link aggregation groups when the DCI devices appear as a single logical device (example: VSS, IRF, Virtual Chassis);
- Every DCI router MUST be able to use all DCI links to forward the WAN backbone VLAN traffic, or shut down its VLAN-facing port when its DCI WAN link fails.


Fully redundant stretched VLANs are hard to implement with today’s technologies22 (example: OTV supports a single transport interface on both NX-OS and IOS XE); it might be simpler to provide two non-redundant WAN backbone VLANs and connect Internet edge routers to both of them as shown in Figure 4-7 (a solution that cannot be applied to server subnets but works well for router-to-router links23).

Figure 4-7: Two non-redundant stretched VLANs provide sufficient end-to-end redundancy

22 See the Data Center Interconnects webinar for more details
23 The difference between Metro Ethernet and stretched data center subnets http://blog.ioshints.info/2012/07/the-difference-between-metro-ethernet.html

GRE tunnels established directly between Internet edge routers might be the simplest solution. They rely on IP transport provided by the DCI infrastructure and can use whatever path is available (keep in mind the increased MTU requirements). Their only drawback is a perceived security risk: traffic that has not been inspected by the firewalls is traversing the internal infrastructure.
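The following fragment is a minimal sketch of one such GRE tunnel on E1, terminating on E3; the tunnel addressing, source interface, destination address and MTU value are illustrative assumptions rather than part of the original design:

interface Tunnel13
 description GRE link E1-to-E3 riding over the internal DCI infrastructure
 ip address 172.16.13.1 255.255.255.252
 ip mtu 1400                        ! leave room for GRE+IP overhead; adjust to the real transport MTU
 tunnel source GigabitEthernet0/1   ! inside-facing interface of E1
 tunnel destination 10.255.0.3      ! inside-facing address of E3

The WAN backbone IGP and IBGP sessions described in the next section would then run across these tunnel interfaces.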

IP ROUTING ACROSS LAYER-2 WAN BACKBONE

Virtual links or stretched VLANs between Internet edge routers appear as point-to-point links (Figure 4-8) or LAN segments (Figure 4-9) to the IGP and BGP routing protocols running on Internet edge routers E1 through E4.

Figure 4-8: Virtual topology using point-to-point links


Figure 4-9: Virtual topology using stretched VLANs

The IP routing design of the WAN backbone should thus follow the well-known best practices used in Internet Service Provider networks (a configuration sketch follows the list):

- Configure a fast-converging IGP between the Internet edge routers;
- Run IBGP between the loopback interfaces of the Internet edge routers24;
- Configure a full mesh of IBGP sessions as shown in Figure 4-10 (it does not make sense to introduce route reflectors in a network with four BGP routers);

24 BGP Essentials: Configuring Internal BGP Sessions http://blog.ioshints.info/2008/01/bgp-essentials-configuring-internal-bgp.html

Figure 4-10: Full mesh of IBGP sessions between Internet edge routers

- Configure BGP community propagation on all IBGP sessions25;
- Use BGP next-hop-self on IBGP sessions to decouple the IBGP routes from external subnets;
- Use BFD26 or BGP next-hop tracking for fast failure detection27.

25 BGP Essentials: BGP Communities http://blog.ioshints.info/2008/02/bgp-essentials-bgp-communities.html
26 Bidirectional Forwarding Detection http://wiki.nil.com/Bidirectional_Forwarding_Detection_(BFD)
27 Fast BGP neighbor loss detection http://wiki.nil.com/Fast_BGP_neighbor_loss_detection
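A minimal Cisco IOS sketch of the guidelines above on one Internet edge router (E1), assuming AS 64500 and loopback addresses 10.0.0.1 through 10.0.0.4 for E1 through E4 (all values are illustrative):

router bgp 64500
 neighbor IBGP peer-group
 neighbor IBGP remote-as 64500
 neighbor IBGP update-source Loopback0
 neighbor IBGP next-hop-self
 neighbor IBGP send-community
 neighbor IBGP fall-over             ! drop the session as soon as the route to the peer is lost
 neighbor 10.0.0.2 peer-group IBGP   ! E2
 neighbor 10.0.0.3 peer-group IBGP   ! E3
 neighbor 10.0.0.4 peer-group IBGP   ! E4

BFD-based failure detection would use neighbor IBGP fall-over bfd instead (together with BFD timers on the underlying interfaces).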

BGP ROUTING

The design of BGP routing between the Internet edge routers should follow the usual non-transit autonomous system best practices:

- Every Internet edge router should advertise the directly connected public LAN prefix (on Cisco IOS, use the network statement, not route redistribution);
- Do not configure static routes to null 0 on the Internet edge routers; they should announce the public LAN prefix only when they can reach it28;
- Use BGP communities to tag the locally advertised BGP prefixes as belonging to DC1 or DC2 (on Cisco IOS, use the network statement with the route-map option29);
- Use outbound AS-path filters on EBGP sessions with the upstream ISPs to prevent transit route leakage across your autonomous system30;
- Use AS-path prepending31, Multi-Exit Discriminator (MED) or ISP-defined BGP communities for optimal inbound traffic flow (traffic destined for IP addresses in the public LAN of DC1 should arrive through DC1's uplinks if at all possible). Example: E3 and E4 should advertise prefixes from DC1 with multiple copies of the Customer's public AS number in the AS path;

- The BGP attributes of prefixes advertised to the upstream ISPs (longer AS path, MED or additional communities) should be based on the BGP communities attached to the advertised IP prefixes (see the configuration sketch below).

28 The road to complex designs is paved with great recipes http://blog.ioshints.info/2011/08/road-to-complex-designs-is-paved-with.html
29 BGP Essentials: Advertising Public IP Prefixes Into the Internet http://blog.ioshints.info/2008/01/bgp-essentials-advertising-public-ip.html
30 Non-transit Autonomous System Design http://wiki.nil.com/%28Non%29Transit_Autonomous_System
31 BGP Essentials: AS Path Prepending http://blog.ioshints.info/2008/02/bgp-essentials-as-path-prepending.html
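A minimal sketch of community-based prefix origination and prepending, shown for E1 (a DC1 router) and E3 (a DC2 router); the AS number, prefix, community values and neighbor address are illustrative assumptions, and the outbound filters that keep the AS non-transit are not shown:

! E1 (DC1): originate the DC1 public prefix and tag it with a DC1 community
route-map DC1-PREFIX permit 10
 set community 64500:1
!
router bgp 64500
 network 192.0.2.0 mask 255.255.255.0 route-map DC1-PREFIX
!
! E3 (DC2): prepend DC1 prefixes (community 64500:1) toward its upstream ISP
ip community-list standard DC1-PREFIXES permit 64500:1
!
route-map ISP-OUT permit 10
 match community DC1-PREFIXES
 set as-path prepend 64500 64500
route-map ISP-OUT permit 20          ! advertise everything else unchanged
!
router bgp 64500
 neighbor 198.51.100.1 route-map ISP-OUT out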

OUTBOUND BGP PATH SELECTION

The only deviation from the simplest BGP routing design is the outbound path selection specified in the Design Requirements section: Internet edge routers should always prefer paths received over uplinks in the same data center over paths received through the remote data center. You can implement the outbound path selection using one of these designs:

Full Internet routing with preferred local exit. All Internet edge routers propagate EBGP routes received from the upstream ISPs to all IBGP peers (you might want to reduce the number of EBGP routes to speed up the convergence32). Local preference is reduced on updates received from IBGP peers residing in the other data center (on Cisco IOS, use neighbor route-map; a configuration sketch follows the list of outcomes). This routing policy has the following outcomes:

- The same BGP path is received from an EBGP peer and a local IBGP peer: the EBGP path is preferred over the IBGP path;
- Different BGP paths to the same IP prefix are received from an EBGP peer and a local IBGP peer: both paths have the same BGP local preference; other BGP attributes (starting with AS path length) are used to select the best path;
- The same BGP path is received from a local IBGP peer and an IBGP peer from the other data center: the path received from the local IBGP peer has higher local preference and is thus always preferred;

- Different BGP paths to the same IP prefix are received from a local IBGP peer and an IBGP peer from the other data center: the path received from the local IBGP peer has higher local preference; other BGP attributes are ignored in the BGP path selection process – the path received from the local IBGP peer is always preferred.

32 BGP Convergence Optimization, a case study in the Webinar Management System
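A minimal sketch of the local-preference reduction on E1, assuming E3 and E4 (loopbacks 10.0.0.3 and 10.0.0.4) are the IBGP peers in the other data center; addresses and values are illustrative:

route-map REMOTE-DC-IN permit 10
 set local-preference 90             ! below the default of 100, so locally-received paths always win
!
router bgp 64500
 neighbor 10.0.0.3 route-map REMOTE-DC-IN in
 neighbor 10.0.0.4 route-map REMOTE-DC-IN in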

In all cases, the outbound traffic uses a local uplink, assuming at least one of the upstream ISPs advertised a BGP path to the destination IP address.

Default routing between data centers. Redundant BGP paths received from EBGP and IBGP peers increase the memory requirements of the Internet edge routers and slow down the convergence process. You might want to reduce the BGP table sizes on the Internet edge routers by replacing the full IBGP routing exchange between data centers with default routing:

- Internet edge routers in the same data center exchange all BGP prefixes received from EBGP peers to ensure optimal outbound traffic flow based on the information received from the upstream ISPs;
- Only the default route and the locally advertised prefixes (BGP prefixes with an empty AS path) are exchanged between IBGP peers residing in different data centers.

With this routing policy, the Internet edge routers always use the local uplinks for outbound traffic and fall back to the default route received from the other data center only when there is no local path to the destination IP address.

Two-way default routing between data centers might result in packet forwarding loops. If at all possible, request default route origination from the upstream ISPs and propagate only the ISP-generated default routes to IBGP peers.


LAYER-3 WAN BACKBONE

In a layer-3 WAN backbone, the DCI routers participate in Internet edge IP forwarding. This design is more resilient and somewhat simpler than the layer-2 WAN backbone (it does not require stretched VLANs or point-to-point Ethernet links between Internet edge routers). On the other hand, the need to place the same DCI links in multiple security zones and to share the physical DCI devices between the internal and external WAN backbones introduces additional complexity.

There are two well-known technologies that can be used to implement multiple independent layer-3 forwarding domains (and thus security zones) on a single device: virtual device contexts (VDC) and virtual routing and forwarding tables (VRF).

Virtual Device Contexts would be an ideal solution if they could share the same physical link. Unfortunately, no current VDC implementation supports this requirement – device contexts are associated with physical interfaces, not VLANs. VDCs would thus require additional DCI links (or lambdas in a WDM-based DCI infrastructure) and are clearly an inappropriate solution in most environments.


Figure 4-11: Virtual Device Contexts: dedicated management planes and physical interfaces


Virtual Routing and Forwarding tables seem to be a better fit; most devices suitable for data center edge deployment support them, and it’s relatively easy to associate a VRF with physical interfaces or VLAN (sub)interfaces.

Figure 4-12: Virtual Routing and Forwarding tables: shared management, shared physical interfaces

BGP-BASED WAN BACKBONE

Typical scalable ISP core networks transport core routes in a core IGP (OSPF or IS-IS), and customer (or edge) and transit routes in BGP.


All routers in a layer-3 backbone must have consistent forwarding behavior. This requirement can be met by all core routers running IGP+BGP (and participating in full Internet routing), or by core routers running IGP+MPLS and providing label switched paths between BGP-running edge routers (BGP-free core).

The DCI routers in the WAN backbone should thus either participate in the IBGP mesh (acting as BGP route reflectors to reduce the IBGP mesh size, see Figure 4-13) or provide MPLS transport between Internet edge routers as shown in Figure 4-14.

Figure 4-13: BGP core in WAN backbone


Figure 4-14: MPLS core in WAN backbone

Both designs are easy to implement on dedicated high-end routers or within a separate VDC on a Nexus 7000; the VRF-based implementations are way more complex:

- Many MPLS- or VRF-enabled devices do not support IBGP sessions within a VRF; only EBGP sessions are allowed between PE- and CE-routers (Junos supports IBGP in a VRF in recent software releases). In an MPLS/VPN deployment, the DCI routers would have to be in a private AS inserted between two disjoint parts of the existing public AS. A multi-VRF or EVN deployment would be even worse: each DCI router would have to be in its own autonomous system;
- MPLS transport within a VRF requires support for the Carrier's Carrier (CsC) architecture; at the very minimum, the DCI routers would have to run Label Distribution Protocol (LDP) within a VRF.


While both designs could be implemented on numerous data center edge platforms (including Cisco 7200, Cisco 7600 and Juniper MX-series routers), they rely on technologies not commonly used in data center environments and might thus represent a significant deployment and operational challenge.

DEFAULT ROUTING IN LAYER-3 WAN BACKBONE

We can adapt the default routing between data centers design described in the Outbound BGP Path Selection section to implement a layer-3 WAN backbone with technologies commonly used in enterprise data center environments.

The IGP part of this design is trivial: an IGP with VRF support (OSPF is preferred over EIGRP due to its default routing features) is run between the Internet edge routers and the VRF instances of the DCI routers. Simple VLAN-based VRFs (not MPLS/VPN) or Easy Virtual Networking (EVN) is used between the DCI routers to implement end-to-end WAN backbone connectivity (see the VRF sketch after the following list).

The BGP part of this design is almost identical to the BGP Routing design with a few minor modifications:

- The DCI routers do not participate in BGP routing;
- IBGP sessions between routers in different data centers are used solely to propagate locally-originated routes. No external BGP routes are exchanged between data centers;
- The default route is not advertised in IBGP sessions but in the IGP.
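A minimal VRF-lite sketch of the outside WAN backbone on a DCI router; the VRF name, VLAN number, OSPF process number and addressing are illustrative assumptions (an EVN-based implementation would use vnet trunks instead):

ip vrf OUTSIDE
 rd 64500:100
!
interface Vlan100
 description Outside WAN backbone toward the Internet edge routers
 ip vrf forwarding OUTSIDE
 ip address 172.16.100.1 255.255.255.0
!
router ospf 100 vrf OUTSIDE
 network 172.16.100.0 0.0.0.255 area 0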


Figure 4-15: Default routing in WAN backbone

IBGP sessions between data centers could also be replaced with local prefix origination – all Internet edge routers in both data centers would advertise the public LAN prefixes from all data centers (using a route-map or similar mechanism to set BGP communities), some of them based on connected interfaces, others based on IGP information.

Outbound traffic forwarding in this design is based on default routes advertised by all Internet edge routers. An Internet edge router should advertise a default route only when its Internet uplink (and the corresponding EBGP session) is operational, to prevent suboptimal traffic flow or blackholing. Local default routes should be preferred over default routes advertised from other data centers to ensure optimal outbound traffic flow.

The following guidelines could be used to implement this design with OSPF on Cisco IOS (a configuration sketch follows the list):

- Internet edge routers that receive a default route from the upstream ISPs through their EBGP session should be configured with default-information originate33 (the default route is originated only when another non-OSPF default route is already present in the routing table);
- Internet edge routers participating in the default-free zone (full Internet routing with no default routes) should advertise a default route when they receive at least some well-known prefixes (example: root DNS servers) from the upstream ISP34. Use the default-information originate always route-map configuration command and use the route map to match the well-known prefixes;
- Use external type-1 default routes to ensure the DCI routers prefer locally-originated default routes (even when they have unequal costs to facilitate primary/backup exit points) over default routes advertised from edge routers in the other data center.

33 See OSPF Default Mysteries for more details http://stack.nil.com/ipcorner/OSPFDefaultMysteries/
34 Conditional OSPF default route origination based on classless IP prefixes http://wiki.nil.com/Conditional_OSPF_default_route_origination_based_on_classless_IP_prefixes
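A minimal sketch of the conditional type-1 default origination on an Internet edge router in the default-free zone; the OSPF process number, the metric and the well-known prefix used as a reachability test are illustrative assumptions:

ip prefix-list WELL-KNOWN permit 198.41.0.0/24          ! example marker prefix expected from the upstream ISP
!
route-map DEFAULT-IF-UPSTREAM permit 10
 match ip address prefix-list WELL-KNOWN
!
router ospf 1
 default-information originate always metric 100 metric-type 1 route-map DEFAULT-IF-UPSTREAM
 ! backup exit points would use a higher metric (example: 1000)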

CONCLUSIONS

A network with multiple data centers and requirements for seamless failover following any combination of link/device failures must have an external WAN backbone (similar to Internet Service Provider core networks), with individual data centers and other sites connected to the backbone via firewalls or other intermediate devices.

In most cases the external WAN backbone has to share WAN links and physical devices with the internal data center interconnect links, while still maintaining strict separation of security zones and forwarding planes. The external WAN backbone could be implemented as either a layer-2 backbone (using layer-2 tunneling mechanisms on the DCI routers) or a layer-3 backbone (with the DCI routers participating in WAN backbone IP forwarding).

Numerous technologies could be used to implement the external WAN backbone, with the following ones being the least complex from the standpoint of a typical enterprise data center networking engineer:

- Point-to-point GRE tunnels between Internet edge routers;
- A WAN backbone implemented in a separate VRF on the DCI routers, with default routing used for outbound traffic forwarding.


5  EXTERNAL ROUTING WITH LAYER-2 DATA CENTER INTERCONNECT

IN THIS CHAPTER:

IP ADDRESSING AND ROUTING
REDUNDANCY REMOVED TO SIMPLIFY DESIGN DISCUSSIONS
DESIGN REQUIREMENTS
FAILURE SCENARIOS
SOLUTION OVERVIEW
DETAILED SOLUTION – OSPF
FAILURE ANALYSIS
DETAILED SOLUTION – INTERNET ROUTING WITH BGP
PREFIX ORIGINATION
NEXT-HOP PROCESSING
BGP TABLE DURING NORMAL OPERATIONS
EBGP ROUTE ADVERTISEMENTS
FAILURE ANALYSIS
CONCLUSIONS


ACME Enterprises has two data centers linked with a layer-2 Data Center Interconnect (DCI) implemented with Cisco's Overlay Transport Virtualization (OTV). Each data center has connections to the Internet and to the enterprise WAN network connecting the data centers with remote offices (see Figure 5-1 for details). The enterprise WAN network is implemented with MPLS/VPN services.

Figure 5-1: Redundant data centers and their internet connectivity

Layer-2 DCI was used to avoid IP renumbering in VM mobility and disaster recovery scenarios. Occasional live migration between data centers is used during maintenance and hardware upgrade operations.


IP ADDRESSING AND ROUTING

Numerous IPv4 and IPv6 subnets in different security zones are used within the two data centers. Even though the data centers operate in active-active mode, individual applications typically don't span both data centers for performance reasons. Every IPv4 and IPv6 subnet thus has a primary and a backup data center.

Figure 5-2: IP addressing and routing with external networks


ACME uses OSPF within its MPLS/VPN network and BGP with upstream Internet Service Providers (ISPs).

REDUNDANCY REMOVED TO SIMPLIFY DESIGN DISCUSSIONS

All critical components of a highly available data center design should be redundant, but it's sometimes easier to disregard the redundancy of the components not relevant to a particular portion of the overall design to simplify the design discussions. We'll assume none of the components or external links are redundant (see Figure 5-3), but it's relatively simple to extend a layer-3 design with redundant components.


Figure 5-3: Simplified topology with non-redundant components

Redundant firewalls are usually implemented as active/standby or active/active pairs with stateful failover and appear as a single logical device to the adjacent hosts or network devices. Redundant DCI switches could also be merged into a single logical device using technologies like VSS (Cisco), IRF (HP) or Virtual Chassis (Juniper).


DESIGN REQUIREMENTS

Layer-2 DCI is the least desirable data center interconnect solution35, as it extends a single broadcast domain (and thus a single failure domain) across multiple sites, turning them into a single availability zone36. Furthermore, a DCI link failure might result in a split-brain scenario where both sites advertise the same IP subnet, resulting in misrouted (and thus black-holed) traffic37.

External routing between the two data centers and both the Internet and the enterprise WAN (MPLS/VPN) network should thus ensure that:

- Every data center subnet remains reachable after a single link or device failure;
- A DCI link failure does not result in a split-brain scenario with traffic for the same subnet being sent to both data centers;
- The backup data center (for a particular VLAN/subnet) advertises the subnet after the primary data center fails.

35 See the Data Center Interconnects webinar for more details
36 Layer-2 is a single failure domain http://blog.ioshints.info/2012/05/layer-2-network-is-single-failure.html
37 The difference between Metro Ethernet and stretched data center subnets http://blog.ioshints.info/2012/07/the-difference-between-metro-ethernet.html

FAILURE SCENARIOS

The design described in this document should provide uninterrupted external connectivity under the following conditions:

- Single device or link failure anywhere in the data center network edge;
- Total external connectivity failure in one data center;
- Total DCI link failure;
- Total data center failure. Even though the network design provides an automatic failover mechanism on data center failure, you might still need manual procedures to move active storage units or to migrate VM workloads following a total data center failure.

Stateful devices (firewalls, load balancers) are not included in this design. Each stateful device partitions the data center network into two (or more) independent components. You can apply the mechanisms described in this document to the individual networks; migration of stateful devices following a data center failure is out of scope.

SOLUTION OVERVIEW

External data center routing seems to be a simple primary/backup design scenario (more details in Figure 5-4):

- The primary data center advertises a subnet with low cost (when using BGP, the cost might be AS-path length or the multi-exit discriminator attribute);
- The backup data center advertises the same subnet with high cost – even if the DCI link fails, every external router ignores the prefix advertised by the backup data center due to its higher cost.

Figure 5-4: Primary/backup external routing

The primary/backup approach based on routing protocol costs works reasonably well in the enterprise WAN network where ACME controls the routing policies, but fails in the generic Internet environment, where ACME cannot control the routing policies implemented by the upstream ISPs, and where every ISP might use its own (sometimes even undocumented) routing policy.

For example, an upstream ISP might strictly prefer prefixes received from its customers over prefixes received from other autonomous systems (peers or upstream ISPs); such an ISP would set local preference on BGP paths received from its customers, making AS path length irrelevant. A routing policy that unconditionally prefers customer prefixes might prevent a straightforward implementation of the primary/backup scenario based on routing protocol cost (example: AS path length).

The only reliable mechanism to implement primary/backup path selection that does not rely on ISP routing policies is conditional route advertisement – BGP routers in the backup data center should not advertise prefixes from the primary data center unless the primary data center fails or all its WAN connections fail.

To further complicate the design, BGP routers in the backup data center (for a specific subnet) shall not advertise the prefixes currently active in the primary data center when the DCI link fails. Data center edge routers thus have to employ mechanisms similar to those used by data center switches with a shared control plane (example: Cisco's VSS or HP's IRF): they have to detect the split-brain scenario by exchanging keepalive messages across the external network. When the backup router (for a particular subnet) cannot reach the primary router through the DCI link but still reaches it across the external network, it must enter an isolation state (stop advertising the backup prefix).

You can implement these requirements using the neighbor advertise-map functionality available in Cisco IOS in combination with IP SLA-generated routes (to test external reachability of the other data center), with Embedded Event Manager (EEM) triggers, or with judicious use of parallel IBGP sessions (described in the Detailed Solution – Internet Routing With BGP section).


DETAILED SOLUTION – OSPF

Data center subnets could be advertised into the OSPF routing protocol used by the ACME enterprise WAN network as:

- Intra-area routes;
- Inter-area routes;
- External OSPF routes.

External type-2 (E2) OSPF routes are the only type of OSPF routes where the internal cost (OSPF cost toward the advertising router) does not affect the route selection process. It's thus advisable to advertise data center subnets as E2 OSPF routes. The external route cost should be set to a low value (example: 100) on data center routers advertising the primary subnet and to a high value (example: 1000) on data center routers advertising the backup subnet, as shown in Figure 5-5 and in the configuration sketch following it:


Figure 5-5: OSPF routing used in enterprise WAN network
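A minimal Cisco IOS sketch for the routers in the primary data center (the backup data center would use default-metric 1000); the OSPF process number, prefix, route-map and prefix-list names are illustrative assumptions:

ip prefix-list DC-PUBLIC-SUBNETS permit 192.0.2.0/24    ! data center subnet(s) to be advertised
!
route-map DC-SUBNETS permit 10
 match ip address prefix-list DC-PUBLIC-SUBNETS
!
router ospf 1
 redistribute connected subnets route-map DC-SUBNETS    ! advertised as E2 routes (the default external type)
 default-metric 100                                     ! 1000 on the backup data center routers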

FAILURE ANALYSIS

Consider the following failure scenarios (assuming DC-A is the primary data center and DC-B the backup one):

- DC-A WAN link failure: DC-B is still advertising the subnet into the enterprise WAN (although with higher cost). Traffic from DC-A flows across the DCI link, which might be suboptimal. Performance problems might trigger evacuation of DC-A, but applications running in DC-A remain reachable throughout the failure period.
- DC-B WAN link failure: does not affect the OSPF routing (the prefix advertised by DC-B was too expensive anyway).
- DCI link failure: does not affect the applications running in DC-A. VMs residing in DC-B (within the backup part of the shared subnet) will be cut off from the rest of the network.

MIGRATION SCENARIOS

Use the following procedures when performing a controlled migration from DC-A to DC-B:

- DC evacuation (primary to backup): migrate the VMs, then decrease the default-metric on the DC-B routers (making DC-B the primary data center for the shared subnet). The reduced cost of the prefix advertised by DC-B will cause routers in the enterprise WAN network to prefer the path through DC-B. Shut down DC-A.
- DC restoration (backup to primary): connect DC-A to the WAN networks (the cost of prefixes redistributed into OSPF in DC-A is still higher than the OSPF cost advertised from DC-B). Migrate the VMs, then increase the default-metric on the routers in DC-B. The prefix advertised by DC-A will take over.


DETAILED SOLUTION – INTERNET ROUTING WITH BGP

The BGP prefix origination toward the public Internet will be solved with a relatively simple design that uses additional IBGP sessions established between the external addresses of the data center edge routers (Figure 5-6):

- Regular IBGP sessions are established between the data center edge routers (potentially in combination with the external WAN backbone described in the Redundant Data Center Internet Connectivity document38). These IBGP sessions could be configured between loopback or internal LAN interfaces39;
- Additional IBGP sessions are established between the external (ISP-assigned) IP addresses of the data center edge routers. The endpoints of these IBGP sessions shall not be advertised in the internal routing protocols, to ensure the IBGP sessions always traverse the public Internet.

38 Available as a case study in the ipSpace.net Webinar Management System
39 BGP Essentials: Configuring Internal BGP Sessions http://blog.ioshints.info/2008/01/bgp-essentials-configuring-internal-bgp.html

Figure 5-6: EBGP and IBGP sessions on data center edge routers

IBGP sessions established across the public Internet should be encrypted. If you cannot configure an IPsec session between the BGP routers, use MD5 authentication to prevent man-in-the-middle or denial-of-service attacks.
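A minimal sketch of the Internet-traversing IBGP session on a DC-A edge router; the AS number, the ISP-assigned address of the DC-B edge router, the interface name and the password are illustrative assumptions:

router bgp 64500
 ! additional IBGP session toward the DC-B edge router across the public Internet
 neighbor 198.51.100.10 remote-as 64500
 neighbor 198.51.100.10 update-source GigabitEthernet0/0   ! ISP-facing (external) interface
 neighbor 198.51.100.10 password SharedIbgpSecret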


PREFIX ORIGINATION

With all BGP routers advertising the same prefixes, we have to use BGP local preference to select the best prefix:

- BGP prefixes advertised by routers in the primary data center have the default local preference (100);
- BGP prefixes advertised by routers in the backup data center have a lower local preference (50). The routers advertising backup prefixes (with a network or redistribute router configuration command) shall also set the BGP weight to zero to make the locally-originated prefixes comparable to other IBGP prefixes.

Furthermore, prefixes with the default local preference (100) shall get a higher local preference (200) when received over the Internet-traversing IBGP session, as shown in Figure 5-7 and in the configuration sketch following it:


Figure 5-7: BGP local preference in prefix origination and propagation
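A minimal sketch of the prefix origination on a backup data center edge router and of the local-preference bump on its Internet-traversing IBGP session. The document does not prescribe how to recognize primary-originated prefixes; this sketch assumes the primary data center tags them with community 64500:100, and all addresses, AS and community values are illustrative:

! Originate the shared subnet as the backup data center
route-map BACKUP-ORIGINATE permit 10
 set local-preference 50
 set weight 0                          ! make the locally-originated prefix comparable to IBGP prefixes
!
! Inbound policy on the Internet-traversing IBGP session
ip community-list standard PRIMARY-ORIGIN permit 64500:100
!
route-map INET-IBGP-IN permit 10
 match community PRIMARY-ORIGIN
 set local-preference 200
 set community 64500:200 additive      ! mark paths received across the Internet-traversing session
route-map INET-IBGP-IN permit 20       ! leave everything else unchanged
!
router bgp 64500
 network 192.0.2.0 mask 255.255.255.0 route-map BACKUP-ORIGINATE
 neighbor 198.51.100.10 route-map INET-IBGP-IN in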

NEXT-HOP PROCESSING

IBGP sessions between ISP-assigned IP addresses shall not influence the actual packet forwarding. The BGP next hop advertised over these sessions must be identical to the BGP next hop advertised over the DCI-traversing IBGP sessions.

Default BGP next-hop processing might set the BGP next hop of locally-originated directly connected prefixes to the local IP address of the IBGP session (the BGP next hop of routes redistributed into BGP from other routing protocols is usually set to the next hop provided by the source routing protocol).

If the BGP origination process does not set the BGP next hop (the BGP next hop of locally originated prefixes equals 0.0.0.0), you must set the value of the BGP next hop to one of the internal IP addresses of the BGP router (loopback or internal LAN IP address). Use the set ip next-hop command in a route-map attached to the network or redistribute router configuration command. You might also change the BGP next hop with an outbound route-map applied to the Internet-traversing IBGP session, as shown in Figure 5-8 and in the sketch following it:

Figure 5-8: BGP next hop processing
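A minimal sketch of the next-hop adjustment, reusing the hypothetical route-map names from the previous sketch; the loopback address is an illustrative assumption, and whether you pin the next hop at origination or with an outbound route-map on the Internet-traversing session is a matter of preference:

! Option 1: add a next-hop statement to the origination route map
route-map BACKUP-ORIGINATE permit 10
 set ip next-hop 10.1.0.1              ! loopback (or internal LAN) address of this edge router
!
! Option 2: rewrite the next hop on the Internet-traversing IBGP session
route-map INET-IBGP-OUT permit 10
 set ip next-hop 10.1.0.1
!
router bgp 64500
 neighbor 198.51.100.10 route-map INET-IBGP-OUT out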


BGP TABLE DURING NORMAL OPERATIONS

BGP routers advertising a primary subnet might have several copies of the primary subnet in their BGP table:

- The locally-originated prefix with local preference 100;
- Prefixes received from the backup data center with local preference 50.

After BGP converges, the prefixes originated in the backup data center (for a specific subnet) should no longer be visible in the BGP tables of routers in the primary data center; routers in the backup data center should revoke them due to their lower local preference.

BGP routers in the backup data center should have three copies of the primary subnet in their BGP table:

- The locally-originated prefix with local preference 50;
- The prefix received from the primary data center over the DCI-traversing IBGP session with local preference 100;
- The prefix received from the primary data center over the Internet-traversing IBGP session with local preference 200 (the local preference is set to 200 by the receiving router).

BGP routers in the backup data center should thus prefer the prefixes received over the Internet-traversing IBGP session. As these prefixes have the same next hop as the prefixes received over the DCI-traversing IBGP session (the internal LAN or loopback interface of the data center edge routers), the actual packet forwarding is not changed.


EBGP ROUTE ADVERTISEMENTS

Data center edge routers advertise the best route from their BGP table to external BGP peers. The best route might be:

- A locally-originated prefix. The router is obviously the best source of routing information – either because it's the primary router for the subnet, or because the primary data center cannot be reached through either the DCI link or the Internet.
- An IBGP prefix with local preference 100. A prefix with this local preference can only be received from the primary data center (for the prefix) over the DCI-traversing IBGP session. The lack of a better path (with local preference 200) indicates a failure of the Internet-traversing IBGP session, probably caused by an Internet link failure in the primary data center. The prefix should be advertised with a prepended AS path40.
- An IBGP prefix with local preference 200. The prefix was received from the primary data center (for the prefix) through the Internet-traversing IBGP session, indicating a primary data center with fully operational Internet connectivity. The prefix must not be advertised to EBGP peers, as it's already advertised by the primary data center BGP routers.

Summary of the outbound EBGP route map:

- Locally-originated prefix: advertise;
- IBGP prefix with local preference 100: advertise;
- IBGP prefix with local preference 200: drop.

40 BGP Essentials: AS Path Prepending http://blog.ioshints.info/2008/02/bgp-essentials-as-path-prepending.html

BGP communities41 might be used to ease the differentiation between locally-originated and other IBGP prefixes.
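A minimal sketch of the outbound EBGP route map, assuming locally-originated prefixes are tagged with community 64500:10 at origination and prefixes received over the Internet-traversing IBGP session carry community 64500:200 (as in the earlier origination sketch); the AS number and neighbor address are illustrative, and the filters that keep the AS non-transit are not shown:

ip community-list standard LOCAL-ORIGIN permit 64500:10
ip community-list standard VIA-INET-IBGP permit 64500:200
!
route-map EBGP-OUT deny 10
 match community VIA-INET-IBGP        ! local preference 200: the primary DC is advertising these itself
route-map EBGP-OUT permit 20
 match community LOCAL-ORIGIN         ! locally-originated prefixes: advertise as-is
route-map EBGP-OUT permit 30
 set as-path prepend 64500 64500      ! local preference 100: learned over the DCI-traversing session
!
router bgp 64500
 neighbor 203.0.113.1 route-map EBGP-OUT out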

FAILURE ANALYSIS

Assume DC-A is the primary data center for a given prefix.

- DC-A Internet link failure: the Internet-traversing IBGP session fails. BGP routers in DC-B start advertising the prefix from DC-A (its local preference has dropped to 100 due to the IBGP session failure).
- DC-A BGP router failure: BGP routers in DC-B lose all prefixes from DC-A and start advertising the locally-originated prefixes for the shared subnet.
- DCI failure: the Internet-traversing IBGP session is still operational. BGP routers in DC-B do not advertise prefixes from DC-A. No traffic is attracted to DC-B.
- Total DC-A failure: all IBGP sessions between DC-A and DC-B are lost. BGP routers in DC-B advertise the local prefixes, attracting user traffic toward servers started in DC-B during the disaster recovery procedures.
- End-to-end Internet connectivity failure: the Internet-traversing IBGP session fails. BGP routers in DC-B start advertising prefixes received over the DCI-traversing IBGP session with a prepended AS path. Traffic for a subnet currently belonging to DC-A might be received by DC-B but will still be delivered to the destination host as long as the DCI link is operational.

41 BGP Essentials: BGP Communities http://blog.ioshints.info/2008/02/bgp-essentials-bgp-communities.html

- EBGP session failure in DC-A: prefixes from DC-A will not be advertised by either DC-A (because the EBGP session is gone) or DC-B (because the Internet-traversing IBGP session is still operational). You might use neighbor advertise-map on the Internet-traversing IBGP session to ensure prefixes are sent over that session only if the local BGP table contains external prefixes (indicating an operational EBGP session); a sketch follows below. If your routers support Bidirectional Forwarding Detection (BFD)42 over IBGP sessions, use it to speed up the convergence process.
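A minimal sketch of such a conditional advertisement on a DC-A edge router, assuming a well-known prefix that is only present while the EBGP session is up is used as the existence test; prefix values, map names and the neighbor address are illustrative assumptions:

ip prefix-list DC-A-PREFIXES permit 192.0.2.0/24        ! prefixes to advertise conditionally
ip prefix-list UPSTREAM-MARKER permit 198.41.0.0/24     ! prefix expected from the upstream ISP
!
route-map ADVERTISE permit 10
 match ip address prefix-list DC-A-PREFIXES
route-map EXIST permit 10
 match ip address prefix-list UPSTREAM-MARKER
!
router bgp 64500
 neighbor 198.51.100.10 advertise-map ADVERTISE exist-map EXIST   ! Internet-traversing IBGP peer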

42 Bidirectional Forwarding Detection http://wiki.nil.com/Bidirectional_Forwarding_Detection_(BFD)

MIGRATION SCENARIOS

Use the following procedures when performing a controlled migration from DC-A to DC-B:

- DC evacuation (primary to backup): migrate the VMs, then decrease the default local preference on the DC-A routers to 40. Even though these prefixes will be received over the Internet-traversing IBGP session by BGP routers in DC-B, their local preference will not be increased. Prefixes originated by DC-B will thus become the best prefixes and will be advertised by both data centers. Complete the evacuation by shutting down the EBGP sessions in DC-A. Shut down DC-A.
- DC restoration (backup to primary): connect DC-A to the WAN networks, enable the EBGP sessions, and change the default local preference back to 100. After IBGP convergence completes, DC-B stops advertising prefixes from DC-A.

CONCLUSIONS

Optimal external routing that avoids split-brain scenarios is relatively easy to implement in WAN networks with a consistent routing policy: advertise each subnet with low cost (or a shorter AS path or a lower multi-exit discriminator value in BGP-based networks) from the primary data center (for that subnet) and with higher cost from the backup data center. In well-designed active-active data center deployments each data center acts as the primary data center for the subset of prefixes used by applications running in that data center.

Optimal external routing toward the Internet is harder to implement due to potentially inconsistent routing policies used by individual ISPs. The only solution is tightly controlled conditional route advertisement: routers in the backup data center (for a specific prefix) should not advertise the prefix as long as the primary data center retains its Internet connectivity. This requirement could be implemented with numerous scripting mechanisms available in modern routers; this document presented a cleaner solution that relies exclusively on standard BGP mechanisms available in most modern BGP implementations.


6  DESIGNING A PRIVATE CLOUD NETWORK INFRASTRUCTURE

IN THIS CHAPTER:

COLLECT THE REQUIREMENTS
PRIVATE CLOUD PLANNING AND DESIGN PROCESS
DESIGN DECISIONS
DESIGNING THE NETWORK INFRASTRUCTURE
MANAGEMENT NETWORK
CONCLUSIONS


The data center networking team in a large enterprise (the Customer) has been tasked with building the network infrastructure for a new private cloud deployment. They approached numerous vendors trying to figure out how the new network should look, and got thoroughly confused by all the data center fabric offerings, from FabricPath (Cisco) and VCS Fabric (Brocade) to Virtual Chassis Fabric (Juniper), QFabric (Juniper) and more traditional leaf-and-spine architectures (Arista). Should they build a layer-2 fabric, a layer-3 fabric or a leaf-and-spine fabric?

COLLECT THE REQUIREMENTS

Talking with vendors without knowing the actual network requirements is a waste of time – the networking team can start designing the network when they collect (at least) the following requirements:

- End-to-end connectivity requirements (L2 or L3 connectivity between edge ports);
- Required services (IP transport, lossless IP transport and/or FCoE transport);
- Total number of edge ports (GE/10GE/FC/FCoE);
- Total north-south (traffic leaving the data center) and east-west (inter-server traffic) bandwidth.


These requirements can only be gathered after the target workload has been estimated in terms of bandwidth, number of servers and number of tenants. Long-term average and peak statistics of existing virtualized or physical workload behavior are usually a good initial estimate of the target workload. The Customer has collected these statistics using VMware vCenter Operations Manager:

Category                              Collected values
VM, CPU, RAM and storage statistics   20 hosts
                                      80 cores (@ 50% CPU)
                                      1 TB RAM (@ 82% utilization)
                                      40 TB of storage
                                      500 VMs
Average VM requirements               0.2 core per VM (allocated)
                                      0.08 core per VM (actual usage)
                                      2 GB of RAM per VM
                                      80 GB of disk per VM
Bandwidth and IOPS statistics         250 IOPS per vSphere host
                                      7 MBps (~ 60 Mbps) of storage traffic per host
                                      2 MBps per VM (less than 20 Mbps per VM)


The Customer is expecting reasonably fast growth in the workload and thus decided to build a cloud infrastructure that will eventually support a workload up to 5 times larger. They have also increased the expected average VM requirements.

Category                   Target workload
Average VM requirements    0.3 core per VM
                           4 GB of RAM per VM
                           200 GB of disk per VM
                           20 IOPS per VM
                           50 Mbps of storage traffic per VM
                           50 Mbps of network traffic per VM
Workload size              2500 VMs

The total workload requirements are thus:

Parameter                     Value
CPU cores                     750
RAM                           10 TB
IOPS                          50,000
Storage bandwidth             125 Gbps
Total network bandwidth       125 Gbps
External network bandwidth    2 * 10GE WAN uplinks

PRIVATE CLOUD PLANNING AND DESIGN PROCESS

Planning and design of a new (private or public) cloud infrastructure should follow these logical steps:

- Define the services offered by the cloud. Major decision points include IaaS versus PaaS and simple hosting versus support for complex application stacks43;
- Select the orchestration system (OpenStack, CloudStack, vCloud Director…) that will allow the customers to deploy these services;
- Select a hypervisor supported by the selected orchestration system that has the desired features (example: high availability);
- Select optimal server hardware based on the workload requirements;
- Select the network services implementation (physical or virtual firewalls and load balancers);
- Select the virtual networking implementation (VLANs or overlay virtual networks);

- Design the network infrastructure based on the previous selections.

Every step in the above process requires a separate design; those designs are not covered in this document, as we only need their results in the network infrastructure design phase.

43 Does it make sense to build new clouds with overlay networks? http://blog.ipspace.net/2013/12/does-it-make-sense-to-build-new-clouds.html

DESIGN DECISIONS The Customer’s private cloud infrastructure will use vCloud Automation Center and vSphere hypervisors. The server team decided to use the Nutanix NX-3050 servers with the following specifications: Parameter

Value

CPU cores

16 cores

RAM

256 GB

IOPS

6000

Connectivity

2 * 10GE uplink


The target workload can be placed on 50 NX-3050 servers (based on the number of CPU cores). Those servers would have 12,800 GB of RAM (enough), 1 Tbps of network bandwidth (more than enough) and 300,000 IOPS (more than enough).

Switch vendors use "marketing math" – they count ingress and egress bandwidth on every switch port. The Nutanix server farm would have 2 Tbps of total network bandwidth using that approach.

The private cloud will use a combination of physical (external firewall) and virtual (per-application firewalls and load balancers) network services44. The physical firewall services will be implemented on two devices in active/backup configuration (two 10GE ports each); virtual services will be run on a separate cluster45 of four hypervisor hosts, for a total of 54 servers.

The number of network segments in the private cloud will be relatively low. VLANs will be used to implement the network segments; the network infrastructure thus has to provide layer-2 connectivity between any two endpoints.

This decision effectively turns the whole private cloud infrastructure into a single failure domain. Overlay virtual networks would be a more stable alternative (from the network perspective), but are not considered a mature enough technology by more conservative cloud infrastructure designers.
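A quick sanity check of the sizing, using only the numbers above: 750 required cores / 16 cores per server ≈ 47 servers, so 50 servers leave some headroom; 50 × 256 GB = 12,800 GB of RAM, 50 × 6000 = 300,000 IOPS, and 50 × 2 × 10GE = 1 Tbps of server-facing bandwidth.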

44 Combine physical and virtual appliances in a private cloud http://blog.ipspace.net/2014/02/combine-physical-and-virtual-appliances.html
45 Cloud-as-an-appliance design http://blog.ipspace.net/2013/07/cloud-as-appliance-design.html

The network infrastructure thus has to provide:

- End-to-end VLANs (layer-2 fabric);
- IP connectivity (no lossless Ethernet or FCoE);
- 108 10GE server ports;
- 4 10GE ports for the physical network services appliances;
- 2 10GE WAN uplinks.

DESIGNING THE NETWORK INFRASTRUCTURE

The network infrastructure for the new private cloud deployment will be implemented with a layer-2+3 switching fabric providing equidistant bandwidth to 112 10GE server and appliance ports, and 4 10GE uplinks (connected to the existing data center network or WAN edge routers).

Most modern data center switches offer wire-speed layer-3 switching. The fabric will thus offer layer-2+3 switching even though the network design requirements don't include layer-3 switching.

The required infrastructure can be implemented with just two 10GE ToR switches. Most data center switching vendors (Arista, Brocade, Cisco, Dell Force10, HP, Juniper) offer switches with 48 10GE ports and four 40GE ports that can each be split into four 10GE ports (for a total of 64 10GE ports). Two 40GE ports on each switch would be used as 10GE ports (for a total of 56 10GE ports per switch); the remaining two 40GE ports would be used for the inter-switch link. Alternatively, the Cisco Nexus 5672 has 48 10GE ports and 6 40GE ports, for a total of 72 10GE ports (giving you a considerable safety margin); the Arista 7050SX-128 has 96 10GE and 8 40GE ports.


Modular switches from numerous vendors have a significantly higher number of 10GE ports, allowing you to build even larger 2-switch fabrics (called splines by Arista marketing).

Every data center switching vendor can implement an ECMP layer-2 fabric with no blocked links using multi-chassis link aggregation (Arista: MLAG, Cisco: vPC, HP: IRF, Juniper: MC-LAG). Some vendors offer layer-2 fabric solutions that provide optimal end-to-end forwarding across larger fabrics (Cisco FabricPath, Brocade VCS Fabric, HP TRILL), while other vendors allow you to merge multiple switches into a single management-plane entity (HP IRF, Juniper Virtual Chassis, Dell Force10 stacking). In any case, it's not hard to implement an end-to-end layer-2 fabric with ~100 10GE ports.

MANAGEMENT NETWORK

A mission-critical data center infrastructure should have a dedicated out-of-band management network disconnected from the user and storage data planes. Most network devices and high-end servers have dedicated management ports that can be used to connect these devices to a separate management infrastructure.

The management network does not have high bandwidth requirements (most devices have Fast Ethernet or Gigabit Ethernet management ports); you can build it very effectively with a pair of GE switches.

Do not use existing ToR switches or fabric extenders (FEX) connected to existing ToR switches to build the management network.


The purpose of the management network is to reach infrastructure devices (including ToR switches) even when the network infrastructure malfunctions or experiences forwarding loops and resulting brownouts or meltdowns.

CONCLUSIONS

One cannot design an optimal network infrastructure without a comprehensive set of input requirements. When designing a networking infrastructure for a private or public cloud these requirements include:
 Network services implementation (physical or virtual);
 Virtual network segments implementation (VLANs or overlay virtual networks);
 Transport services offered by the networking infrastructure (VLANs, IP, lossless Ethernet, FCoE …);
 Total number of edge ports by technology (GE, 10GE, FC, FCoE);
 Total east-west and north-south bandwidth.

Most reasonably sized private cloud deployments require a few tens of high-end physical servers and associated storage – either distributed or in the form of storage arrays. You can implement the network infrastructure meeting these requirements with two ToR switches having between 64 and 128 10GE ports.


7

REDUNDANT SERVER-TO-NETWORK CONNECTIVITY

IN THIS CHAPTER:
DESIGN REQUIREMENTS
VLAN-BASED VIRTUAL NETWORKS
REDUNDANT SERVER CONNECTIVITY TO LAYER-2 FABRIC
OPTION 1: NON-LAG SERVER CONNECTIVITY
OPTION 2: SERVER-TO-NETWORK LAG
OVERLAY VIRTUAL NETWORKS
OPTION 1: NON-LAG SERVER CONNECTIVITY
OPTION 2: LAYER-2 FABRIC
OPTION 3: SERVER-TO-NETWORK LAG
CONCLUSIONS

A large enterprise (the Customer) is building a private cloud infrastructure using a leaf-and-spine fabric for internal network connectivity. The virtualization team hasn't decided yet whether to use a commercial product (example: VMware vSphere) or an open-source alternative (KVM with OpenStack). It's also unclear whether VLANs or overlay layer-2 segments will be used to implement virtual networks.

Regardless of the virtualization details, the server team wants to implement redundant server-to-network connectivity: each server will be connected to two ToR switches (see Figure 7-1). The networking team has to build the network infrastructure before having all the relevant input data – the infrastructure should thus be as flexible as possible.

Figure 7-1: Redundant server-to-network connectivity


DESIGN REQUIREMENTS

The virtualization solution deployed in the private cloud may use VLANs as the virtual networking technology  the leaf-and-spine fabric deployed by the networking team MUST support layer-2 connectivity between all attached servers.

Overlay virtual networks may be used in the private cloud, in which case a large layer-2 failure domain is not an optimal solution  the leaf-and-spine fabric SHOULD also support layer-3 connectivity with a separate subnet assigned to each ToR switch (or a redundant pair of ToR switches).

VLAN-BASED VIRTUAL NETWORKS

If the virtualization (or the networking) team decides to use VLANs to implement virtual subnets, then the physical fabric has to be able to stretch a VLAN across all ports connecting servers with VMs belonging to a particular virtual subnet.

The networking team can build a layer-2-only leaf-and-spine fabric with two core switches using multi-chassis link aggregation (MLAG – see Figure 7-2) or they could deploy a layer-2 multipath technology like FabricPath, VCS Fabric or TRILL as shown in Figure 7-3. See the Option 2: Server-to-Network LAG section for more MLAG details.


Figure 7-2: Layer-2 fabric with two spine nodes

Figure 7-3: Layer-2 leaf-and-spine fabric using layer-2 ECMP technology


The choice of the layer-2 fabric technology depends primarily on the size of the fabric (are two core switches enough for the planned number of server ports?) and the vendor supplying the networking gear (most major data center vendors have proprietary layer-2 fabric architectures).

REDUNDANT SERVER CONNECTIVITY TO LAYER-2 FABRIC

When the hypervisor hosts use VLANs to implement virtual networks, the physical switches see the MAC addresses of all VMs deployed in the private cloud. The hypervisor hosts could therefore assign a subset of VMs to every uplink and present each uplink as an individual independent server-to-network link as shown in Figure 7-4 (example: vSphere Load Based Teaming).

Figure 7-4: VMs pinned to a hypervisor uplink

Only the network edge switches see MAC addresses of individual hosts in environments using Provider Backbone Bridging (PBB) or TRILL/FabricPath-based fabrics.


Alternatively, the hypervisor hosts could bundle all uplinks into a link aggregation group (LAG) and spread the traffic generated by the VMs across all the available uplinks (see Figure 7-5).

Figure 7-5: Server-to-network links bundled in a single LAG

OPTION 1: NON-LAG SERVER CONNECTIVITY

Non-LAG server-to-network connectivity is the easiest connectivity option and is available on most hypervisor platforms. Hypervisor hosts don't have to support LACP (or any other LAG protocol), and the ToR switches don't have to support MLAG.

The only caveat of non-LAG server-to-network connectivity is suboptimal traffic flow. Let's consider two hypervisor hosts connected to the same pair of ToR switches (see Figure 7-6).


Figure 7-6: VM-to-uplink pinning with two hypervisor hosts connected to the same pair of ToR switches

Even though the two hypervisors could communicate directly, the traffic between two VMs might have to go all the way through the spine switches (see Figure 7-7) due to VM-to-uplink pinning which presents a VM MAC address on a single server uplink.


Figure 7-7: Suboptimal traffic flow with VM-to-uplink pinning

Conclusion: If the majority of the expected traffic flows between virtual machines and the outside world (North-South traffic), non-LAG server connectivity is ideal. If the majority of the traffic flows between virtual machines (East-West traffic) then the non-LAG design is clearly suboptimal unless the chance of VMs residing on co-located hypervisors is exceedingly small (example: large cloud with tens or even hundreds of ToR switches).


MLAG SWITCH PAIR IN NON-LAG SERVER CONNECTIVITY SCENARIO

Introducing MLAG pairs (or stackable ToR switches) in an environment without server-to-switch LAG decreases the overall network performance. Switches in an MLAG group treat non-LAG ports as orphan ports (links to servers that should be connected to all switches but aren't).

Switches in an MLAG group try to keep traffic arriving on orphan ports and destined to other orphan ports within the MLAG group. Such traffic thus traverses the intra-stack (or peer) links instead of leaf-and-spine links46 as shown in Figure 7-8.

Figure 7-8: Traffic flow between orphan ports

46 vSwitch in MLAG environments http://blog.ipspace.net/2011/01/vswitch-in-multi-chassis-link.html


You might have to increase the bandwidth of intra-stack links to cope with the increased amount of east-west traffic (leaf-to-spine bandwidth in well-designed Clos fabrics is usually significantly higher than intra-stack bandwidth), but it's way easier to remove MLAG pairing (or stacking) between ToR switches and dedicate all non-server-facing ports to leaf-to-spine uplinks47.

Conclusion: Do not use MLAG or switch stacking in environments with non-LAG server-to-network connectivity.

OPTION 2: SERVER-TO-NETWORK LAG

Hypervisor software might bundle all server-to-network uplinks in a single link aggregation group (LAG), resulting in optimal traffic flow from the server to the ToR switch (all VMs could use the aggregate bandwidth of all uplinks). Most LAG solutions place traffic generated by a single TCP session onto a single uplink, limiting the TCP session throughput to the bandwidth of a single uplink interface.

Dynamic NIC teaming available in Windows Server 2012 R2 can split a single TCP session into multiple flowlets and distribute them across all uplinks.

Ethernet LAG was designed to work between a single pair of devices – bundling links connected to different ToR switches requires Multi-Chassis Link Aggregation (MLAG) support in ToR switches48.

47 Link aggregation with stackable data center ToR switches http://blog.ipspace.net/2011/01/vswitch-in-multi-chassis-link.html
48 Multi-Chassis Link Aggregation (MLAG) basics http://blog.ipspace.net/2010/10/multi-chassis-link-aggregation-basics.html


You could configure a link aggregation group between a server and a pair of ToR switches as a regular port channel (or LAG) using Link Aggregation Control Protocol (LACP) to manage the LAG (see Figure 7-9), or as a static port channel without LACP.

Figure 7-9: LACP between a server and ToR switches

Static port channel is the only viable alternative when using older hypervisors (example: vSphere 5.0), but since this option doesn't use a handshake/link monitoring protocol, it's impossible to detect wiring mistakes or misbehaving physical interfaces. Static port channels are thus inherently unreliable and should not be used if at all possible.

Switches participating in an MLAG group (or stack) exchange the MAC addresses received from the attached devices, and a switch receiving a packet for a destination MAC address reachable over a LAG link always uses a local member of that LAG to reach the destination49 (see Figure 7-10). A design with servers dual-connected with LAGs to pairs of ToR switches therefore results in optimal traffic flow regardless of VM placement and eventual VM-to-uplink pinning done by the hypervisors.

49 MLAG and hot potato switching http://blog.ipspace.net/2010/12/multi-chassis-link-aggregation-mlag-and.html

Figure 7-10: Optimal traffic flow with MLAG

The only drawback of the server-to-network LAG design is the increased complexity introduced by MLAG groups.


OVERLAY VIRTUAL NETWORKS

Overlay virtual networks use MAC-over-IP encapsulation to transport MAC frames generated by VMs across a physical IP fabric. Most implementations use a single hypervisor IP address as the source or destination IP address of the encapsulated packets, requiring all ToR switches to which a server is connected to be in the same IP subnet (and consequently in the same VLAN and layer-2 failure domain) as displayed in Figure 7-11.
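To illustrate the single-IP-address behavior, here is a minimal VXLAN sketch on a Linux hypervisor using iproute2; the interface names, VNI and addresses are hypothetical, and BUM traffic handling (head-end replication or a controller) is omitted.

# Minimal sketch (hypothetical names/addresses): one VTEP address per hypervisor,
# so all encapsulated overlay traffic is sourced from a single IP address.
ip addr add 10.0.1.11/24 dev bond0                 # single VTEP address on the uplink bond
ip link add vxlan5001 type vxlan id 5001 \
    local 10.0.1.11 dstport 4789 dev bond0         # all overlay packets use 10.0.1.11 as source
ip link set vxlan5001 up
ip link add br-tenant1 type bridge                 # VM-facing bridge attached to the overlay segment
ip link set vxlan5001 master br-tenant1
ip link set br-tenant1 up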

Figure 7-11: Redundant server connectivity requires the same IP subnet on adjacent ToR switches


OPTION 1: NON-LAG SERVER CONNECTIVITY

Hypervisor hosts offer several packet distribution mechanisms to send traffic from a single kernel IP address over multiple non-aggregated uplinks:
 vSphere uses a single active uplink for each VMkernel interface.
 The Linux bonding driver can send traffic from a single IP address through multiple uplinks using one or more MAC addresses (see Linux Bonding Driver Implementation Details).

In most setups the hypervisor associates its IP address with a single MAC address (ARP replies sent by the hypervisor use a single MAC address), and that address cannot be visible over more than a single server-to-switch link (or LAG). Most switches would report MAC address flapping when receiving traffic from the same source MAC address through multiple independent interfaces. The traffic toward the hypervisor host (including all encapsulated virtual network traffic) would thus use a single server-to-switch link (see Figure 7-12).


Figure 7-12: A single uplink is used without server-to-ToR LAG

The traffic sent from a Linux hypervisor host could use multiple uplinks (with a different MAC address on each active uplink) when the host uses balance-tlb or balance-alb bonding mode (see Linux Bonding Driver Implementation Details) as shown in Figure 7-13.


Figure 7-13: All uplinks are used by a Linux host using balance-tlb bonding mode

SUBOPTIMAL TRAFFIC FLOW DUE TO LEAF-TO-SPINE ECMP ROUTING

Most leaf-and-spine network designs rely heavily on ECMP load distribution between leaf and spine nodes; all ToR switches thus advertise directly connected subnets with the same metric. In a typical design (displayed in Figure 7-14) the spine switches send the traffic for the target IP subnet to both ToR switches advertising the subnet, and since the server MAC addresses appear to be connected to a single ToR switch, the spine switches send half of the traffic to the wrong ToR switch.

Figure 7-14: All ToR switches advertise IP subnets with the same cost

Stackable switches are even worse. While it’s possible to advertise an IP subnet shared by two ToR switches with different metrics to attract the traffic to the primary ToR switch, the same approach doesn’t work with stackable switches, which treat all members of the stack as a single virtual IP router, as shown in Figure 7-15.


Figure 7-15: IP routing with stackable switches

Conclusion: Do not use non-LAG server connectivity in overlay virtual networking environments.

LINUX BONDING DRIVER IMPLEMENTATION DETAILS

Active-backup mode of the Linux bonding driver50 uses a single active uplink, falling back to another (backup) uplink in case of active uplink failure. All the traffic is sent and received through the active uplink.

50 See Linux bonding driver documentation for more details https://www.kernel.org/doc/Documentation/networking/bonding.txt


Balance-tlb mode uses multiple source MAC addresses for the same IP address, but a single MAC address in ARP replies. Traffic from the server is sent through all active uplinks; return traffic uses a single (primary) uplink. This mode is obviously not optimal in scenarios with a large percentage of east-west traffic.

Balance-alb mode replaces the MAC address in the ARP replies sent by the Linux kernel with one of the physical interface MAC addresses, effectively assigning different MAC addresses (and thus uplinks) to IP peers, thus achieving rudimentary inbound load distribution.

All other bonding modes (balance-rr, balance-xor, 802.3ad) use the same MAC address on multiple active uplinks and thus require port channel (LAG) configuration on the ToR switch to work properly.
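As an illustration, the bonding mode is selected when the bond device is created; the sketch below (interface names hypothetical) builds a balance-alb bond that needs no port channel or MLAG configuration on the ToR switches.

# Minimal sketch (hypothetical interface names): balance-alb bond on a Linux
# hypervisor host -- no LAG/MLAG configuration needed on the ToR switches.
ip link add bond0 type bond mode balance-alb miimon 100
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
cat /proc/net/bonding/bond0        # verify the mode and the state of the member links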

OPTION 2: LAYER-2 FABRIC

One could bypass the summarization properties of IP routing protocols (which advertise subnets and not individual host IP addresses) by using a layer-2 transport fabric between the hypervisor hosts, as shown in Figure 7-16.


Figure 7-16: Layer-2 fabric between hypervisor hosts

All edge switches participating in a layer-2 fabric would have full MAC address reachability information and would be able to send the traffic to individual hypervisor hosts over an optimal path (assuming the fabric links are not blocked by Spanning Tree Protocol) as illustrated in Figure 7-17.


Figure 7-17: Optimal flow of balance-tlb traffic across a layer-2 fabric

Layer-2 transport fabrics have another interesting property: they allow you to spread the load evenly across all ToR switches (and leaf-to-spine links) in environments using server uplinks in primary/backup mode – all you have to do is spread the primary links evenly across all ToR switches.

Unfortunately, a single layer-2 fabric represents a single broadcast and failure domain51 – using a layer-2 fabric in combination with overlay virtual networks (which don't require layer-2 connectivity between hypervisor hosts) is therefore suboptimal from the resilience perspective.

51 Layer-2 network is a single failure domain http://blog.ipspace.net/2012/05/layer-2-network-is-single-failure.html


OPTION 3: SERVER-TO-NETWORK LAG

Dynamic link aggregation (using LACP) between a server and a pair of ToR switches (displayed in Figure 7-18) is the optimal edge design in routed leaf-and-spine networks with redundant server connectivity:
 Layer-2 domains are small (two ToR switches share a VLAN);
 ToR switches can reach server MAC addresses directly (switches in an MLAG group exchange MAC addresses learned from traffic received on port channel interfaces);
 Servers can send encapsulated traffic across all uplinks  flow of northbound (server-to-network) traffic is optimal;
 Both ToR switches can send the traffic to adjacent servers directly  flow of southbound (network-to-server) traffic is optimal.

Figure 7-18: LAG between a server and adjacent ToR switches


The LAGs used between servers and switches should use LACP to prevent traffic blackholing (see Option 2: Server-to-Network LAG for details), and the servers and the ToR switches should use 5-tuple load balancing.

VXLAN and STT encapsulations use source ports in UDP or TCP headers to increase the packet entropy and the effectiveness of ECMP load balancing. Most other encapsulation mechanisms use GRE transport, effectively pinning the traffic between a pair of hypervisors to a single path across the network.
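As a sketch of what the server side of this design might look like on a Linux hypervisor host (interface names hypothetical), an LACP bond with layer-3+4 transmit hashing approximates the 5-tuple load balancing described above; the corresponding ToR switch ports would have to be configured as an MLAG/vPC port channel running LACP.

# Minimal sketch (hypothetical interface names): LACP (802.3ad) bond with
# layer3+4 transmit hashing; the ToR side must be an MLAG/vPC port channel.
ip link add bond0 type bond mode 802.3ad lacp_rate fast xmit_hash_policy layer3+4 miimon 100
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up

Because VXLAN and STT vary the transport-layer source port per flow, this hashing also spreads encapsulated overlay traffic across both uplinks.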

CONCLUSIONS

The most versatile leaf-and-spine fabric design uses dynamic link aggregation between servers and pairs of ToR switches. This design requires MLAG functionality on ToR switches, which does increase the overall network complexity, but the benefits far outweigh the complexity increase – the design works well with layer-2 fabrics (required by VLAN-based virtual networks) or layer-3 fabrics (recommended as transport fabrics for overlay virtual networks) and usually results in optimal traffic flow (the only exception being the handling of traffic sent toward orphan ports – this traffic might have to traverse the link between MLAG peers).

You might also use layer-2 fabrics without server-to-network link aggregation for VLAN-based virtual networks where hypervisors pin VM traffic to one uplink, or for small overlay virtual networks when you're willing to trade the resilience of a layer-3 fabric for the reduced complexity of non-MLAG server connectivity.

Finally, you SHOULD NOT use non-MLAG server connectivity in layer-3 fabrics, or MLAG (or stackable switches) in layer-2 environments without server-to-switch link aggregation.


8

REPLACING THE CENTRAL FIREWALL

IN THIS CHAPTER:
FROM PACKET FILTERS TO STATEFUL FIREWALLS
DESIGN ELEMENTS
PROTECTING THE SERVERS WITH PACKET FILTERS
PROTECTING VIRTUALIZED SERVERS WITH STATEFUL FIREWALLS
PER-APPLICATION FIREWALLS
PACKET FILTERS AT WAN EDGE
DESIGN OPTIONS
BEYOND THE TECHNOLOGY CHANGES


ACME Inc. has a data center hosting several large-scale web applications. Their existing data center design uses the traditional enterprise approach:
 The data center is segmented into several security zones (web servers, application servers, database servers, supporting infrastructure);
 Servers belonging to different applications reside within the same security zone, increasing the risk of lateral movement in case of a web- or application-server breach;
 Large layer-2 segments connect all servers in the same security zone, further increasing the risk of a cross-protocol attack52;
 All inter-zone traffic is controlled by a pair of central firewalls, which are becoming nearly impossible to manage;
 The central firewalls are also becoming a chokepoint, severely limiting the growth of ACME's application infrastructure.

The networking engineers designing the next-generation data center for ACME would like to replace the central firewalls with iptables deployed on application servers, but are reluctant to do so due to potential security implications.

FROM PACKET FILTERS TO STATEFUL FIREWALLS

The ACME engineers have to find the optimal mix of traffic filtering solutions that will:
 Satisfy the business-level security requirements of ACME Inc., including potential legal, regulatory and compliance requirements;
 Be easy to scale as the application traffic continues to grow;
 Not require large-scale upgrades when the application traffic reaches a certain limit (which is the case with the existing firewalls).

52 Compromised security zone = Game Over http://blog.ipspace.net/2013/04/compromised-security-zone-game-over-or.html

Effectively, they're looking for a scale-out solution that will ensure approximately linear growth, with a minimum amount of state to reduce the complexity and processing requirements. While designing the overall application security architecture, they could use the following tools:

Packet filters (or access control lists – ACLs) are the bluntest of traffic filtering tools: they match (and pass or drop) individual packets based on their source and destination network addresses and transport layer port numbers. They keep no state (making them extremely fast and implementable in simple hardware) and thus cannot check the validity of transport layer sessions or fragmented packets. Some packet filters give you the option of permitting or dropping fragments based on network layer information (source and destination addresses), others either pass or drop all fragments (and sometimes the behavior is not even configurable).

Packet filters are easy to use in server-only environments, but become harder to maintain when servers start establishing client sessions to other servers (example: application servers opening MySQL sessions to database servers). They are not the right tool in environments where clients establish ad-hoc sessions to random destination addresses (example: servers opening random sessions to Internet-based web servers).

Packet filters with automatic reverse rules (example: XenServer vSwitch Controller) are syntactic sugar on top of simple packet filters. Whenever you configure a filtering rule (example: permit inbound TCP traffic to port 80), the ACL configuration software adds a reverse rule in the other direction (permit outbound TCP traffic from port 80).

ACLs that allow matches on established TCP sessions (typically matching TCP traffic with the ACK or RST bit set) make it easier to match outbound TCP sessions. In a server-only environment you can use them to match inbound TCP traffic on specific port numbers and outbound traffic of established TCP sessions (to prevent simple attempts to establish outbound sessions from hijacked servers); in a client-only environment you can use them to match return traffic.

Reflexive access lists (Cisco IOS terminology) are the simplest stateful tool in the filtering arsenal. Whenever a TCP or UDP session is permitted by an ACL, the filtering device adds a 5-tuple matching the return traffic of that session to the reverse ACL. Reflexive ACLs generate one filtering entry per transport layer session. Not surprisingly, you won't find them in platforms that do packet forwarding and filtering in hardware – they would quickly overload the TCAM (or whatever forwarding/filtering hardware the device is using), cause packet punting to the main CPU53 and reduce the forwarding performance by orders of magnitude. Even though reflexive ACLs generate per-session entries (and thus block unwanted traffic that might have been permitted by other less-specific ACLs), they still work on individual packets and thus cannot reliably detect and drop malicious fragments or overlapping TCP segments.

Transport layer session inspection combines reflexive ACLs with fragment reassembly and transport-layer validation. It should detect dirty tricks targeting bugs in host TCP/IP stacks like overlapping fragments or TCP segments.

53 Process, Fast and CEF Switching, and Packet Punting http://blog.ipspace.net/2013/02/process-fast-and-cef-switching-and.html
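To make the distinction more concrete, here is a minimal iptables sketch (hypothetical ports, not ACME's actual rule set) that approximates a stateless packet filter with automatic reverse rules plus an "established TCP" match on a Linux-based server; no connection tracking is involved.

# Minimal sketch (hypothetical ports): stateless server protection combining
# a reverse rule and an "established TCP" style match.
iptables -A INPUT  -p tcp --dport 80 -j ACCEPT        # inbound HTTP to the server
iptables -A OUTPUT -p tcp --sport 80 -j ACCEPT        # automatic reverse rule
iptables -A OUTPUT -p tcp ! --syn -j ACCEPT           # permit only non-SYN outbound TCP
                                                      # (blocks new sessions from a hijacked server)
iptables -A INPUT  -j DROP                            # drop everything else
iptables -A OUTPUT -j DROP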


Application level gateways (ALG) add application awareness to reflexive ACLs. They're usually used to deal with applications that exchange transport session endpoints54 (IP addresses and port numbers) in the application payload (FTP and SIP are the well-known examples). An ALG would detect the requests to open additional data sessions and create additional transport-level filtering entries.

Web Application Firewalls (WAF) have to go way beyond ALGs. ALGs try to help applications get the desired connectivity and thus don't focus on malicious obfuscations. WAFs have to stop the obfuscators; they have to parse application-layer requests like a real server would to detect injection attacks55. Needless to say, you won't find full-blown WAF functionality in reasonably priced high-bandwidth firewalls.

DESIGN ELEMENTS

ACME designers can use numerous design elements to satisfy the security requirements, including:
 Traffic filtering device protecting every server;
 Stateful firewall protecting every server;
 Per-application firewalls;
 Packet filtering or stateful firewalling at the WAN edge.

54 The FTP Butterfly Effect http://blog.ipspace.net/2010/03/ftp-butterfly-effect.html
55 Exploits of a Mom (aka Little Bobby Tables) http://xkcd.com/327/


PROTECTING THE SERVERS WITH PACKET FILTERS

The design in which a packet filter protects every server has the best performance and the best scalability properties of all the above-mentioned designs. Packet filters are fast (even software implementations can easily reach multi-gigabit speeds) and scale linearly (there's no state to keep or synchronize).

Figure 8-19: Packet filters protecting individual servers


Per-server traffic filters (packet filters or firewalls) also alleviate the need for numerous security zones, as the protection of individual servers no longer relies on the zoning concept. In a properly operated environment one could have all servers of a single application stack (or even servers from multiple application stacks) in a single layer-2 or layer-3 domain.

Scale-out packet filters require a high level of automation – they have to be deployed automatically from a central orchestration system to ensure consistent configuration and prevent operator mistakes.

In environments with an extremely high level of trust in the server operating system hardening one could use iptables on individual servers. In most other environments it's better to deploy the packet filters outside of the application servers – an intruder breaking into a server and gaining root access could easily turn off the packet filter.

You could deploy packet filters protecting servers from the outside on first-hop switches (usually Top-of-Rack or End-of-Row switches), or on hypervisors in a virtualized environment. Packet filters deployed on hypervisors are a much better alternative – hypervisors are not limited by the size of packet filtering hardware (TCAM), allowing the security team to write very explicit application-specific packet filtering rules permitting traffic between individual IP addresses instead of IP subnets (see also High-Speed Multi-Tenant Isolation for more details, and the sketch after the following list).

All major hypervisors support packet filters on VM-facing virtual switch interfaces:
 vSphere 5.5 and Windows Server 2012 R2 have built-in support for packet filters;
 Linux-based hypervisors can use iptables in the hypervisor kernel, achieving the same results as using iptables in the guest VM in a significantly more secure way;
 Cisco Nexus 1000V provides the same ACL functionality and configuration syntax in vSphere, Hyper-V and KVM environments.

Environments using high-performance bare-metal servers could redeploy these servers as VMs in a single-VM-per-host setup, increasing deployment flexibility, easing upgrades, and providing traffic control outside of the guest OS.
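As an illustration of hypervisor-resident filtering on a Linux-based host, the rules below (VM addresses hypothetical) are applied in the FORWARD chain, so they stay in effect even if an intruder gains root access inside the guest; this is a sketch, not a complete hypervisor firewall.

# Minimal sketch (hypothetical VM addresses): application-specific rules enforced
# in the hypervisor kernel on bridged VM traffic, outside the guest OS.
# (Assumes bridged traffic is handed to iptables, e.g. net.bridge.bridge-nf-call-iptables=1.)
iptables -A FORWARD -p tcp -s 192.0.2.10 -d 192.0.2.20 --dport 3306 -j ACCEPT   # app VM -> DB VM (MySQL)
iptables -A FORWARD -p tcp -d 192.0.2.10 --dport 8080 -j ACCEPT                 # web tier -> app VM
iptables -A FORWARD -d 192.0.2.10 -j DROP                                       # everything else toward the app VM
iptables -A FORWARD -d 192.0.2.20 -j DROP                                       # everything else toward the DB VM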


PROTECTING VIRTUALIZED SERVERS WITH STATEFUL FIREWALLS

Numerous hypervisor vendors (or vendors of add-on products) offer stateful firewall functionality inserted between the VM NIC and the adjacent virtual switch port.

Figure 8-20: VM NIC firewalls


Most VM NIC firewall products56 offer centralized configuration, automatically providing the automated deployment of configuration changes mentioned in the previous section. The implementation details that affect the scalability or performance of VM NIC virtual firewalls vary greatly between individual products:
 Distributed firewalls in VMware NSX, Juniper Virtual Gateway57, and Hyper-V firewalls using the filtering functionality of the Hyper-V Extensible Switch58 are in-kernel firewalls which offer true scale-out performance limited only by the number of available CPU resources;
 vShield App or Zones uses a single firewall VM per hypervisor host and passes all guest VM traffic through the firewall VM, capping the server I/O throughput to the throughput of a single-core VM (3-4 Gbps);
 Cisco Nexus 1000V sends the first packets of every new session to the Cisco Virtual Security Gateway59, which might be deployed somewhere else in the data center, increasing the session setup delay. Subsequent packets of the same session are switched in the Nexus 1000V VEM module60 residing in the hypervisor kernel;

56 Virtual Firewall Taxonomy http://blog.ipspace.net/2012/11/virtual-firewall-taxonomy.html
57 Juniper Virtual Gateway – a Virtual Firewall Done Right http://blog.ipspace.net/2011/11/junipers-virtual-gateway-virtual.html
58 Hyper-V Extensible Virtual Switch http://blog.ipspace.net/2013/05/hyper-v-30-extensible-virtual-switch.html
59 Cisco Virtual Security Gateway (VSG) http://blog.ipspace.net/2012/10/cisco-virtual-security-gateway-vsg-101.html
60 What Exactly Is a Nexus 1000V? http://blog.ipspace.net/2011/06/what-exactly-is-nexus-1000v.html


 HP TippingPoint vController sends all guest VM traffic through an external TippingPoint appliance, which becomes a bottleneck similar to centralized firewalls.

You should ask the following questions when comparing individual VM NIC firewall products:
 Is the filtering solution stateless or stateful?
 Is the session state moved with the VM to another hypervisor, or is the state recreated from packets of already-established sessions inspected after the VM move?
 Is there a per-hypervisor control VM?
 Is the firewalling performed in a kernel module or in a VM?
 Is a control VM involved in flow setup?
 What happens when the control VM fails?


PER-APPLICATION FIREWALLS

Per-application firewalls are a more traditional approach to application security: each application stack has its own firewall context (on a physical appliance) or a virtual firewall.

Figure 8-21: Per-application firewalls

Per-application firewalls (or contexts) significantly reduce the complexity of the firewall rule set – after all, a single firewall (or firewall context) contains only the rules pertinent to a single application. A per-application firewall is also easily removed at application retirement time, automatically reducing the number of hard-to-audit stale firewall rules.


VM-based firewalls or virtual contexts on physical firewalls are functionally equivalent to traditional firewalls and thus pose no additional technical challenges. They do require a significant change in deployment, management and auditing processes – just as it's impossible to run thousands of virtualized servers using mainframe tools, it's impossible to operate hundreds of small firewalls using the processes and tools suited for centralized firewall appliances61.

Virtual firewall appliances used to have significantly lower performance than their physical counterparts62. The situation changed drastically with the introduction of recent Xeon CPUs (and their AMD equivalents); the performance of virtual firewalls and load balancers is almost identical to entry-level physical products63.

PACKET FILTERS AT WAN EDGE

Large web properties use packet filters, not stateful firewalls, at the WAN edge. All traffic sent to a server from a client is by definition unsolicited (example: TCP traffic to port 80 or 443), and a stateful firewall cannot add much value protecting a properly hardened operating system64.

61 Who will manage all those virtual firewalls? http://blog.ipspace.net/2013/12/omg-who-will-manage-all-those-virtual.html
62 Virtual network appliances: benefits and drawbacks http://blog.ipspace.net/2011/04/virtual-network-appliances-benefits-and.html
63 Virtual appliance performance is becoming a non-issue http://blog.ipspace.net/2013/04/virtual-appliance-performance-is.html
64 I don't need no stinking firewall, or do I? http://blog.ipspace.net/2010/08/i-dont-need-no-stinking-firewall-or-do.html


Firewalls were traditionally used to protect poorly written server TCP stacks. These firewalls would remove out-of-window TCP segments and certain ICMP packets, and perform IP fragment reassembly. Modern operating systems no longer need such protection.

Packet filters permitting only well-known TCP and UDP ports combined with hardened operating systems offer protection similar to stateful firewalls; the real difference between the two is the handling of outgoing sessions (sessions established from clients in a data center to servers outside of the data center). These sessions are best passed through a central proxy server, which can also provide application-level payload inspection.


Figure 8-22: High-performance WAN edge packet filters combined with a proxy server


The traffic filtering rules in this design become exceedingly simple:

Inbound WAN edge ACL:
 Permit TCP traffic to well-known application ports (example: ports 80 and 443);
 Permit UDP traffic to DNS and NTP servers.

Outbound WAN edge ACL:
 Permit established TCP sessions from well-known application ports;
 Permit DNS and NTP requests and responses.

Inbound proxy-facing ACL:
 Permit traffic of established TCP sessions to the proxy server.

Outbound proxy-facing ACL:
 Permit TCP traffic from the proxy server (the proxy server uses an internal DNS server).
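If these WAN edge rules were prototyped on a Linux-based packet filter (server addresses hypothetical), they might look like the sketch below; a hardware router or switch ACL would express the same matches in its own syntax.

# Minimal sketch (hypothetical addresses): WAN edge rules as stateless iptables rules.
# Inbound (Internet -> data center)
iptables -A FORWARD -i wan0 -p tcp -m multiport --dports 80,443 -j ACCEPT       # well-known application ports
iptables -A FORWARD -i wan0 -p udp -d 192.0.2.53 --dport 53 -j ACCEPT           # DNS server
iptables -A FORWARD -i wan0 -p udp -d 192.0.2.123 --dport 123 -j ACCEPT         # NTP server
# Outbound (data center -> Internet)
iptables -A FORWARD -o wan0 -p tcp ! --syn -m multiport --sports 80,443 -j ACCEPT   # established sessions from app ports
iptables -A FORWARD -o wan0 -p udp --sport 53 -j ACCEPT                         # DNS responses
iptables -A FORWARD -j DROP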


DESIGN OPTIONS

ACME designers can combine the design elements listed in the previous section to satisfy the security requirements of individual applications:
 WAN edge packet filters combined with per-server (or VM NIC) packet filters are good enough for environments with well-hardened servers or low security requirements;
 WAN edge packet filters combined with per-application firewalls are an ideal solution for security-critical applications in a high-performance environment. A high-performance data center might use packet filters in front of most servers and per-application firewalls in front of critical applications (example: credit card processing);
 Environments that require stateful firewalls between the data center and external networks could use a combination of WAN edge firewall and per-server packet filters, or a combination of WAN edge firewall and per-application firewalls;
 In extreme cases one could use three (or more) layers of defense: a WAN edge firewall performing coarse traffic filtering and HTTP/HTTPS inspection, and another layer of stateful firewalls or WAFs protecting individual applications, combined with per-server protection (packet filters or firewalls).


BEYOND THE TECHNOLOGY CHANGES

Migration from a centralized firewall to a distributed system with numerous traffic control points is a fundamental change in the security paradigm and requires a thorough redesign of security-related processes, roles and responsibilities:
 Distributed traffic control points (firewalls or packet filters) cannot be configured and managed with the same tools as a single device. The ACME operations team SHOULD use an orchestration tool that will deploy the traffic filters automatically (most cloud orchestration platforms and virtual firewall products include tools that can automatically deploy configuration changes across a large number of traffic control points). System administrators went through a similar process when they migrated workloads from mainframe computers to x86-based servers;
 Per-application traffic control is much simpler and easier to understand than a centralized firewall ruleset, but it's impossible to configure and manage tens or hundreds of small point solutions manually. The firewall (or packet filter) management SHOULD use the automation, orchestration and management tools the server administrators already use to manage a large number of servers;
 Application teams SHOULD become responsible for the whole application stack, including the security products embedded in it. They might not configure the firewalls or packet filters themselves, but SHOULD own them in the same way they own all other specialized components in the application stack like databases;
 The security team's role SHOULD change from enforcer of security to validator of security – they should validate and monitor the implementation of security mechanisms instead of focusing on configuring the traffic control points. Simple tools like nmap probes deployed outside and within the data center are good enough to validate the proper implementation of L3-4 traffic control solutions, including packet filters and firewalls (see the sketch after this list).
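A minimal validation sketch (subnet and ports hypothetical), run from a probe host outside the data center and again from within, to confirm that only the intended ports are reachable through the deployed filters:

# Minimal sketch (hypothetical subnet/ports): verify which L3-4 ports are reachable.
nmap -Pn -sS -p 22,80,443,3306 192.0.2.0/24 -oN filter-audit.txt
# -Pn: skip host discovery (packet filters often drop ICMP)
# -sS: TCP SYN scan of the listed ports
# Compare the "open" ports in filter-audit.txt against the intended security policy.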


9

COMBINE PHYSICAL AND VIRTUAL APPLIANCES IN A PRIVATE CLOUD

IN THIS CHAPTER:
EXISTING NETWORK SERVICES DESIGN
SECURITY REQUIREMENTS
PRIVATE CLOUD INFRASTRUCTURE
NETWORK SERVICES IMPLEMENTATION OPTIONS
NETWORK TAPPING AND IDS
LOAD BALANCING
NETWORK-LEVEL FIREWALLS
DEEP PACKET INSPECTION AND APPLICATION-LEVEL FIREWALLS
SSL AND VPN TERMINATION
THE REALITY INTERVENES


The central IT department of the government of Genovia is building a new private cloud which will consolidate workloads currently being run at satellite data centers throughout various ministries. The new private cloud should offer centralized security, quick application deployment capabilities, and easy integration of existing application stacks that are using a variety of firewalls and load balancers from numerous vendors.

EXISTING NETWORK SERVICES DESIGN

All current data centers (run by the central IT department and various ministries) use traditional approaches to network services deployment (see Figure 9-1):
 A pair of central firewalls and load balancers;
 A small number of security zones implemented with VLAN-based subnets;
 Security zones shared by multiple applications.


Figure 9-1: Centralized network services implemented with physical appliances


The central IT department provides Internet connectivity to the whole government; other data centers connect to the private WAN network as shown in Figure 9-2.

Figure 9-2: Centralized network services implemented with physical appliances


SECURITY REQUIREMENTS

Applications run within the existing data centers have highly varying security requirements. Most of them need some sort of network-layer protection; some of them require deep packet inspection or application-level firewalls that have been implemented with products from numerous vendors (each ministry or department used to have its own purchasing processes).

Most application stacks rely on data stored in internal databases or in the central database server (resident in the central data center); some applications need access to third-party data reachable over the Internet or a tightly-controlled extranet connected to the private WAN network (see Figure 9-3).

Figure 9-3: Applications accessing external resources


Migrating the high-security applications into the security zones that have already been established within the central data center is obviously out of the question – some of these applications are not allowed to coexist in the same security zone with any other workload. The number of security zones in the consolidated data center will thus drastically increase, more so if the cloud architects decide to make each application an independent tenant with its own set of security zones65.

PRIVATE CLOUD INFRASTRUCTURE

The consolidated private cloud infrastructure will be implemented with the minimum possible variability of the physical hardware to minimize the hardware maintenance costs, effectively implementing the cloud-as-an-appliance design66. For more details, please read the Designing a Private Cloud Network Infrastructure chapter.

The cloud architecture team decided to virtualize the whole infrastructure, including large bare-metal servers, which will be implemented as a single VM running on a dedicated physical server, and network services appliances, which will be implemented with open-source or commercial products in VM format.

65 Make every application an independent tenant http://blog.ipspace.net/2013/11/make-every-application-independent.html
66 Cloud-as-an-Appliance design http://blog.ipspace.net/2013/07/cloud-as-appliance-design.html


NETWORK SERVICES IMPLEMENTATION OPTIONS

The new centralized private cloud infrastructure has to offer the following network services:
 Centralized DNS server;
 Network tapping and IDS capabilities;
 Load balancing capabilities for scale-out applications;
 Network-level firewalls;
 Application-level firewalls and/or deep packet inspection;
 SSL and VPN termination.

NETWORK TAPPING AND IDS

The unified approach to workload deployment resulted in a simplified network tapping infrastructure: server-level tapping is implemented in the hypervisors, with additional hardware taps or switch SPAN ports deployed as needed.

IDS devices will be deployed as VMs on dedicated hardware infrastructure to ensure the requisite high performance; high-speed IDS devices inspecting the traffic to and from the Internet will use hypervisor bypass capabilities made possible with SR-IOV or similar technologies67,68.

67 VM-FEX: not as convoluted as it looks http://blog.ipspace.net/2011/08/vm-fex-how-convoluted-can-you-get.html
68 Cisco and VMware – merging the virtual and physical NICs http://blog.ipspace.net/2012/03/cisco-vmware-merging-virtual-and.html


LOAD BALANCING

The "virtualize everything" approach to cloud infrastructure significantly simplifies the implementation of load balancing services. Load balancing could be offered as a service (implemented with a centrally-managed pool of VM-based load balancing appliances), or implemented with per-application load balancing instances.

Individual customers (ministries or departments) migrating their workloads into the centralized private cloud infrastructure could also choose to continue using their existing load balancing vendors, and simply migrate their own load balancing architecture into a fully virtualized environment (a Bring-Your-Own-Load-Balancer approach).

NETWORK-LEVEL FIREWALLS

Most hypervisor or cloud orchestration products support VM NIC-based packet filtering capabilities, either in the form of simple access lists, or in the form of distributed (semi-)stateful firewalls. The centralized private cloud infrastructure could use these capabilities to offer baseline security to all tenants.

Individual tenants could increase the security of their applications by using firewall appliances offered by the cloud infrastructure (example: vShield Edge) or their existing firewall products in VM format.
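If the cloud orchestration layer were OpenStack (one of the options mentioned in an earlier chapter), the baseline per-tenant filtering would typically be expressed as security group rules; the group, image, flavor and network names below are hypothetical.

# Minimal sketch (hypothetical names/ports), assuming an OpenStack-based cloud:
# baseline VM NIC filtering expressed as security-group rules.
openstack security group create web-tier
openstack security group rule create --protocol tcp --dst-port 80 web-tier     # HTTP from anywhere
openstack security group rule create --protocol tcp --dst-port 443 web-tier    # HTTPS from anywhere
openstack server create --image ubuntu --flavor m1.small \
    --security-group web-tier --network tenant-net web-01                      # attach the group at VM creation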


DEEP PACKET INSPECTION AND APPLICATION-LEVEL FIREWALLS

As of the summer of 2014, no hypervisor vendor offers deep packet inspection (DPI) capabilities in products bundled with the hypervisor. DPI or application-level firewalls (example: Web Application Firewalls – WAF) have to be implemented as VM-based appliances. Yet again, the tenants might decide to use the default DPI/WAF product offered from the cloud inventory catalog, or bring their own solution in VM format.

SSL AND VPN TERMINATION

Most load balancer and firewall vendors offer their software solutions (which usually include SSL and VPN termination) in VM-based format. Most of these solutions offer more than enough firewalling and load balancing performance for the needs of a typical enterprise application69,70, but might have severe limitations when it comes to encryption and public key/certificate handling capabilities.

Modern Intel and AMD CPUs handle AES encryption in hardware71, resulting in a high-speed encryption/decryption process as long as the encryption peers negotiate AES as the encryption algorithm, the appliance vendor uses the AES-NI instruction set in its software, and the VM runs on a server with an AES-NI-capable CPU.

The RSA algorithm performed during the SSL handshake is still computationally intensive; software implementations might have performance that is orders of magnitude lower than the performance of dedicated hardware used in physical appliances.

Total encrypted throughput and the number of SSL transactions per second offered by a VM-based load balancing or firewalling product should clearly be one of the major considerations during your product selection process if you plan to implement SSL or VPN termination on these products.

69 Virtual appliance performance is becoming a non-issue http://blog.ipspace.net/2013/04/virtual-appliance-performance-is.html
70 Dedicated hardware in network services appliances? http://blog.ipspace.net/2013/05/dedicated-hardware-in-network-services.html
71 AES instruction set http://en.wikipedia.org/wiki/AES_instruction_set
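A quick way to see the gap between bulk AES throughput and RSA handshake performance on a given host or VM is to run the OpenSSL built-in benchmarks; this is a rough illustration, not a substitute for vendor SSL transactions-per-second numbers.

# Minimal sketch: compare hardware-accelerated AES bulk throughput with RSA
# handshake performance on the server where the virtual appliance would run.
grep -m1 -o aes /proc/cpuinfo          # prints "aes" if the CPU exposes AES-NI
openssl speed -evp aes-128-cbc         # bulk encryption throughput (uses AES-NI when available)
openssl speed rsa2048                  # RSA sign/verify operations per second (SSL handshake cost)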

THE REALITY INTERVENES

The virtualize-everything approach to cloud infrastructure clearly results in maximum flexibility, but although everyone agreed that the flexibility and speed of deployment72 gained by using per-application network services justify this approach, the security team didn't feel comfortable connecting the cloud networking infrastructure directly to the outside world.

Some application teams also expressed performance concerns. For example, the main government web site might generate more than 10 Gbps of traffic – too much for most VM-based load balancing or firewalling solutions.

72 Typical enterprise application deployment process is broken http://blog.ipspace.net/2013/11/typical-enterprise-application.html


In the end, the cloud architecture team proposed a hybrid solution displayed in Figure 9-4:
 A set of physical firewalls will perform the baseline traffic scrubbing;
 SSL termination will be implemented on a set of physical load balancers;
 Physical load balancers will perform load balancing for high-volume web sites, and pass the traffic to application-specific load balancers for all other web properties that need high-speed SSL termination services.

High-volume web sites might use a caching layer, in which case the physical load balancers send the incoming requests to a set of reverse proxy servers, which further distribute requests to web servers.

Figure 9-4: Hybrid architecture combining physical and virtual appliances


10 HIGH-SPEED MULTI-TENANT ISOLATION

IN THIS CHAPTER:
INTERACTION WITH THE PROVISIONING SYSTEM
COMMUNICATION PATTERNS
STATELESS OR STATEFUL TRAFFIC FILTERS?
PACKET FILTERS ON LAYER-3 SWITCHES
PACKET FILTERS ON X86-BASED APPLIANCES
INTEGRATION WITH LAYER-3 BACKBONE
TRAFFIC CONTROL APPLIANCE CONFIGURATION
CONCLUSIONS


The Customer is operating a large multi-tenant data center. Each tenant (application cluster, database cluster, or a third-party application stack) has a dedicated container connected to a shared layer-3 backbone. The layer-3 backbone enables connectivity between individual containers and between containers and the outside world (see Figure 10-1).

Figure 10-1: Containers and data center backbone


The connectivity between the layer-3 backbone and the outside world, and the security measures implemented toward the outside world (packet filters or firewalls, IPS/IDS systems), are outside the scope of this document.

Individual containers could be implemented with bare-metal servers, virtualized servers or even independent private clouds (for example, using OpenStack). Multiple logical containers can share the same physical infrastructure; in that case, each container uses an independent routing domain (VRF) for complete layer-3 separation.

The Customer wants to implement high-speed traffic control (traffic filtering and/or firewalling) between individual containers and the shared high-speed backbone. The solution should be redundant, support at least 10GE speeds, and be easy to manage and provision through a central provisioning system.

INTERACTION WITH THE PROVISIONING SYSTEM

The Customer has a central database of containers and managed servers, and uses Puppet to provision bare-metal and virtualized servers and application stacks (see Figure 10-2 for details). They want to use the information from the central database to generate the traffic control rules between individual containers and the layer-3 backbone, and they need a tool that automatically pushes the traffic control rules into the devices connecting containers to the layer-3 backbone whenever the information in the central database changes.


Figure 10-2: Interaction with the provisioning/orchestration system
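The core of that workflow could be as small as the following Python sketch. The inventory structure, the rule-rendering logic and the push mechanism (for example, writing a file that the next Puppet run distributes) are illustrative assumptions, not a description of the Customer's actual tooling:

# Illustrative sketch: regenerate a container's ACL whenever the relevant
# records in the central database change, then hand the result to the
# provisioning system (for example, as a file picked up by Puppet).
import hashlib

def render_acl(container):
    # container is a hypothetical record, e.g.
    # {"name": "web", "inbound_tcp": [{"dst": "10.1.1.0/24", "port": 443}]}
    lines = ["permit tcp any %s eq %d" % (r["dst"], r["port"])
             for r in container["inbound_tcp"]]
    lines.append("deny ip any any")
    return "\n".join(lines)

def push_if_changed(container, last_digest, push):
    """Call push() only when the rendered ACL content actually changed."""
    acl = render_acl(container)
    digest = hashlib.sha256(acl.encode()).hexdigest()
    if digest != last_digest:
        push(container["name"], acl)
    return digest

The hard part of a real solution is obviously not the rendering but the abstraction layer hiding vendor-specific configuration syntax, discussed later in this chapter.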


COMMUNICATION PATTERNS

All the communication between individual containers, and between containers and the outside world, falls into one of these categories:

• TCP sessions established from an outside client to a server within a container (example: web application sessions). Target servers are identified by their IP address (specified in the orchestration system database) or an IP prefix that covers a range of servers;

• TCP sessions established from one or more servers within a container to a well-known server in another container (example: database session between an application server and a database server). Source and target servers are identified by their IP addresses or IP prefixes;

• UDP sessions established from one or more servers within a container to a well-known server in another container (example: DNS and syslog). Source and target servers are identified by their IP addresses or IP prefixes.

All applications are identified by their well-known port numbers; traffic passing a container boundary does not use dynamic TCP or UDP ports73. Servers within containers do not establish TCP sessions with third-party servers outside the data center, and there is no need for UDP communication between clients within the data center and servers outside the data center.

73 Are Your Applications Cloud-Friendly?
http://blog.ipspace.net/2013/11/are-your-applications-cloud-friendly.html


STATELESS OR STATEFUL TRAFFIC FILTERS?

We can implement the desired traffic control between individual containers and the shared layer-3 backbone with stateful firewalls or stateless packet filters (access control lists – ACLs)74. The use of stateful firewalls isolating individual containers from the shared backbone is not mandated by regulatory requirements (example: PCI); the Customer can thus choose to implement the traffic control rules with stateless filters, assuming that the devices used to implement the traffic filters recognize traffic belonging to an established TCP session (TCP packets with the ACK or RST bit set).

74 I Don't Need no Stinking Firewall … or Do I?
http://blog.ipspace.net/2010/08/i-dont-need-no-stinking-firewall-or-do.html

Stateless packet filters cannot reassemble IP fragments or check the correctness of individual TCP sessions. You should use them only in environments with properly hardened and regularly updated server operating systems. You should also consider dropping all IP fragments at the container boundary.

The following table maps the traffic categories listed in the Communication Patterns section into typical ACL rules implementable on most layer-3 switches:


TCP sessions established from an outside client to an inside server
  Ingress ACL: permit tcp any dst-server-ip eq dst-port
  Egress ACL:  permit tcp dst-server-ip eq dst-port any established

TCP sessions established from inside clients to servers in another container
  Ingress ACL: permit tcp dst-server-ip eq dst-port src-server-ip established
  Egress ACL:  permit tcp src-server-ip dst-server-ip eq dst-port

TCP sessions established from clients in other containers
  Ingress ACL: permit tcp src-server-ip dst-server-ip eq dst-port
  Egress ACL:  permit tcp dst-server-ip eq dst-port src-server-ip established

UDP sessions between inside clients and servers in other containers
  Ingress ACL: permit udp dst-server-ip eq dst-port src-server-ip
  Egress ACL:  permit udp src-server-ip dst-server-ip eq dst-port

UDP sessions between clients in other containers and inside servers
  Ingress ACL: permit udp src-server-ip dst-server-ip eq dst-port
  Egress ACL:  permit udp dst-server-ip eq dst-port src-server-ip
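To make the templates more tangible, here's how a complete per-container filter might look for a hypothetical container whose web servers (10.1.1.0/24) accept HTTPS from the outside world and also talk to a database server (10.2.2.10, port 1433) in another container. All addresses and ports are made up for illustration, and the final deny entries reflect a default-deny policy the Customer may or may not adopt:

Ingress ACL (traffic toward the container, following the templates above):
  permit tcp any 10.1.1.0/24 eq 443
  permit tcp 10.2.2.10 eq 1433 10.1.1.0/24 established
  deny ip any any

Egress ACL (traffic toward the backbone):
  permit tcp 10.1.1.0/24 eq 443 any established
  permit tcp 10.1.1.0/24 10.2.2.10 eq 1433
  deny ip any any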


PACKET FILTERS ON LAYER-3 SWITCHES

The traffic control rules expressed as stateless ACLs are easy to implement on layer-3 switches connecting individual containers to the shared layer-3 backbone. One should, however, keep the following considerations in mind:

• TCAM size: typical data center top-of-rack (ToR) switches support a limited number of ACL entries75. A few thousand ACL entries is more than enough when the traffic control rules use IP prefixes to identify groups of servers; when an automated tool builds traffic control rules based on IP addresses of individual servers, the number of ACL entries tends to explode due to the Cartesian product76 of source and destination IP ranges (see the short calculation below). Object groups available in some products don't help, as they are usually expanded into a Cartesian product to speed up the packet lookup process.

• Multi-vendor environment: whenever the data center contains ToR switches from multiple vendors, the provisioning system that installs traffic control rules must implement an abstraction layer that maps traffic patterns into multiple vendor-specific configuration syntaxes.

• Configuration mechanisms: most switch vendors don't offer APIs that would be readily compatible with common server provisioning tools (example: Puppet). Juniper offers a Junos Puppet client, but its current version cannot manage access control lists77. Arista provides Puppet installation instructions for EOS78 but does not offer agent-side code that would provision ACLs.

75 Nexus 5500 has 1600 ingress and 2048 egress ACL entries
http://www.cisco.com/c/en/us/products/collateral/switches/nexus-5000-series-switches/data_sheet_c78618603.html
Arista 7050 supports 4K ingress ACL and 1K egress ACL entries
http://www.aristanetworks.com/media/system/pdf/Datasheets/7050QX-32_Datasheet.pdf
Arista 7150 supports up to 20K ACL entries
http://www.aristanetworks.com/media/system/pdf/Datasheets/7150S_Datasheet.pdf

76 See http://en.wikipedia.org/wiki/Cartesian_product
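A quick back-of-the-envelope calculation (with made-up numbers, not the Customer's real ones) shows how fast per-server rules exhaust a ToR switch TCAM compared to prefix-based rules:

# Hypothetical example: ACL entries produced when every rule is expanded into
# the Cartesian product of individual source and destination server addresses.
web_servers, app_servers, db_servers, ports_per_rule = 40, 30, 10, 2

per_server = (web_servers * app_servers + app_servers * db_servers) * ports_per_rule
per_prefix = (1 * 1 + 1 * 1) * ports_per_rule      # one prefix per server tier

print(per_server)   # 3000 entries - uncomfortably close to typical ToR limits
print(per_prefix)   # 4 entries

Whether 3000 entries fit depends on the platform (compare the ACL table sizes quoted in the data sheets above); adding a second application of similar size clearly would not.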

PACKET FILTERS ON X86-BASED APPLIANCES

The interface between individual containers and the layer-3 backbone can also be implemented with x86-based appliances using commercial or open-source traffic filtering tools as shown in Figure 10-3. This approach is more flexible and easier to provision, but also significantly slower than hardware-based packet filters.

77 netdev_stdlib Puppet Resource Types Overview
http://www.juniper.net/techpubs/en_US/junos-puppet1.0/topics/concept/automation-junos-puppetnetdev-module-overview.html

78 Installing Puppet on EOS
https://eos.aristanetworks.com/2013/09/installing-puppet-on-eos/


Figure 10-3: Traffic control appliances

Standard Linux traffic filters (implemented, for example, with iptables or flow entries in Open vSwitch) provide a few gigabits of throughput due to the overhead of kernel-based packet forwarding79. Solutions that rely on additional hardware capabilities of modern network interface cards (NICs) and poll-based user-mode forwarding easily achieve the 10Gbps throughput that satisfies the Customer's requirements. These solutions include:

79 Custom Stack: It Goes to 11
http://blog.erratasec.com/2013/02/custom-stack-it-goes-to-11.html

• PF_RING, an open-source kernel module that includes 10GE hardware packet filtering;

• Snabb Switch, an open-source Linux-based packet processing application that can be scripted with Lua to do custom packet filtering;

• 6WIND's TCP/IP stack;

• Intel's Data Plane Development Kit (DPDK) and DPDK-based Open vSwitch (OVS).

The solutions listed above are primarily frameworks, not ready-to-use traffic control products. Integration- and other professional services are available for most of them.

INTEGRATION WITH LAYER-3 BACKBONE

The x86-based appliances inserted between the layer-3 backbone and layer-3 switches in individual containers could behave as layer-3 devices (Figure 10-4) or as transparent (bump-in-the-wire) devices (Figure 10-5).


Figure 10-4: Layer-3 traffic control devices


Figure 10-5: Bump-in-the-wire traffic control devices

Inserting new layer-3 appliances between container layer-3 switches and backbone switches requires readdressing (due to additional subnets being introduced between existing adjacent layer-3 devices) and routing protocol redesign. Additionally, one would need robust routing protocol support on the x86-based appliances. It's thus much easier to insert the x86-based appliances in the forwarding path as transparent layer-2 devices.

Transparent appliances inserted in the forwarding path would not change the existing network addressing or routing protocol configurations. The existing layer-3 switches would continue to run the routing protocol across layer-2 devices (the traffic control rules would have to be adjusted to permit routing protocol updates), simultaneously checking the end-to-end availability of the forwarding path – a failure in the transparent traffic control device would also disrupt the routing protocol adjacency and cause layer-3 switches to shift traffic onto an alternate path as shown in Figure 10-6.

Figure 10-6: Routing protocol adjacencies across traffic control appliances


The transparent x86-based appliances used for traffic control purposes thus have to support the following data path functionality:

• VLAN-based interfaces to support logical containers that share the same physical infrastructure;

• Transparent (bridge-like) traffic forwarding between two physical or VLAN interfaces. All non-IP traffic should be forwarded transparently to support non-IP protocols (ARP) and any deployment model (including scenarios where STP BPDUs have to be exchanged between L3 switches);

• Ingress and egress IPv4 and IPv6 packet filters on physical or VLAN-based interfaces.

Ideally, the appliance would intercept LLDP packets sent by the switches and generate LLDP hello messages to indicate its presence in the forwarding path.

TRAFFIC CONTROL APPLIANCE CONFIGURATION

Linux-based x86 appliances could offer numerous configuration mechanisms, from SSH access to orchestration system agents (Puppet or Chef agents). OpenFlow-based software switching using Open vSwitch (or another OpenFlow-capable software switch) to filter and forward traffic would also allow appliance management with an OpenFlow controller. The Customer thus has at least the following appliance configuration and management options:

• Integration with the existing server orchestration system (example: Puppet);

• OpenFlow-based flow management.

Both approaches satisfy the technical requirements (assuming the Customer uses DPDK-based OVS to achieve 10Gbps+ performance); the Customer should thus select the best one based on the existing environment, familiarity with orchestration tools or OpenFlow controllers, and the amount of additional development that would be needed to integrate the selected data path solution with the desired orchestration/management system.

As of February 2014, no OpenFlow controller (commercial or open-source) includes a readily available application that would manage access control lists on independent transparent appliances; the Customer would thus have to develop a custom application on top of one of the OpenFlow controller development platforms (OpenDaylight, Floodlight, Cisco's ONE controller or HP's OpenFlow controller). A Puppet agent managing PF_RING packet filters or OVS flows through CLI commands is thus the easier option.

NEC's ProgrammableFlow could support the Customer's OVS deployment model but would require a heavily customized configuration (ProgrammableFlow is an end-to-end fabric-wide OpenFlow solution) running on a non-mainstream platform (OVS is not one of the common ProgrammableFlow switching elements).
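If the Customer goes down the OVS path, the per-appliance part of the job could be as small as the sketch below, which an orchestration agent might run after receiving new rules. The bridge name, the example flows and the use of the ovs-ofctl CLI are illustrative assumptions; a real implementation would also handle flow deletion and error reporting:

# Illustrative sketch: install stateless filtering rules as OpenFlow entries
# on a local Open vSwitch bridge through the ovs-ofctl CLI.
import subprocess

BRIDGE = "br-filter"   # hypothetical bridge connecting container- and backbone-facing ports

def add_flow(match, priority=100, actions="NORMAL"):
    flow = "priority=%d,%s,actions=%s" % (priority, match, actions)
    subprocess.check_call(["ovs-ofctl", "add-flow", BRIDGE, flow])

add_flow("arp", priority=50)                        # keep ARP working end-to-end
add_flow("tcp,nw_dst=10.1.1.0/24,tp_dst=443")       # inbound HTTPS to the web tier
add_flow("tcp,nw_src=10.1.1.0/24,tp_src=443")       # return traffic from the web tier
add_flow("ip", priority=10, actions="drop")         # drop all other IPv4 traffic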


CONCLUSIONS

The Customer SHOULD80 use packet filters (ACLs) and not stateful firewalls to isolate individual containers from the shared L3 backbone.

Existing layer-3 switches MAY be used to implement the packet filters needed to isolate individual containers, assuming the number of rules in the packet filters does not exceed the hardware capabilities of the layer-3 switches (number of ingress and egress ACL entries).

The Customer SHOULD consider x86-based appliances that would implement packet filters in software or NIC hardware. The appliances SHOULD NOT use Linux-kernel-based packet forwarding (user-mode poll-based forwarding results in significantly higher forwarding performance). The x86-based appliances SHOULD use the same configuration management tools that the Customer uses to manage other Linux servers. Alternatively, the Customer MAY consider an OpenFlow-based solution composed of software (x86-based) OpenFlow switches and a cluster of OpenFlow controllers.

80 RFC 2119: Key words for use in RFCs to Indicate Requirement Levels
http://tools.ietf.org/html/rfc2119


11 SCALE-OUT PRIVATE CLOUD INFRASTRUCTURE

IN THIS CHAPTER:
CLOUD INFRASTRUCTURE FAILURE DOMAINS
IMPACT OF SHARED MANAGEMENT OR ORCHESTRATION SYSTEMS
IMPACT OF LONG-DISTANCE VIRTUAL SUBNETS
IMPACT OF LAYER-2 CONNECTIVITY REQUIREMENTS
WORKLOAD MOBILITY CONSIDERATIONS
HOT AND COLD VM MOBILITY
DISASTER RECOVERY OR WORKLOAD MIGRATION?
WORKLOAD MOBILITY AND ORCHESTRATION SYSTEM BOUNDARIES
CONCLUSIONS


ACME Inc. is building a large fully redundant private infrastructure-as-a-service (IaaS) cloud using standardized single-rack building blocks. Each rack will have:

• Two ToR switches providing intra-rack connectivity and access to the corporate backbone;

• Dozens of high-end servers, each server capable of running between 50 and 100 virtual machines;

• Storage elements, either a storage array, server-based storage nodes, or distributed storage (example: VMware VSAN, Nutanix, Ceph…).

Figure 11-1: Standard cloud infrastructure rack


They plan to use several geographically dispersed data centers, with each data center having one or more standard infrastructure racks. Racks in smaller data centers (example: colocation) connect straight to the WAN backbone, racks in data centers co-resident with a significant user community connect to WAN edge routers, and racks in larger scale-out data centers connect to WAN edge routers or an internal data center backbone.

Figure 11-2: Planned WAN connectivity


The cloud infrastructure design should:

• Guarantee full redundancy;

• Minimize the failure domain size – a failure domain should not span more than a single infrastructure rack, making each rack an independent availability zone;

• Enable unlimited workload mobility.

This case study focuses on failure domain analysis and workload mobility challenges. Typical rack design is described in the Redundant Server-to-Network Connectivity case study, WAN connectivity aspects in the Redundant Data Center Internet Connectivity case study, and security aspects in High-Speed Multi-Tenant Isolation.

CLOUD INFRASTRUCTURE FAILURE DOMAINS

A typical cloud infrastructure has numerous components, including:

• Compute and storage elements;

• Physical and virtual network connectivity within the cloud infrastructure;

• Network connectivity with the outside world;

• Virtualization management system;

• Cloud orchestration system;

• Common network services (DHCP, DNS);

• Application-level services (example: authentication service, database service, backup service) and associated management and orchestration systems.


Figure 11-3: Cloud infrastructure components


Cloud infrastructure environments based on enterprise server virtualization products commonly use separate virtualization management systems (example: VMware vCenter, Microsoft System Center Virtual Machine Manager) and cloud orchestration systems (example: VMware vCloud Automation Center, Microsoft System Center Orchestrator). Single-purpose IaaS solutions (example: OpenStack, CloudStack on Xen/KVM) include the functionality typically provided by a virtualization management system in a cloud orchestration platform.

ACME Inc. wants each infrastructure rack to be an independent failure domain. Each infrastructure rack must therefore have totally independent infrastructure and should not rely on critical services, management or orchestration systems running in other racks.

A failure domain is the area of an infrastructure impacted when a key device or service experiences problems.

ACME Inc. should therefore (in an ideal scenario) deploy an independent virtualization management system (example: vCenter) and cloud orchestration system (example: vCloud Automation Center or CloudStack) in each rack. Operational and licensing considerations might dictate a compromise where multiple racks use a single virtualization or orchestration system.


IMPACT OF SHARED MANAGEMENT OR ORCHESTRATION SYSTEMS

A typical virtualization management system (example: vCenter) creates, starts and stops virtual machines, creates and deletes virtual disks, and manages virtual networking components (example: port groups). A typical cloud orchestration system (example: vCloud Automation Center, CloudStack, OpenStack) provides the multi-tenancy aspects of the IaaS service, higher-level abstractions (example: subnets and IP address management, network services, VM image catalog), and API, CLI and GUI access.

Both systems usually provide non-real-time management-plane functionality and do not interact with the cloud infrastructure data- or control plane. Failure of one of these systems thus represents a management-plane failure – the existing infrastructure continues to operate, but it's impossible to add, delete or modify its services (example: start/stop VMs).

Some hypervisor solutions (example: VMware High Availability cluster) provide control-plane functionality that can continue to operate, adapt to topology changes (example: server or network failure), and provide uninterrupted service (including VM moves and restarts) without the intervention of a virtualization- or cloud management system. Other solutions might rely on high availability algorithms implemented in an orchestration system – an orchestration system failure would thus impact the high-availability functionality, making the orchestration system a mission-critical component.


Recommendation: If ACME Inc. designers decide to implement a single instance of a cloud orchestration system81 spanning all racks in the same data center, they SHOULD82:

• Use a server high-availability solution that works independently of the cloud orchestration system;

• Implement automated cloud orchestration system failover procedures;

• Periodically test the proper operation of the cloud orchestration system failover.

81 A cloud orchestration system instance might be implemented as a cluster of multiple hosts running (potentially redundant) cloud orchestration system components.

82 RFC 2119, Key words for use in RFCs to Indicate Requirement Levels
http://tools.ietf.org/html/rfc2119


Figure 11-4: Single orchestration system used to manage multiple racks

Recommendation: ACME Inc. SHOULD NOT use a single critical management-, orchestration- or service instance across multiple data centers – a data center failure or Data Center Interconnect (DCI) link failure would render one or more dependent data centers inoperable.


IMPACT OF LONG-DISTANCE VIRTUAL SUBNETS

Long-distance virtual subnets (a physical or logical bridging domain spanning multiple availability zones) are sometimes used to implement workload mobility or alleviate the need for IP renumbering during workload migration or disaster recovery. Server- and virtualization administrators tend to prefer long-distance virtual subnets over other approaches due to their perceived simplicity, but the end-to-end intra-subnet bridging paradigm might introduce undesired coupling across availability zones or even data centers when implemented with traditional VLAN-based bridging technologies.

There are several technologies that reliably decouple the subnet-level failure domain from infrastructure availability zones83. Overlay virtual networking (transport of VM MAC frames across a routed IP infrastructure) is the most commonly used one in data center environments. EoMPLS, VPLS and EVPN provide similar failure isolation functionality in MPLS-based networks.

Summary: Long-distance virtual subnets in the ACME Inc. cloud infrastructure MUST use overlay virtual networks.

83 Decouple virtual networking from the physical world
http://blog.ipspace.net/2011/12/decouple-virtual-networking-from.html


IMPACT OF LAYER-2 CONNECTIVITY REQUIREMENTS

Some server clustering- and storage replication solutions require end-to-end layer-2 connectivity (some storage arrays require layer-2 connectivity even when using iSCSI-based replication). Providing layer-2 connectivity inside a single rack doesn't increase the failure domain size – the network within a rack is already a single failure domain. Extending a single VLAN across multiple racks makes all interconnected racks a single failure domain84.

Recommendation: ACME Inc. MUST use layer-3 connectivity between individual racks and the corporate backbone.

Using overlay virtual networks it's easy to provide end-to-end layer-2 connectivity between VMs without affecting the infrastructure failure domains. Unfortunately, one cannot use the same approach for disk replication or bare-metal servers. Virtualizing bare-metal servers in a one-VM-per-host setup solves the server clustering challenges; storage replication remains a sore spot.

Recommendation: ACME Inc. SHOULD NOT use storage replication products that require end-to-end layer-2 connectivity.

84 Layer-2 network is a single failure domain
http://blog.ipspace.net/2012/05/layer-2-network-is-single-failure.html


ACME Inc. COULD provide long-distance VLAN connectivity with Ethernet-over-MPLS (EoMPLS, VPLS, EVPN) or a hardware-based overlay virtual networking solution85. VXLAN Tunnel Endpoint (VTEP) functionality available in data center switches from multiple vendors (Arista 7150, Cisco Nexus 9300) can also be used to extend a single VLAN across its IP backbone, resulting in limited coupling across availability zones.

85 See also: Whose failure domain is it?
http://blog.ipspace.net/2014/03/whose-failure-domain-is-it.html


Figure 11-5: VLAN transport across IP infrastructure

Recommendation: Rate-limiting of VXLAN traffic and broadcast storm control MUST be used when using VXLAN (or any other similar technology) to extend a VLAN across multiple availability zones to limit the amount of damage a broadcast storm in one availability zone can cause in other availability zones.


WORKLOAD MOBILITY CONSIDERATIONS

The ACME Inc. cloud infrastructure should provide unlimited workload mobility. This goal is easy to achieve within a single availability zone: all servers are connected to the same ToR switches, giving line-rate connectivity to all other servers within the same availability zone. VM mobility within the availability zone is thus a given.

Live VM mobility across availability zones is harder to achieve and might not make much sense – the tight coupling between infrastructure elements usually required for live VM mobility often turns the components participating in the live VM mobility domain into a single failure domain.

HOT AND COLD VM MOBILITY

There are two different workload mobility mechanisms one can use in an IaaS cloud infrastructure: hot (or live) VM mobility, where a running VM is moved from one hypervisor host to another, and cold VM mobility, where a VM is shut down and its configuration is moved to another hypervisor, where the VM is restarted. Some virtualization vendors might offer a third option: warm VM mobility, where you pause a VM (saving its memory to a disk file) and resume its operation on another hypervisor.

TYPICAL USE CASES

Automatic resource schedulers (example: VMware DRS) use hot VM mobility to move running VMs between hypervisors in a cluster when trying to optimize host resource (CPU, RAM) utilization. It is also heavily used for maintenance purposes: for example, you might have to evacuate a rack of servers before shutting it down for maintenance or upgrade.

Cold VM mobility is used in almost every high-availability and disaster recovery solution. VMware High Availability (and similar solutions from other hypervisor vendors) restarts a VM on another cluster host after a server failure. VMware's SRM does something similar, but usually in a different data center. Cold VM mobility is also the only viable technology for VM migration between multiple cloud orchestration systems (for example, when migrating a VM from a private cloud into a public cloud).

HOT VM MOBILITY

VMware’s vMotion is probably the best-known example of hot VM mobility technology. vMotion copies memory pages of a running VM to another hypervisor, repeating the process for pages that have been modified while the memory was transferred. After most of the VM memory has been successfully transferred, vMotion freezes the VM on the source hypervisor, moves its state to the other hypervisor, and restarts it there.

A hot VM move must not disrupt the existing network connections and must thus preserve the following network-level state:

• The VM must retain the same IP address;

• The VM should have the same MAC address (otherwise we have to rely on hypervisor-generated gratuitous ARP to update ARP caches on other nodes in the same subnet; a short gratuitous ARP sketch follows this list);

• After the move, the VM must be able to reach the first-hop router and all other nodes in the same subnet using their existing MAC addresses (a hot VM move is invisible to the VM, so the VM IP stack doesn’t know it should purge its ARP cache).
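The gratuitous ARP mentioned above is a simple broadcast packet sent on behalf of the moved VM so that adjacent hosts and routers refresh their ARP caches. A minimal illustration using Scapy (an assumption for the sketch; this is not how any particular hypervisor implements it, and the interface and addresses are made up):

# Illustrative sketch: send a gratuitous ARP for a moved VM so adjacent hosts
# and routers update their ARP caches. Requires Scapy and root privileges.
from scapy.all import ARP, Ether, sendp

def send_gratuitous_arp(vm_ip, vm_mac, iface):
    pkt = Ether(src=vm_mac, dst="ff:ff:ff:ff:ff:ff") / ARP(
        op=2,                              # ARP reply ("is-at")
        hwsrc=vm_mac, psrc=vm_ip,
        hwdst="ff:ff:ff:ff:ff:ff", pdst=vm_ip)
    sendp(pkt, iface=iface, verbose=False)

send_gratuitous_arp("10.1.1.15", "00:50:56:00:11:22", "eth0")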

The only mechanisms we can use today to meet all these requirements are:

• Stretched layer-2 subnets, whether in a physical (VLAN) or virtual (VXLAN) form;

• Hypervisor switches with layer-3 capabilities, including Hyper-V 3.0 Network Virtualization and Juniper Contrail.

Recommendation: ACME Inc. should keep the hot VM mobility domain as small as possible.

COLD VM MOVE

In a cold VM move a VM is shut down and restarted on another hypervisor. The MAC address of the VM could change during the move, as could its IP address – as long as the application running in the VM uses DNS to advertise its availability.

Recommendation: The new cloud infrastructure built by ACME Inc. SHOULD NOT be used by poorly written applications that are overly reliant on static IP addresses86.

VMs that rely on static IP addressing might also have a manually configured IP address of the first-hop router. Networking- and virtualization vendors offer solutions that reduce the impact of that bad practice (first-hop localization, LISP…) while significantly increasing the overall network complexity.

86 Are your applications cloud-friendly?
http://blog.ipspace.net/2013/11/are-your-applications-cloud-friendly.html


Recommendation: Workloads deployed in ACME Inc. cloud infrastructure SHOULD NOT use static IP configuration. VM IP addresses and other related parameters (first-hop router, DNS server address) MUST be configured via DHCP or via cloud orchestration tools.

IMPACT OF STATIC IP ADDRESSING

A VM moved to a different location with a cold VM move still leaves residual traces of its presence in the original subnet: entries in ARP caches of adjacent hosts and routers. Routers implementing IP address mobility (example: Hyper-V or VMware NSX hypervisor hosts, LISP routers) are usually updated with the new forwarding information; adjacent hosts and VMs aren't. These hosts might try to reach the moved VM using its old MAC address, requiring end-to-end layer-2 connectivity between the old and new VM location.

Hyper-V Network Virtualization uses pure layer-3 switching in the hypervisor virtual switch. VM moves across availability zones are thus theoretically possible… as long as the availability zones use a shared orchestration system.

DISASTER RECOVERY OR WORKLOAD MIGRATION?

Disaster recovery is the most common workload mobility use case: workloads are restarted in a different availability zone (typically a different data center) after an availability zone failure. Contrary to popular belief propagated by some networking and virtualization vendors, disaster recovery does not require hot VM mobility and the associated long-distance virtual subnets – it's much simpler to recreate the virtual subnets and restart the workload in a different availability zone using a workload orchestration- or disaster recovery management tool (example: VMware Site Recovery Manager – SRM)87. This approach works well even for workloads that require static IP addressing within the application stack – internal subnets (port groups) using VLANs or VXLAN segments are recreated within the recovery data center prior to workload deployment.

Another popular scenario that requires hot VM mobility is disaster avoidance – live workload migration prior to a predicted disaster. Disaster avoidance between data centers is usually impractical due to bandwidth constraints88. While it might be used between availability zones within a single data center, that use case is best avoided due to the additional complexity and coupling introduced between availability zones.

Increased latency between application components and traffic trombones89,90 are additional challenges one must consider when migrating individual components of an application stack. It's usually simpler and less complex to move the whole application stack as a single unit.

87 Long-distance vMotion, stretched HA clusters and business needs
http://blog.ipspace.net/2013/01/long-distance-vmotion-stretched-ha.html

88 Long-distance vMotion for disaster avoidance? Do the math first!
http://blog.ipspace.net/2011/09/long-distance-vmotion-for-disaster.html

89 Traffic trombone – what is it and how you get them
http://blog.ipspace.net/2011/02/traffic-trombone-what-it-is-and-how-you.html

90 Long-distance vMotion and traffic trombones
http://blog.ipspace.net/2010/09/long-distance-vmotion-and-traffic.html


WORKLOAD MOBILITY AND ORCHESTRATION SYSTEM BOUNDARIES

Most virtualization- and cloud orchestration systems do not support workload mobility across multiple orchestration system instances. For example, VMware vSphere 5.5 (and prior releases) supports vMotion (hot VM mobility) between hosts that use the same virtual distributed switch (vDS) and are managed by the same vCenter server. Hot workload migration across availability zones can be implemented only when those zones use the same vCenter server and the same vDS, resulting in a single management-plane failure domain (and a single control-plane failure domain when using Cisco Nexus 1000V91).

vSphere 6.0 supports vMotion across distributed switches and even across multiple vCenters, making it possible to implement hot workload mobility across multiple vCenter domains within the same cloud orchestration system.

Recommendation: ACME Inc. can use inter-vCenter vMotion to implement hot workload mobility between those availability zones in the same data center that use a single instance of a cloud orchestration system.

Each data center within the ACME Inc. private cloud MUST use a separate instance of the cloud orchestration system to limit the size of the management-plane failure domain. ACME Inc. thus cannot use high availability or workload mobility implemented within a cloud orchestration system to move workloads between data centers.

91 What exactly is Nexus 1000V?
http://blog.ipspace.net/2011/06/what-exactly-is-nexus-1000v.html


Recommendation: ACME Inc. MUST implement workload migration or disaster recovery with dedicated workload orchestration tools that can move whole application stacks between cloud instances.

CONCLUSIONS

Infrastructure building blocks:

• ACME Inc. will build its cloud infrastructure with standard rack-size compute/storage/network elements;

• Each infrastructure rack will be an independent data- and control-plane failure domain (availability zone);

• Each infrastructure rack must have totally independent infrastructure and SHOULD NOT rely on critical services, management or orchestration systems running in other racks.

Network connectivity:

• Each rack will provide internal L2- and L3 connectivity;

• L3 connectivity (IP) will be used between racks and the ACME Inc. backbone;

• Virtual subnets spanning multiple racks will be implemented with overlay virtual networks implemented within hypervisor hosts;

• VLANs spanning multiple racks will be implemented with VXLAN-based transport across the ACME Inc. backbone;

• VXLAN-encapsulated VLAN traffic across availability zones MUST be rate-limited to prevent broadcast storm propagation across multiple availability zones.


High availability and workload mobility:

• Each rack will implement a high-availability virtualization cluster that operates even when the cloud orchestration system fails;

• Hot VM mobility (example: vMotion) will be used within each rack;

• Hot VM mobility MIGHT be used across racks in the same data center, assuming ACME Inc. decides to use a single cloud orchestration system instance per data center;

• Workload mobility between data centers will be implemented with a dedicated workload orchestration- or disaster recovery tool.

Orchestration and management systems:

• A single management- or orchestration system instance will control a single rack or at most one data center to reduce the size of the management-plane failure domain;

• Management- and orchestration systems controlling multiple availability zones will have automated failover/recovery procedures that will be thoroughly tested at regular intervals;

• ACME Inc. SHOULD NOT use a single critical management-, orchestration- or service instance across multiple data centers.


12 SIMPLIFY WORKLOAD MIGRATION WITH VIRTUAL APPLIANCES

IN THIS CHAPTER:
EXISTING APPLICATION WORKLOADS
INFRASTRUCTURE CHALLENGES
INCREASE WORKLOAD MOBILITY WITH VIRTUAL APPLIANCES
BUILDING A NEXT GENERATION INFRASTRUCTURE
BENEFITS OF NEXT-GENERATION TECHNOLOGIES
TECHNICAL DRAWBACKS OF NEXT-GENERATION TECHNOLOGIES
ORGANIZATIONAL DRAWBACKS
PHASED ONBOARDING
ORCHESTRATION CHALLENGES
USING IP ROUTING PROTOCOLS IN WORKLOAD MIGRATION


CONCLUSIONS

ACME Inc. is building a private cloud and a disaster recovery site that will eventually serve as a second active data center. They want to simplify disaster recovery procedures, and have the ability to seamlessly move workloads between the two sites after the second site becomes an active data center.

ACME's cloud infrastructure design team is trying to find a solution that would allow them to move quiescent workloads between sites with a minimum amount of manual intervention. They considered VMware's SRM but found it lacking in the area of network services automation.

EXISTING APPLICATION WORKLOADS

ACME plans to deploy typical multi-tier enterprise application stacks in its private cloud. Most workloads use load balancing techniques to spread the load across multiple redundant servers in a tier. Some workloads use application-level load balancing mechanisms within the application stack (Figure 12-1) and require dedicated load balancing functionality only between the clients and the first tier of servers; others require a layer of load balancing between adjacent tiers (Figure 12-2).


Figure 12-1: Some applications use application-level load balancing solutions

Figure 12-2: Typical workload architecture with network services embedded in the application stack


Some applications are truly self-contained, but most of them rely on some external services, ranging from DNS and Active Directory authentication to central database access (a typical central database example is shown in Figure 12-3).

Figure 12-3: Most applications use external services

Load balancing and firewalling between application tiers are currently implemented with a central pair of load balancers and firewalls, with all application-to-client and server-to-server traffic passing through the physical appliances (a non-redundant setup is displayed in Figure 12-4).


Figure 12-4: Application tiers are connected through central physical appliances

INFRASTRUCTURE CHALLENGES

The current data center infrastructure supporting badly-written enterprise applications generated a number of problems92 that have to be avoided in the upcoming private cloud design:

• Physical appliances are a significant chokepoint and would have to be replaced with a larger model or an alternate solution in the future private cloud;

• Current physical appliances support a limited number of virtual contexts. The existing workloads are thus deployed on shared VLANs (example: web servers of all applications reside in the same VLAN), significantly reducing the data center security – an intruder could easily move laterally between application stacks after breaking into the weakest server93;

• Reliance on physical appliances makes disaster recovery more expensive: an equivalent appliance pair must always be present at the disaster recovery site;

• Workload migration between sites requires continuous synchronization of appliance configurations, a challenge that ACME “solved” with extremely brittle long-distance appliance clusters94.

92 Sooner or later someone will pay for the complexity of the kludges you use
http://blog.ipspace.net/2013/09/sooner-or-later-someone-will-pay-for.html

INCREASE WORKLOAD MOBILITY WITH VIRTUAL APPLIANCES

Virtual L4-7 appliances (firewalls and load balancers in VM format deployable in standard hypervisor/cloud environments) seem to be a perfect solution for ACME's workload mobility requirements. Apart from being easier to deploy, virtual appliances offer numerous additional benefits when compared to traditional physical appliances.

Flexibility. Virtual appliances can be deployed on demand; the only required underlying physical infrastructure is compute capacity (assuming one has the permanent or temporary licenses needed to deploy new appliance instances).

93 Compromised security zone = game over
http://blog.ipspace.net/2013/04/compromised-security-zone-game-over-or.html

94 Distributed firewalls: how badly do you want to fail?
http://blog.ipspace.net/2011/04/distributed-firewalls-how-badly-do-you.html


This ease of deployment makes it perfectly feasible to create a copy of a virtual appliance for every application stack, removing the need for complex firewall/load balancing rules and virtual contexts95.

Appliance mobility. A virtual appliance is treated like any other virtual machine by server virtualization and/or cloud orchestration tools. It's as easy (or hard96) to move virtual appliances as it is to move the associated application workload between availability zones, data centers, or even private and public clouds.

Transport network independence. Physical appliances have a limited number of network interfaces, and typically use VLANs to create the additional virtual interfaces needed to support multi-tenant contexts. Most physical appliances don't support any other virtual networking technology but VLANs, the exception being F5 BIG-IP, which supports IP multicast-based VXLAN.97 Virtual appliances run on top of a hypervisor virtual switch and connect to whatever virtual networking technology is offered by the underlying hypervisor with one or more virtual Ethernet adapters, as shown in Figure 12-5.

95 Make every application an independent tenant
http://blog.ipspace.net/2013/11/make-every-application-independent.html

96 Long distance vMotion for disaster avoidance? Do the math first!
http://blog.ipspace.net/2011/09/long-distance-vmotion-for-disaster.html

97 VXLAN termination on physical devices
http://blog.ipspace.net/2011/10/vxlan-termination-on-physical-devices.html


Figure 12-5: Virtual appliance NIC connected to overlay virtual network

The number of virtual Ethernet interfaces supported by a virtual appliance is often dictated by hypervisor limitations. For example, vSphere supports up to 10 virtual interfaces per VM98; KVM has much higher limits.

Configuration management and mobility. Virtual appliances are treated like any other virtual server. Their configuration is stored on their virtual disk, and when a disaster recovery solution replicates virtual disk data to an alternate location, the appliance configuration becomes automatically available for immediate use at that location – all you need to do after a primary data center failure is to restart the application workloads and associated virtual appliances at the alternate location99.

98 vSphere Configuration Maximums
http://www.vmware.com/pdf/vsphere5/r55/vsphere-55-configuration-maximums.pdf

99 Simplify your disaster recovery with virtual appliances
http://blog.ipspace.net/2013/05/simplify-your-disaster-recovery-with.html


The drawbacks of virtual appliances are also easy to identify:

• Reduced performance. Typical virtual appliances can handle a few Gbps of L4-7 traffic, and a few thousand SSL transactions per second100;

• Appliance sprawl. Ease of deployment usually results in numerous instances of virtual appliances, triggering the need for a completely different approach to configuration management, monitoring and auditing (the situation is no different from the one we experienced when server virtualization became widely used);

• Shift in responsibilities. It's impossible to configure and manage per-application-stack virtual appliances using the same methods, tools and processes as a pair of physical appliances;

• Licensing challenges. Some vendors try to license virtual appliances using the same per-box model they used in the physical world.

BUILDING A NEXT GENERATION INFRASTRUCTURE

ACME's cloud architects decided to use virtual appliances in their new private cloud to increase workload mobility by reducing application dependence on shared physical appliances (see the Combine Physical and Virtual Appliances in a Private Cloud chapter for additional details). The move to virtual appliances enabled them to consider overlay virtual networks – it's trivial to deploy virtual appliances on top of overlay virtual networks.

Finally, they decided to increase the application security by containerizing individual workloads and using VM NIC filters (aka microsegmentation) instead of appliance-based firewalls wherever possible

100 Virtual appliance performance is becoming a non-issue
http://blog.ipspace.net/2013/04/virtual-appliance-performance-is.html


(see the Replacing the Central Firewall and High-Speed Multi-Tenant Isolation chapters for more details).

BENEFITS OF NEXT-GENERATION TECHNOLOGIES

ACME architects plan to gain the following benefits from the new technologies introduced in the planned private cloud architecture:

• VM NIC firewalls will increase the packet filtering performance – the central firewalls will no longer be a chokepoint;

• Virtual appliances will reduce ACME's dependence on hardware appliances and increase the overall network services (particularly load balancing) performance with a scale-out appliance architecture;

• Overlay virtual networks will ease the deployment of the large number of virtual network segments that will be required to containerize the application workloads.

TECHNICAL DRAWBACKS OF NEXT-GENERATION TECHNOLOGIES

ACME architects have also quickly identified potential drawbacks of the planned new technologies:

• Most VM NIC firewalls don't offer the same level of security as their more traditional counterparts – most of them offer “stateful” packet filtering capabilities that are similar to reflexive ACLs101;

• In-kernel VM NIC firewalls rarely offer application-level gateways (ALG) or layer-7 payload inspection (deep packet inspection – DPI);

101 The spectrum of firewall statefulness
http://blog.ipspace.net/2013/03/the-spectrum-of-firewall-statefulness.html


• Virtual appliances (load balancers and VM-based firewalls) rarely offer more than a few Gbps of throughput. High-bandwidth applications might still have to use traditional physical appliances;

• Overlay virtual networks need software or hardware gateways to connect to the physical subnets. Self-contained applications that use a network services appliance to connect to the outside world could use virtual appliances as overlay-to-physical gateways; applications that rely on information provided by physical servers102 might experience performance problems that would have to be solved with dedicated gateways.

ORGANIZATIONAL DRAWBACKS The technical drawbacks identified by the ACME architects are insignificant compared to organizational and process changes that the new technologies require103,104: 

Move from traditional firewalls to VM NIC firewalls requires a total re-architecture of the application’s network security, including potential adjustments in security policies due to lack of deep packet inspection between application tiers105;

102 Connecting legacy servers to overlay virtual networks – http://blog.ipspace.net/2014/05/connecting-legacy-servers-to-overlay.html

103 Typical enterprise application deployment process is broken – http://blog.ipspace.net/2013/11/typical-enterprise-application.html

104 They want networking to be utility? Let’s do it! – http://blog.ipspace.net/2013/04/they-want-networking-to-be-utility-lets.html

105 Are you ready to change your security paradigm? – http://blog.ipspace.net/2013/04/are-you-ready-to-change-your-security.html




•	Virtual appliances increase workload agility only when they’re moved with the workload. The existing centralized appliance architecture has to be replaced with a per-application appliance architecture106;



•	The increased number of virtual appliances will require a different approach to appliance deployment, configuration and monitoring107.

PHASED ONBOARDING

Faced with all the potential drawbacks, ACME’s IT management team decided to implement a gradual onboarding of application workloads. New applications will be developed on the private cloud infrastructure and will incorporate the new technologies and concepts in the application design, development, testing and deployment phases. Moving an existing application stack to the new private cloud will always include security and network services reengineering:

•	Load balancing rules (or contexts) from existing physical appliances will be migrated to per-application virtual appliances;



•	Intra-application firewall rules will be replaced by equivalent rules implemented with VM NIC firewalls wherever possible;

106 Make every application an independent tenant – http://blog.ipspace.net/2013/11/make-every-application-independent.html

107 OMG, who will manage all those virtual firewalls? – http://blog.ipspace.net/2013/12/omg-who-will-manage-all-those-virtual.html




•	Outside-facing application-specific firewall rules will be moved to per-application virtual appliances. They could be implemented on the load balancing appliance or in a separate per-application firewall appliance.

ORCHESTRATION CHALLENGES

Virtual appliances and overlay virtual networks simplify workload mobility, but they don’t solve the workload migration problem on their own. Moving a complex application workload between instances of cloud orchestration systems (sometimes even across availability zones) requires numerous orchestration steps before it’s safe to restart the application workload:

•	Virtual machine definitions and virtual disks have to be imported into the target environment (assuming the data is already present on-site due to storage replication or backup procedures);



•	Internal virtual networks (port groups) used by the application stack have to be recreated in the target environment;



•	Outside interface(s) of virtual appliances have to be connected to the external networks in the target environment;



•	Virtual appliances have to be restarted;



•	Configuration of the virtual appliance outside interfaces might have to be adjusted to reflect the different IP addressing scheme used in the target environment. IP readdressing might trigger additional changes in DNS108;

108 IP renumbering in disaster avoidance data center designs – http://blog.ipspace.net/2012/01/ip-renumbering-in-disaster-avoidance.html




•	Connectivity to outside services (databases, AD, backup servers) used by the application stack has to be adjusted. Well-behaved applications109 would use DNS and adapt to the changed environment automatically110; poorly written applications might require additional configuration changes (example: NAT) in the virtual appliance(s) connecting them to the external networks – see the configuration sketch below.
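
To illustrate the NAT workaround mentioned in the last step, the virtual appliance (or an embedded virtual router) could translate the old, hard-coded IP address of an external service into its address in the target environment. The following Cisco IOS-style sketch uses invented values – the application keeps talking to the database at 10.1.1.10 while the database actually resides at 10.2.1.10 in the target data center:

interface GigabitEthernet1
 description Application-facing (inside) interface
 ip nat inside
interface GigabitEthernet2
 description External (outside) interface
 ip nat outside
!
! Packets sent by the application to 10.1.1.10 are forwarded to 10.2.1.10;
! the add-route keyword installs a host route for the outside local address
ip nat outside source static 10.2.1.10 10.1.1.10 add-route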

Some orchestration systems (example: vCloud Director) allow users to create application containers that include enough information to recreate virtual machines and virtual networks in a different cloud instance, but even those environments usually require some custom code to connect migrated workloads to external services.

Cloud architects sometimes decide to bypass the limitations of cloud orchestration systems (example: lack of IP readdressing capabilities) by deploying stretched layer-2 subnets, effectively turning multiple cloud instances into a single failure domain111. See the Scale-Out Private Cloud Infrastructure chapter for more details.

109 There is no paradigm shift, good applications were always network-aware – http://blog.ipspace.net/2014/07/there-is-no-paradigm-shift-good.html

110 Are your applications cloud-friendly? – http://blog.ipspace.net/2013/11/are-your-applications-cloud-friendly.html

111 Layer-2 network is a single failure domain – http://blog.ipspace.net/2012/05/layer-2-network-is-single-failure.html


USING IP ROUTING PROTOCOLS IN WORKLOAD MIGRATION

Sometimes it’s possible to circumvent the orchestration system limitations with routing protocols: virtual routers embedded in the application stack advertise application-specific IP address ranges to the physical network devices (see Figure 12-6 for a setup with a dedicated virtual router).

Figure 12-6: Virtual router advertises application-specific IP prefix via BGP
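
The virtual router configuration could be as simple as the following Cisco IOS-style sketch. All values (AS numbers, neighbor address, application prefix) are hypothetical and would be supplied by the cloud orchestration system in a real deployment:

! Application subnet 192.168.10.0/24 is directly connected to the virtual router
router bgp 65010
 ! eBGP session with the first-hop layer-3 switch in the physical network
 neighbor 10.0.0.1 remote-as 65000
 neighbor 10.0.0.1 password BGPsecret
 ! Advertise only the application-specific prefix
 network 192.168.10.0 mask 255.255.255.0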


Some virtual appliances already support routing protocols (example: VMware NSX Edge Service Router, Cisco ASAv). It’s trivial to deploy routing functionality on virtual appliances implemented as Linux services. Finally, one could always run a virtual router in a dedicated virtual machine (example: Cisco CSR 1000v, Vyatta virtual router).

Virtual routers could establish routing protocol adjacency (preferably using BGP112) with the first-hop layer-3 switches in the physical cloud infrastructure (ToR or core switches, depending on the data center design). One could use BGP peer templates on the physical switches, allowing them to accept BGP connections from a range of directly connected IP addresses (the outside IP addresses assigned to virtual routers via DHCP), and use MD5 authentication to provide some baseline security.

A central BGP route server would be an even better solution. The route server could be dynamically configured by the cloud orchestration system to perform virtual router authentication and route filtering. Finally, you could assign the same loopback IP address to route servers in all data centers (or availability zones), making it easier for the edge virtual router to find its BGP neighbor.
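
A minimal sketch of the switch-side configuration, assuming a Cisco IOS-based layer-3 switch that supports dynamic BGP neighbors (all AS numbers, address ranges, and names are hypothetical and not taken from ACME’s design):

router bgp 65000
 ! Accept BGP sessions from any virtual router in the directly connected subnet
 bgp listen range 10.0.0.0/24 peer-group VROUTERS
 bgp listen limit 100
 neighbor VROUTERS peer-group
 neighbor VROUTERS remote-as 65010
 ! Baseline security: shared MD5 password
 neighbor VROUTERS password BGPsecret
 ! Accept only a few small application-specific prefixes from each virtual router
 neighbor VROUTERS route-map FROM-VROUTERS in
 neighbor VROUTERS maximum-prefix 10
!
ip prefix-list APP-PREFIXES permit 192.168.0.0/16 ge 24 le 29
route-map FROM-VROUTERS permit 10
 match ip address prefix-list APP-PREFIXES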

112 Virtual appliance routing – network engineer’s survival guide – http://blog.ipspace.net/2013/08/virtual-appliance-routing-network.html


CONCLUSIONS

ACME Inc. can increase application workload mobility and overall network services performance with judicious use of modern technologies like virtual appliances, overlay virtual networks, and VM NIC firewalls, but it won’t realize most of the benefits of these technologies unless it also introduces significant changes in its application development and deployment processes113.

Workload mobility between different availability zones of the same cloud orchestration system is easy to achieve, as most cloud orchestration systems automatically create all underlying objects (example: virtual networks) across availability zones as required. It’s also possible to solve the orchestration challenge of a disaster recovery solution by restarting the cloud orchestration system at a backup location (which would result in automatic recreation of all managed objects, including virtual machines and virtual networks).

Workload mobility across multiple cloud orchestration instances, or between private and public clouds, requires extensive orchestration support, either available in the cloud orchestration system or implemented with an add-on orchestration tool.

113 Reduce costs and gain efficiencies with SDDC – http://blog.ipspace.net/2014/08/interview-reduce-costs-and-gain.html
