System Issues in Cloud Computing Introduction to Cloud Computing KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Scale of Cloud Computing • Cloud data centers host upwards of 50K servers • Amazon: – Amazon data centers house between 50,000 and 80,000 servers, with a power consumption of between 25 and 30 megawatts. – 2015: 1.5 – 5.6 million servers
Google Data Center
Source: Jeff Dean’s slide deck from History day at SOSP 2015
Microsoft Azure Global Presence
26 regions available today
(circa July 2016)
8 more announced
Where did Cloud computing start?
… Saint Google says! …
Where did Cloud computing start? Just kidding! Some interesting tidbits:
• Clouds Project @ Georgia Tech, 1986-1993 (NSF Coordinated Experimental Research – CER – Project) (PIs: Dasgupta, LeBlanc, Ahamad, Ramachandran)
• Primary student on the project – Yousef Khalidi (PhD 1989)
• Now (Aug 2016): Dr. Yousef Khalidi, Corporate Vice President, Microsoft Azure Networking Technologies!!
Where did Cloud computing start?
• Distributed systems research in the 80’s and 90’s – Clouds (GT), Eden (U-Wash), Charlotte (UW), CoW (UW), NoW (UCB), PVM (Emory), MPI (IBM), …
• Grid computing in the 90’s
• NSF HPC data centers (mid 90’s)
• Resurrection of virtualization (late 90’s) – originally pioneered by IBM (VM 360 and 370 series) in the 60’s
• Virtualization technologies (SimOS, Disco, Xen) leading to companies such as VMware (early 2000’s)
• Shrinking margins on selling “boxes” (mid 00’s)
• “Services computing” model pioneered by IBM
Cloud Computing: Computing as a utility
What is Cloud Computing? • Amazon: "Cloud Computing, by definition, refers to the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing.” (https://aws.amazon.com/what-is-cloud-computing/)
• IBM: “Cloud computing is the delivery of on-demand computing resources—everything from applications to data centers—over the Internet on a pay-for-use basis.” (https://www.ibm.com/cloud-computing/what-is-cloudcomputing)
What is Cloud Computing?
• Computational resources (CPUs, memory, storage) in data centers available as “utilities” via the Internet
• Illusion of infinite computational capacity
• Ability to elastically increase/decrease resources based on need
• “Pay as you go” model based on resource usage
• Applications delivered as “services” over the Internet
Why Cloud Computing?
• No CapEx or OpEx for owning/maintaining computational resources
• Elasticity: being able to shrink/expand resources based on need
• Maintenance/upgrades are someone else’s problem
• Availability: no down time for the resources or services
• “Pay as you go” model
• Business services can be “outsourced” – Concentrate on core competency and let the IBMs, Amazons, and Microsofts of the world deal with the IT services
• Disaster recovery of assets due to geographic replication of data
System Issues in Cloud Computing Types of Clouds & Service Models KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Types of clouds
Public
– Resources are shared among users
– Users given the illusion that the resources are “theirs”
– Virtualization at all levels, including the network traffic, guarantees perfect isolation at
  • Resource level and performance level
– E.g.: Amazon EC2
Private
– Resources are physically dedicated to the individual user
– Often the service provider may have the data center on user’s premises
– E.g.: VMware offers such services as part of their business model
Hybrid
– Combines the two (private and public)
– Keep sensitive business logic and mission-critical data in private cloud
– Keep more mundane services (trend analysis, test and development, business projections, etc.) in public cloud
– “Cloud bursting”: private cloud connects to the public cloud when demand exceeds a threshold
Cloud Service Models
• Infrastructure as a Service (IaaS)
  – Service provider offers to rent resources (CPUs, memory, network bandwidth, storage)
    • E.g.: Amazon EC2
  – Use them as you would use your own cluster in your basement
• Platform as a Service (PaaS)
  – In addition to renting resources, service provider offers APIs for programming the resources and developing applications that run on these resources
    • E.g.: Microsoft Azure
  – Reduces the pain point for the Cloud developer in developing, performance-tuning, and scaling large-scale cloud apps
• Software as a Service (SaaS)
  – Service provider offers services to increase end-user productivity
    • E.g.: gmail, dropbox, YouTube, games
  – User does not see physical resources in the Cloud
Marketplace
IaaS Market Share 1H 2015
• Amazon 27%
• Microsoft 16%
• IBM 12%
• Google 4%
• Oracle 3%
• Rackspace 2%
• Other 36%
Source: Wikibon 2015
Marketplace
PaaS Market Share 1H 2015
• Salesforce 24%
• Amazon 17%
• Microsoft 10%
• IBM 3%
• ServiceNow 3%
• Google 2%
• Oracle 2%
• Netsuite 2%
• Other 37%
Source: Wikibon 2015
Marketplace (overall)
Cloud Infrastructure Spending Forecast
Public cloud Infrastructure as a Service (IaaS) hardware and software spending from 2015 to 2016, by segment (in billion U.S. dollars)
Source: http://www.forbes.com/sites/louiscolumbus/2016/03/13/roundup-of-cloud-computing-forecasts-and-market-estimates-2016/#1f3614f274b0
System Issues in Cloud Computing Security Issues & Challenges KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Challenges with Cloud computing
Dark Clouds
• Data security • Lock-In with a service provider • Network latency to the provider • Network bandwidth to the provider • Dependence on reliable Internet connectivity
Security Issues
• Data breaches
• Compromised credentials and broken authentication
• Hacked interfaces and APIs
• Exploited system vulnerabilities
• Account hijacking
• Malicious insiders
• Parasitic computing
• Permanent data loss
• Inadequate diligence
• Cloud service abuses
• DoS attacks
• Shared technology, shared dangers
Challenges for Cloud Adopters
Current Issues being Tackled in Cloud Computing
• Mobile Computing
• Architecture and Virtualization
• IoT and Mobile on the Cloud
• Security and Privacy
• Distributed Cloud/Edge Computing
• Big Data
• HPC
• Networking (SDN and NFV)
Complex Issues in Cloud Computing
• Optimal scheduling and resource management
• Communication isolation, NFV, and SDN
• HPC with loosely coupled networks
• Real-Time Computations
• Energy
• Incorporating heterogeneous resources (GPUs and other accelerators)
• Human in the loop and integration with devices
• Improving resource utilization
Energy consumption
Greenpeace Analysis of AWS Availability Zones: Backup Generator Permits

Location | AWS Availability Zone(s) | Permitting Entity | # of Data Centers | Greenpeace estimate of data center power capacity
Bay Area, CA | US West (N. CA) | A-100 | 2 | 36 MW
Dublin, Ireland | EU West | Amazon Data Services Ireland Ltd | 3 | 65 MW
Northern Virginia | US East | Vadata | 23 | 500 MW
Oregon | US West (Oregon) & GovCloud | Vadata | 3 | 65 MW

Typically < 100,000 servers per data center
Source: http://www.greenpeace.org/usa/greenpeace-investigation-aws/
System Issues in Cloud Computing Future of Cloud Computing KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
What is next in Cloud Computing?
(Close future)
• Energy efficient computing – New network hardware • Software-Defined Hardware – Software switch
• FPGA-based NIC’s • Improved optical amplification
SmartNIC Source: Microsoft Azure
• “Big data” as a service • Rethink security policies -> Better identity management • Edge computing support
What is Current Cloud computing good for? Throughput oriented apps – Search, Mail, Reservations, Banking, E-commerce,
Increasingly for streaming videos using CDNs or proprietary networks such as Netflix – > 90% of Internet traffic is video
Interactive apps (human in the loop)
Limitations of Existing Cloud
(PaaS)
• Based on large data centers – High latency / poor bandwidth for data-intensive apps
• API designed for traditional web applications – Not suitable for the future Internet apps
Why • IoT and many new apps need interactive response at computational perception speeds – Sense -> process -> actuate
• Sensors are geo-distributed – Latency to the cloud is a limitation – Besides, uninteresting sensor streams should be quenched at the source
A Broad Set of IoT Applications
• Energy Saving (I2E)
• Defense
• Predictive maintenance
• Industrial Automation
• Intelligent Buildings
• Enable New Knowledge
• Enhance Safety & Security
• Transportation and Connected Vehicles
• Healthcare
• Agriculture
• Smart Home
• Smart Grid
Source: Thanks to CISCO for the pictures and graphics in this slide
Future Internet Applications on IoT • Common Characteristics – Dealing with real-world data streams – Real-time interaction among mobile devices – Wide-area analytics
• Requirements – Dynamic scalability – Low-latency communication – Efficient in-network processing
Concluding Comments & Food for Thought
• Future Cloud
  – Encompassing geo-distributed edge computing in the context of IoT
  – Distributed programming models and runtime systems
  – Geo-distributed resource allocation
  – Static and dynamic analyses of apps expressed as dataflow graphs with temporal constraints
  – Security and privacy issues for IoT
  – System architecture of edge computing nodes
  – Front-haul networks combining fiber and wifi
  – Deployments and field study (camera networks for campus security)
  – More issues???
System Issues in Cloud Computing Network Virtualization: Basics KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
What is a Data Center? Data Center
Google Data Center
– Provides reliable and scalable computing infrastructure for massive Internet services – Physically located in various (often remote) geographical areas optimizing for energy consumption and availability of economic power
Source: Jeff Dean’s slide deck from History day at SOSP 2015
Data Center Networks
• Network is one of the main components of the computational resources in a data center
• Connects all the servers inside a data center and provides connectivity to the Internet outside
Data Center Networks Design Considerations
• Data centers host on the order of 100’s of thousands of servers
• A single app (e.g., gmail) may run on 1000’s of servers
• Servers have to communicate with one another
  – Need high throughput, low latency communication among the servers
• Data center is multi-tenant
  – Multiple independent apps running on the computational resources
• Need uniform network capacity for supporting all the apps
  – Server-server communication should be bound only by the interface speeds, independent of which servers are hosting a particular app
• Need performance isolation for the apps
  – Traffic of one service should not affect another
• Need layer 2 semantics, as though the servers were on a LAN so far as each app is concerned
Data Center Networks Design Choices
• Given the scale, data center networks have to be cost effective while meeting the design considerations
• Should data center networks use commodity or special switching fabric?
• What should be the network topology to meet the design considerations and be cost effective?
  – Need a crossbar interconnection network but that would not be cost effective
• Same dilemma while building telephone switches back in the early days of telephony
  – Led to Clos Network as the scalable interconnection network for meeting bandwidth requirements for end devices with commodity switches
• Data center networks have taken the same approach
Clos Network
What was good in the 50’s for telephony is good today for data centers!
Advantage of Clos network
• Use small sized commodity switches to connect a large number of inputs to a large number of outputs
  – Exactly 1 connection between each ingress switch and each middle-stage switch
  – Exactly 1 connection between each middle-stage switch and each egress switch
• Constant latency between any two hosts
• If k >= n then Clos can be (rearrangeably) non-blocking like a crossbar switch
• Redundant paths between any 2 hosts
Recreated from source: http://web.stanford.edu/class/ee384y/Handouts/clos_networks.pdf
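The non-blocking condition can be checked with a few lines of code. The sketch below assumes the standard three-stage Clos notation, which the slide's lost figure may label differently: n inputs per ingress switch, m middle-stage switches (taken here to be the slide's k), and r ingress/egress switches. It uses the classical results that m >= n gives a rearrangeably non-blocking network and m >= 2n - 1 a strict-sense non-blocking one.

```python
def clos_blocking_class(m: int, n: int) -> str:
    """Classify a three-stage Clos network.

    n: inputs per ingress switch
    m: number of middle-stage switches (assumed to be the slide's k)
    """
    if m >= 2 * n - 1:
        return "strict-sense non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"


def clos_crosspoints(m: int, n: int, r: int) -> int:
    """Crosspoints in a symmetric three-stage Clos network with r edge switches.

    Ingress: r switches of size n x m, middle: m switches of size r x r,
    egress: r switches of size m x n.
    """
    return 2 * r * n * m + m * r * r


def crossbar_crosspoints(ports: int) -> int:
    """Crosspoints in a single ports x ports crossbar."""
    return ports * ports
```

For 16 inputs built from n = 4, r = 4, m = 4, the Clos fabric needs fewer crosspoints than a single 16 x 16 crossbar, which is the cost argument made on the slide.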
Data Center Networks Design
• Adaptation of Clos network
• Use commodity Ethernet switches and routers
• Two- or three-level trees of switches or routers
• A three-tiered design
  – Core layer at the root of the tree
    • Connects to the Internet via Layer 3 routers
  – Aggregation layer in the middle
    • Transition from Layer 2 switched access layer to Layer 3 routed core layer
  – Edge layer (aka access layer) at the leaves of the tree
    • Layer 2 switches connect to servers
• A two-tiered design (less prevalent)
  – Core
  – Edge
Data Center Networks Design
Fat Tree – a special form of Clos
• Same aggregate bandwidth at each layer
• Identical bandwidth at any bisection
• Each port same speed as end host
• All devices can transmit at line speed if packets are distributed uniformly along available paths
• Scalability (with k-ary Fat tree) – k-port switches support k^3/4 servers
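The k^3/4 scaling rule is easy to tabulate. The sketch below assumes the usual k-ary fat-tree construction (k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches); the function and key names are illustrative only.

```python
def fat_tree_sizes(k: int) -> dict:
    """Component counts for a k-ary fat tree built from k-port switches."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even number of ports")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "core_switches": half * half,      # (k/2)^2
        "servers": k * half * half,        # k^3 / 4
    }


if __name__ == "__main__":
    for ports in (4, 24, 48):
        print(ports, fat_tree_sizes(ports)["servers"])
```

With commodity 48-port switches this construction reaches 27,648 servers, which is why the slide presents the fat tree as a cost-effective way to scale.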
Fat tree routing (figure: Internet at the top, then core, aggregation, and access layers)
Meeting Other Critical Design Considerations • Clos network topology and its variants allow meeting the goal of scalability while being cost effective • However, intelligent routing decision is key to ensuring performance between end hosts • Statically deciding the routing between any two servers will result in network congestion and hot spots despite the redundant paths available in the Clos network • This calls for an active network where the routing decisions are dynamic
Back to the Drawing Board…
Pkt = dst+payload
Turns out wanting network routing to be active is not a new thing!
• Traditionally, routing decisions in the Internet are static, decided by lookup tables in the routers that are periodically updated
  – Packet arrives, router looks up the table and sends it to the next hop based on destination field and the corresponding table entry
• Active Networks was a vision proposed by Tennenhouse and Wetherall in the mid 90’s
  – Idea is for packets to carry code that can be executed in the routers en route from source to destination
  – Routers could make dynamic routing decisions
Pkt = dst+code+payload
Unfortunately… • Active networks vision was way ahead of its time • Principal shortcomings of vision Vs. reality – Potential vulnerabilities due to protection threats and resource management threats of the routers – Need buy in from router vendors…since it “opens” up the network – Software routing (by executing code at the routers) cannot match “line” speeds of hardware routers
Old wine in a new bottle…
• Resurrection of Active Networks…new name SDN
• Why now?
  – Confluence of two powerful forces
    1) Wide area Internet (WAN traffic engineering)
       • Tech giants such as Google and Facebook want to build their own distribution networks
       • Control over how packets are routed, possibly giving preferential treatment to some flows over others
       • Violates “net neutrality” principle but that is still a hot political debate
    2) Cloud computing (per-customer virtual networks)
       • Service providers want to support dynamic routing decisions to ensure performance isolation for network traffic with multi-tenancy
  – Chip sets for assembling routers are available, which makes it easy to “build your own router”, skirting the need for “buy in” from router vendors
SDN
• Eliminates the key drawbacks of the original Active Networks vision
  – Rule-based routing as opposed to executing arbitrary code in the routers
    • Removes protection and resource management threats
  – Routing for flows set centrally ONCE so that each router can work at “line” speeds
    • Removes poor performance concern (of software routing)
• Real breakthrough is separating control plane (rules setting for routing) from the data plane (transport of payload in the packet)
• Control plane is software controlled from a central entity
• Well-defined APIs to program the control plane – E.g. OpenFlow
SDN Architecture
• Elements of the architecture
  – Switch fabric implementing the desired network topology (e.g., Clos)
  – Central controller setting routing rules for the switches per traffic flow
• Central controllers
  – Programmed using the APIs
  – State maintenance commensurate with control application directives and switch events
  – Set forwarding entries in the switches
• Switch hardware
  – Simple rule-based traffic forwarding as instructed by the central controller
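The split between a central controller that sets rules and switches that only look them up can be sketched as follows. This is a toy model under stated assumptions: the class names, the (src, dst) flow key, and the `install_path` method are hypothetical and are not the OpenFlow API.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Data plane: forwards purely by table lookup, no code execution."""
    name: str
    table: dict = field(default_factory=dict)  # (src, dst) -> output port

    def forward(self, src: str, dst: str):
        # Returns the output port, or None on a table miss.
        return self.table.get((src, dst))


class Controller:
    """Control plane: a central entity that installs per-flow rules."""

    def __init__(self, switches):
        self.switches = {s.name: s for s in switches}

    def install_path(self, src: str, dst: str, hops):
        # hops is a list of (switch_name, output_port) pairs along the path.
        for switch_name, port in hops:
            self.switches[switch_name].table[(src, dst)] = port


if __name__ == "__main__":
    s1, s2 = Switch("s1"), Switch("s2")
    controller = Controller([s1, s2])
    controller.install_path("hostA", "hostB", [("s1", 3), ("s2", 1)])
    print(s1.forward("hostA", "hostB"), s2.forward("hostA", "hostB"))
```

Because the rule for a flow is installed once, every later packet of that flow costs the switch only a lookup, which is the line-speed argument from the SDN slide.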
WAN and SDN
• The focus is network-wide traffic engineering at scale
  – Meet network-wide performance goals
  – Infer traffic patterns and centralize control
Examples
• Tomogravity (Zhang, et al., appeared in ACM Sigmetrics 2003)
  – Inferring end-to-end traffic matrix from link loads
• Routing Control Platform (RCP) (Caesar, et al., appeared in Usenix NSDI 2005)
  – Logically central entity selecting BGP routes for each router in an Autonomous System (AS)
• 4D architecture (Greenberg, et al., appeared in SIGCOMM Comp. Comm. Rev. Oct 2005)
  – Decision, Dissemination, Discovery, and Data
  – Separating an AS's decision logic from governing protocols among the network elements
• Google’s software-defined WAN (Koley, Google “Koley BTE 2014”)
  – Balancing network capacity against application priority using OpenFlow
Cloud and SDN
• Focus of this course module
• Cloud provides the much needed “killer app” for SDN
• “Perfect storm” of need, control, and personnel
  – Need
    • Network traffic isolation for each customer
    • Network performance isolation for each customer
  – Control
    • Complete control over the network fabric for the cloud provider
    • Datacenter network (and possibly inter-datacenter network fabric) is not OPEN to all Internet traffic as is the case with WAN
  – Personnel
    • System developers (architecture, system software, and network gurus) all aligned on a common purpose
Cloud SDN Architecture (figure: Clients; Resource Managers (RM); Control Plane with System Objectives, Target State, Observed State, a Network State Updater, and a Monitor; Network Fabric inside the Data Center)
Cloud SDN Design Principles
• RMs field client requests
• System objectives
  – SLAs with clients
• RMs funnel requests to the control plane
  – Decide “target state”
• Control drives the network fabric
  – Monitor and record “observed state”
  – Update goal: drive to “target state”
• Net fabric
  – Stateless switches
  – Carry out controller’s strategy for packet forwarding
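The "drive observed state to target state" principle amounts to a reconciliation loop. A minimal sketch, assuming forwarding state is modelled as a dictionary from a hypothetical (tenant, switch) key to an output port; the function names are illustrative and do not come from any particular controller.

```python
def plan_updates(target: dict, observed: dict) -> list:
    """Compute the operations that move the observed state to the target state."""
    operations = []
    for key, port in target.items():
        if observed.get(key) != port:
            operations.append(("set", key, port))      # missing or stale entry
    for key in observed:
        if key not in target:
            operations.append(("delete", key, None))   # entry no longer wanted
    return operations


def apply_updates(observed: dict, operations: list) -> dict:
    """Return the state the fabric would report after applying the operations."""
    state = dict(observed)
    for action, key, port in operations:
        if action == "set":
            state[key] = port
        else:
            state.pop(key, None)
    return state


if __name__ == "__main__":
    target = {("tenant1", "s1"): 2, ("tenant1", "s2"): 1}
    observed = {("tenant1", "s1"): 3, ("tenant2", "s1"): 1}
    print(plan_updates(target, observed))
```

When the monitor reports a state equal to the target, the plan is empty and the updater has nothing to do, which is the steady state the design aims for.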
System Issues in Cloud Computing Virtualization Basics – Part 2 KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Internet Traffic Patterns • Multiple independent data flows • “Well-behaved” (e.g., TCP) and “not so well-behaved” (e.g., UDP) flows injecting packets into the networks • Not to mention malicious traffic flows, which may be indistinguishable at the packet level until they hit a host and are detected at higher levels of the protocol stack
• Packet level issues • Out of order delivery • Packet loss (mostly due to dropped packets due to congestion) • When it happens…it happens in bursts…snowballing effect
• Packet corruption (very rare…less than 0.1%)
Network Congestion • Offered load to the network exceeds network capacity
• Problem is common to WAN and the Cloud • Recall Cloud network fabric uses the same commodity switching gear as WAN • TCP is the dominant transport protocol in both • Sender uses a dynamically sized window to decide when to inject new packets into the network
General Strategies for Congestion Control • Principle of conservation of packets • Equilibrium state • Stable set of data packets in transit
• Inject a new packet only when an old packet leaves
• Metered on-ramps to highways • Dynamically change the metering rate commensurate with observed traffic congestion
• Problems with conservation of packets • Sender injects packets before “old” leaves • Equilibrium state never reached due to resource constraints along the path
Congestion Avoidance • Slow start • Sender starts with a small send window (e.g., 1) • Increase the send window by 1 on each ACK aka “additive increase” • Problem • Potentially increase the send window exponentially despite “slow” start
• Conservation at equilibrium • Need good estimate of RTT to set retransmission timer • Upon retransmission • Halve the send window size (exponential backoff) aka “multiplicative decrease”
• Congestion avoidance • Combine Slow start and conservation at equilibrium • Current window size = send-window/2 upon retransmission • Current window size = send-window + 1 upon ACK
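The window rules on this slide can be traced in a few lines. The sketch below follows the slide's simplified rules (window + 1 on each ACK, window / 2 on retransmission), not a full TCP implementation; the function names and the floor of one packet are assumptions made for illustration.

```python
def on_ack(window: int) -> int:
    """Slide's rule: grow the send window by 1 for each ACK received."""
    return window + 1


def on_retransmit(window: int) -> int:
    """Slide's rule: halve the send window, never going below one packet."""
    return max(1, window // 2)


def slow_start_rounds(rounds: int, start: int = 1) -> list:
    """Window size at the start of each round trip when nothing is lost.

    Every packet sent in a round is ACKed, and every ACK adds 1 to the
    window, so the window doubles once per round trip.
    """
    window = start
    sizes = [window]
    for _ in range(rounds):
        for _ in range(window):  # one ACK per packet sent in this round
            window = on_ack(window)
        sizes.append(window)
    return sizes


if __name__ == "__main__":
    print(slow_start_rounds(5))
```

The doubling per round trip is the "problem" the slide points out: adding 1 per ACK looks additive, yet it is exponential in time, which is why it is paired with halving on retransmission.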
System Issues in Cloud Computing Virtualization Basics – Part 3 KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Data Center Traffic Engineering Challenges • Scale of Data Center • 100s of thousands of servers => plethora of switches in the routing fabric
• Change is the order of the day • With so many components in a data center, failure is not an “if” question…it is a “when” question • VM migration for load balancing and failure tolerance
• Application characteristics and their traffic patterns • Implications of performance loss
Data Center Networks • Characteristics • Flow • Data center applications generate diverse mix of short and long flows
• Requirements • Applications tend to have the partition/aggregate workflow • Low latency for short flows • High burst tolerance for short flows
• Applications need internal state maintenance for freshness of data • High throughput for long flows
• Implications • Congestions can lead to severe performance penalty and loss of revenue
• On the bright side • Round trip times can be more accurately estimated • Congestion avoidance schemes can be tailored to take advantage of the data center ecosystem
System Issues in Cloud Computing Azure Networking Fabric KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Overview
• Applications desire layer-2 semantics for server-server communication
  – Communication latency and throughput bound by the network interface speeds of the source and destination
• Applications need elasticity of resource allocation
  – Grow and shrink computational resources based on need
  – Do not suffer network performance loss for such flexibility in resource allocation
• Agility of data center
  – Ability to allocate resources on demand to meet the dynamic application needs
  – Ensure network performance scaling in the presence of such dynamic allocation
Limitations to Agility in Data Centers • Insufficient network capacity for connecting the servers • Conventional network architectures – Tree topology using high-cost hardware • Links over-subscribed as we reach higher levels of the tree
– Fragments server pool • Network congestion and server hotspots
– Network flooding of one service affects others – Prevents easy relocation of services when IP addresses are statically bound to servers
VL2 solution at a high level • Illusion of Virtual Layer 2 – Appears as though all servers for a given service are connected to one another via a non-interfering Ethernet switch – Scaling up or down of servers for a service maintains this illusion intact
• Objectives – Uniform Capacity • Independent of topology, server-server communication limited only by NICs connected to the servers • Assigning servers to services independent of network topology
– Performance Isolation • Traffic of one service does not affect others
– Flexible assignment of IP addresses to Ethernet ports to support server mobility commensurate with service requirements
Attribution The slides that follow are used with permission from Microsoft Research, Sigcomm 2009 presentation Resource Information VirtualLayer2: A Scalable and Flexible Data‐Center Network Microsoft Research Changhoon Kim Work with Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta
Tenets of Cloud-Service Data Center
• Agility: Assign any servers to any services – Boosts cloud utilization (statistical multiplexing gain)
• Scaling out: Use large pools of commodities – Achieves reliability, performance, low cost (economies of scale)
What is VL2? The first DC network that enables agility in a scaled-out fashion
Why is agility important? – Today’s DC network inhibits the deployment of other technical advances toward agility
With VL2, cloud DCs can enjoy agility in full
Status Quo: Conventional DC Network
(Figure: the Internet connects to core routers (CR) in DC-Layer 3; below them are access routers (AR); Ethernet switches (S) in DC-Layer 2 connect down to racks of application servers.)
Key
• CR = Core Router (L3)
• AR = Access Router (L3)
• S = Ethernet Switch (L2)
• A = Rack of app. servers
~ 1,000 servers/pod == IP subnet
Reference – “Data Center: Load balancing Data Center Services”, Cisco 2004
Conventional DC Network Problems
(Figure: the same CR / AR / S tree annotated with oversubscription ratios that grow toward the root: ~ 5:1, ~ 40:1, and ~ 200:1.)
• Dependence on high-cost proprietary routers
• Extremely limited server-to-server capacity
And More Problems …
(Figure: the same CR / AR / S tree with ~ 200:1 oversubscription at the top and the servers split into IP subnet (VLAN) #1 and IP subnet (VLAN) #2.)
• Resource fragmentation, significantly lowering cloud utilization (and cost-efficiency)
• Complicated manual L2/L3 re-configuration
• Revenue lost, expense wasted
Know Your Cloud DC: Challenges • Instrumented a large cluster used for data mining and identified distinctive traffic patterns • Traffic patterns are highly volatile – A large number of distinctive patterns even in a day
• Traffic patterns are unpredictable – Correlation between patterns very weak
Optimization should be done frequently and rapidly
Know Your Cloud DC: Opportunities • DC controller knows everything about hosts • Host OS’s are easily customizable • Probabilistic flow distribution would work well enough, because … – Flows are numerous and not huge – no elephants! – Commodity switch-to-switch links are substantially thicker (~ 10x) than the maximum thickness of a flow
DC network can be made simple
All We Need is Just a Huge L2 Switch, or an Abstraction of One
(Figure: the CR / AR / S tree replaced by a single large switch connecting all server racks.)
1. L2 semantics
2. Uniform high capacity
3. Performance isolation
Specific Objectives and Solutions

Objective | Approach | Solution
1. Layer-2 semantics | Employ flat addressing | Name-location separation & resolution service
2. Uniform high capacity between servers | Guarantee bandwidth for hose-model traffic | Flow-based random traffic indirection (Valiant LB)
3. Performance isolation | Enforce hose model using existing mechanisms only | TCP
Addressing and Routing: Name-Location Separation
Cope with host churns with very little overhead
• Switches run link-state routing and maintain only switch-level topology
  – Allows the use of low-cost switches
  – Protects network and hosts from host-state churn
  – Obviates host and switch reconfiguration
• Servers use flat names
• Directory Service (Lookup & Response) maps each server name to its ToR switch
  – x → ToR2
  – y → ToR3
  – z → ToR4, later ToR3 (the slide overprints both values, apparently as z moves)
(Figure: switches ToR1 … ToR4 with servers x, y, z below them; packets are shown as (ToR address, server name, payload), e.g. (ToR3, y, payload).)
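Name-location separation can be illustrated with a small lookup-and-encapsulate sketch. This is a toy model, not VL2's implementation: the `DirectoryService` class, the dictionary packet format, and the field names are hypothetical.

```python
class DirectoryService:
    """Maps a flat server name to the ToR switch the server currently sits behind."""

    def __init__(self):
        self._location = {}

    def update(self, name: str, tor: str) -> None:
        # Called when a server is placed or migrates.
        self._location[name] = tor

    def lookup(self, name: str) -> str:
        return self._location[name]


def encapsulate(directory: DirectoryService, dst_name: str, payload: str) -> dict:
    """Wrap a packet so the fabric routes on the ToR address only."""
    return {
        "outer_dst": directory.lookup(dst_name),  # location, used by switches
        "inner_dst": dst_name,                    # flat name, used by the host
        "payload": payload,
    }


if __name__ == "__main__":
    directory = DirectoryService()
    directory.update("z", "ToR4")
    print(encapsulate(directory, "z", "hello"))
    directory.update("z", "ToR3")  # z migrates; only the directory changes
    print(encapsulate(directory, "z", "hello"))
```

The server keeps its name across the move and no switch is reconfigured, which is how the design copes with host churn at low overhead.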
Example Topology: Clos Network
Offer huge aggregate capacity and multi paths at modest cost
• Three layers of switches: Int (intermediate), Aggr (aggregation), TOR (top of rack)
• K aggr switches with D ports
• 20 servers per TOR
• 20*(DK/4) servers in total

D (# of 10G ports) | Max DC size (# of servers)
48 | 11,520
96 | 46,080
144 | 103,680
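The sizing formula can be checked against the table. The sketch below assumes K = D (as many aggregation switches as ports per switch), an assumption that is not stated on the slide but reproduces all three table rows; the function name is illustrative.

```python
def vl2_max_servers(d_ports: int, k_aggr: int = None, servers_per_tor: int = 20) -> int:
    """Servers supported by the example Clos topology: 20 * (D * K / 4)."""
    if k_aggr is None:
        k_aggr = d_ports  # assumption: K = D, which matches the table
    return servers_per_tor * (d_ports * k_aggr) // 4


if __name__ == "__main__":
    for d in (48, 96, 144):
        print(d, vl2_max_servers(d))
```

Doubling the port count quadruples the supported servers, so modest increases in switch size buy large increases in data center size.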
Traffic Forwarding: Random Indirection
Cope with arbitrary TMs with very little overhead
[ ECMP + IP Anycast ]
• Harness huge bisection bandwidth
• Obviate esoteric traffic engineering or optimization
• Ensure robustness to failures
• Work with switch mechanisms available today
(Figure: ToR switches T1 to T6 sit below intermediate switches that all share the anycast address IANY; a packet from x to y or z first takes an up path to IANY and then a down path to the destination ToR.)
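Flow-based random indirection can be sketched as a hash over the flow identifier. In the real design the spreading is done by ECMP in the switches; the snippet below only mimics the idea in software, and the 5-tuple layout, the switch names, and the use of SHA-256 are assumptions for illustration.

```python
import hashlib


def pick_intermediate(flow: tuple, intermediates: list) -> str:
    """Choose one intermediate switch per flow.

    Hashing the flow identifier keeps every packet of a flow on the same
    path (no reordering) while spreading different flows across switches.
    """
    digest = hashlib.sha256(repr(flow).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(intermediates)
    return intermediates[index]


if __name__ == "__main__":
    switches = ["I1", "I2", "I3", "I4"]
    # Hypothetical 5-tuple: (src IP, dst IP, protocol, src port, dst port)
    flow = ("10.0.0.1", "10.0.1.1", 6, 40000, 80)
    print(pick_intermediate(flow, switches))
```

Because the choice ignores the traffic matrix entirely, no per-pattern optimization is needed, which matches the slide's claim that esoteric traffic engineering is obviated.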
Does VL2 Ensure Uniform High Capacity?
• How “high” and “uniform” can it get?
  – Performed all-to-all data shuffle tests, then measured aggregate and per-flow goodput
  – Goodput efficiency: 94%
  – Fairness§ between flows: 0.995
§ Jain’s fairness index defined as (∑ x_i)² / (n · ∑ x_i²)
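Jain's fairness index is straightforward to compute from per-flow measurements. A minimal sketch of the formula above; the function name and the sample inputs are illustrative and are not the measured VL2 data.

```python
def jain_fairness(values) -> float:
    """Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2).

    Equals 1.0 when all values are equal and 1/n when one flow gets everything.
    """
    values = list(values)
    n = len(values)
    sum_of_squares = sum(x * x for x in values)
    if n == 0 or sum_of_squares == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(values)
    return (total * total) / (n * sum_of_squares)


if __name__ == "__main__":
    print(jain_fairness([5, 5, 5, 5]))
    print(jain_fairness([1, 0, 0, 0]))
```

A reported value of 0.995 therefore means the per-flow goodputs were very close to equal.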
• The cost for flow-based random spreading
(Figure: fairness§ of Aggr-to-Int links’ utilization over time; y-axis fairness from 0.80 to 1.00, x-axis time from 0 to 500 s.)
VL2 Conclusion VL2 achieves agility at scale via 1. L2 semantics 2. Uniform high capacity between servers 3. Performance isolation between services
Lessons • Randomization can tame volatility • Add functionality where you have control • There’s no need to wait!
System Issues in Cloud Computing DC Networking: Testing, Debugging, and Traffic Analysis KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
In this Lecture, we will cover topics including metrics for network performance evaluation, tools for testing and debugging DC networks, and case studies of analyzing DC network traffic. The lecture will shed light on the following topics: (a) What can go wrong in DC networks? (b) Tools for testing and debugging (c) Tools for measurements and performance analysis, and (d) Case studies of measurements of DC networks
DC Networking: Testing, Debugging, and Traffic Analysis
OVERVIEW
Metrics Latency Throughput Utilization Scalability
Performance Evaluation Modeling Simulation Implementation and Measurements
What can go wrong in DC networks Tools for testing and debugging Tools for measurements and performance analysis Case studies of measurements of DC networks
DC Networking: Testing, Debugging, and Traffic Analysis
WHAT CAN GO WRONG?
What can go wrong? • External Resources Used • From UW-Madison class notes CS838 – http://pages.cs.wisc.edu/~akella/CS838/F12/notes/sdn_testing_and_debugging.txt
• From 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) – VeriFlow: Verifying Network-Wide Invariants in Real Time
Typical networking related problems
• Forwarding loops • Link failures • Forwarding inconsistencies -> often leads to forwarding loops • Unreachable hosts • Unwarranted access to hosts (which should not be reachable)
DC Vs. Traditional Networks • Forwarding loops – in traditional networks caused by failure of spanning tree protocols.
• Link failures – response is different, but problem is the same.
• Unreachable hosts – in traditional networks due to errors in ACLs or routing entries – in SDNs due to missing forwarding entries.
• Unwarranted access to hosts – in traditional networks due to errors in ACLs – in SDNs caused by unintended rule overlap
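Forwarding loops and unreachable hosts can both be found by walking the forwarding entries. A minimal sketch, assuming the network state is available as a dictionary of per-switch next hops; the data layout and function name are hypothetical, and the tools surveyed later in this lecture use far more scalable techniques.

```python
def trace(next_hop: dict, start: str, dst: str):
    """Follow forwarding entries from start toward dst.

    next_hop maps switch -> {destination -> next node}.
    Returns (verdict, path) where verdict is 'delivered', 'loop',
    or 'unreachable' (a missing forwarding entry).
    """
    path = [start]
    visited = {start}
    node = start
    while node != dst:
        following = next_hop.get(node, {}).get(dst)
        if following is None:
            return "unreachable", path
        if following in visited:
            return "loop", path + [following]
        path.append(following)
        visited.add(following)
        node = following
    return "delivered", path


if __name__ == "__main__":
    good = {"s1": {"h": "s2"}, "s2": {"h": "h"}}
    looped = {"s1": {"h": "s2"}, "s2": {"h": "s1"}}
    print(trace(good, "s1", "h"))
    print(trace(looped, "s1", "h"))
```

An inconsistent pair of entries on two switches is enough to produce the loop case, which is why forwarding inconsistencies are listed as a common cause.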
Challenges for DC Networks • Potential control loops – switches, controller, application
• End-to-end behavior of networks. – Networks are getting larger
– Network functionality becoming more complex – Partitioning functionality • Unclear boundary of functionality between network devices and external controllers
Effect of Network Bugs • Unauthorized entry of packets into a secured zone • Vulnerability of services and the infrastructure to attacks • Denial of critical services • Affect network performance and violation of SLAs
Approaches to Testing and Debugging
• Domain-specific Languages to minimize errors • Limited set of primitives • Symbolic execution and model checking • Static analysis of the network state • Live debugging
System Issues in Cloud Computing DC Networking: Testing, Debugging, and Traffic Analysis: Part 2 KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Video title: Tools for testing, debugging, and verification • External Resources used
– A NICE Way to Test OpenFlow Applications by Canini et al. (https://www.usenix.org/sites/default/files/conference/protected-files/canini-nice.pdf)
– OFRewind: Enabling Record & Replay Troubleshooting for Networks by Andreas Wundsam (https://www.usenix.org/legacy/events/atc11/tech/slides/wundsam.pdf)
– Where is the Debugger for my Software-Defined Network? by Handigol et al. (http://yuba.stanford.edu/~nikhilh/talks/NDB-HotSDN2012.pptx)
– Header Space Analysis: Static Checking For Networks (https://www.usenix.org/sites/default/files/conference/protected-files/headerspace_nsdi.pdf)
– Real Time Network Policy Checking using Header Space Analysis by Kazemian (https://www.usenix.org/sites/default/files/conference/protected-files/kazemian_nsdi13_slides.pdf)
– VeriFlow: Verifying Network-Wide Invariants in Real Time by Khurshid et al. (https://www.usenix.org/sites/default/files/conference/protected-files/khurshid_nsdi13_slides.pdf)
Survey of tools and techniques • NICE • OFRewind • NDB • Header space analysis • Netplumber • Veriflow
NICE Goal Can we build a tool for systematically testing Data center applications?
Scalability Challenges • Data plane – Huge space of possible packets
• Control plane – Huge space of event ordering on the network
• Light at the end of the tunnel – Equivalence classes of data packets – Application knowledge of event ordering
NICE (No bugs In Controller Execution) • Tool for automatically testing OpenFlow Apps – Goal: systematically test possible behaviors to detect bugs
• Possible state-space is exponential => Combinatorial explosion with brute force approach
• NICE’s magic sauce – State-space exploration via Model Checking (MC) – Combine Symbolic Execution (SE) with Model Checking to prevent state-space explosion
[Figure: NICE workflow. Input: unmodified OpenFlow program, network topology, correctness properties. NICE tool: state-space search over the system state (switches, end-hosts, communication channels). Output: traces of property violations.]
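The idea of state-space search over event orderings can be sketched in a few lines. The example below is a toy model, not NICE itself: it assumes one switch, a single "install_rule" control event, and data packets that are dropped if they arrive before the rule is installed. All names (`apply_event`, `find_violations`) are hypothetical. It explores every ordering and reports the traces that violate the property "no packet is dropped"; the `seen` set is a crude stand-in for the equivalence-class pruning mentioned above, and symbolic execution is not modeled.

```python
from typing import Callable, FrozenSet, List, Tuple

State = Tuple[bool, int]  # (rule_installed, packets_dropped)


def apply_event(state: State, event: str) -> State:
    installed, dropped = state
    if event == "install_rule":
        return (True, dropped)
    # Any other event is a data packet arriving at the switch.
    return (installed, dropped if installed else dropped + 1)


def find_violations(events: FrozenSet[str], initial: State,
                    invariant: Callable[[State], bool]) -> List[List[str]]:
    """Depth-first search over all event orderings; return violating traces."""
    violations: List[List[str]] = []
    seen = set()
    stack = [(initial, frozenset(events), ())]
    while stack:
        state, pending, history = stack.pop()
        if not invariant(state):
            violations.append(list(history))
            continue
        key = (state, pending)
        if key in seen:
            continue  # already explored an equivalent configuration
        seen.add(key)
        for event in sorted(pending):
            stack.append((apply_event(state, event), pending - {event},
                          history + (event,)))
    return violations


bad_traces = find_violations(
    frozenset({"install_rule", "pkt_a", "pkt_b"}),
    initial=(False, 0),
    invariant=lambda s: s[1] == 0,
)
print(sorted(bad_traces))
```

Every violating trace begins with a packet that arrives before the rule is installed, which is the kind of ordering bug a systematic search is meant to surface.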
OFRewind • Record-and-replay troubleshooting for OpenFlow networks • How does it work? – In Production • Record state (events and traffic) with minimal overhead
– Later • Replay state at convenient pace
– Troubleshoot • Reproduce problems at chosen times/locations
Keys to scalability of this approach • Record control plane traffic • Skip/aggregate data plane traffic • Replay
– Best effort as opposed to deterministic
Over-arching Goal • Partial recording/replay of chosen times/locations to reproduce problems
How to use OFRewind • Deploy OFRecord in production – “Always On” OF messages, control plane, data plane summaries – Selection rules as necessary
• Deploy OFReplay in the lab – Localize bugs and validate bug fixes
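The record/replay split described above can be illustrated with a small sketch. This is not OFRewind's implementation: the `Recorder` class, its method names, and the tuple layout are assumptions made for illustration. Control-plane messages are logged in full, data-plane traffic is only summarized as byte counts, and replay is restricted to a chosen time window.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Recorder:
    control_log: List[Tuple[float, str, str]] = field(default_factory=list)
    data_summary: Counter = field(default_factory=Counter)

    def on_control(self, timestamp: float, switch: str, message: str) -> None:
        # Control-plane traffic is small, so it is recorded in full.
        self.control_log.append((timestamp, switch, message))

    def on_data(self, switch: str, nbytes: int) -> None:
        # Data-plane traffic is aggregated instead of stored.
        self.data_summary[switch] += nbytes

    def replay(self, handler: Callable[[str, str], None],
               start: float, end: float) -> int:
        """Replay control messages recorded in [start, end]; return the count."""
        count = 0
        for timestamp, switch, message in sorted(self.control_log):
            if start <= timestamp <= end:
                handler(switch, message)
                count += 1
        return count


recorder = Recorder()
recorder.on_control(1.0, "s1", "PACKET_IN")
recorder.on_control(2.0, "s1", "FLOW_MOD")
recorder.on_control(9.0, "s2", "PACKET_IN")
recorder.on_data("s1", 1500)
recorder.on_data("s1", 500)

replayed = []
n = recorder.replay(lambda sw, msg: replayed.append((sw, msg)), start=0.0, end=5.0)
print(n, replayed, recorder.data_summary["s1"])
```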
ndb: Network Debugger Goal • Tool for live debugging of errant network behavior (similar to gdb for program control flow)
Approach • Use SDN architecture to systematically track down network bugs • Capture and reconstruct the sequence of events leading to the errant behavior • Network breakpoints defined by the user –Filter (header, switch) to identify the errant behavior
• Back trace generation –Path taken by the packet –State of the flow tables at each switch
How it works
• Breakpoint is a filter on packet header
• Switches send a “postcard” to a central collector for every packet that matches the breakpoint
• Collector stores the postcards to construct a “backtrace”
[Figure: nodes A, B, and S, and the collector]
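As a rough illustration of postcards and backtraces, the sketch below models a breakpoint as a dictionary of header fields and a collector that orders postcards by hop time. The class and field names are hypothetical; ndb's actual postcard format is not shown in these slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def matches(breakpoint: Dict[str, str], header: Dict[str, str]) -> bool:
    """A breakpoint is a filter: every listed field must match the header."""
    return all(header.get(name) == value for name, value in breakpoint.items())


class Collector:
    def __init__(self) -> None:
        self._postcards = defaultdict(list)

    def receive(self, packet_id: str, hop_time: float, switch: str,
                flow_table_version: int) -> None:
        # Each postcard records where the packet was and which
        # flow-table state the switch used to forward it.
        self._postcards[packet_id].append((hop_time, switch, flow_table_version))

    def backtrace(self, packet_id: str) -> List[Tuple[str, int]]:
        cards = sorted(self._postcards.get(packet_id, []))
        return [(switch, version) for _, switch, version in cards]


breakpoint_filter = {"dst_ip": "10.0.0.2", "tcp_dport": "22"}
header = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "tcp_dport": "22"}

collector = Collector()
if matches(breakpoint_filter, header):
    # Postcards may arrive out of order; the collector sorts them.
    collector.receive("pkt-1", 0.3, "s3", 7)
    collector.receive("pkt-1", 0.1, "s1", 4)
    collector.receive("pkt-1", 0.2, "s2", 9)

print(collector.backtrace("pkt-1"))
```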
Header Space Analysis
Elements of this technique • General model that is agnostic to the protocols and network topologies • Models the packet header (length L) as a point in an L-dimensional hyperspace • Models all network boxes as a transformer on the packet header space • Defines an algebra for this transformation – Composable, invertible
• Models all kinds of forwarding functionalities regardless of specific protocols and implementations
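[Figure: a packet with a header of length L (xxx0100xxx) and a payload (xxxxxx); a box with transfer function T maps header hi=0100 on input port Pin to header ho=1010 on output port Pout, between nodes A and B.]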
Using the Model All traffic flows can be expressed as a series of transformations using the algebra Allows asking questions such as – Can two hosts communicate with each other? – Are there forwarding loops? – Are there network partitions?
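A small sketch can show what the header-space algebra looks like in practice. Headers are written here as ternary strings over 0, 1, and x (wildcard), as in the figure. The functions `intersect` and `transfer` are simplified, hypothetical stand-ins for the operations of the published algebra: `intersect` returns the set of headers common to two wildcard expressions, and `transfer` models a box that rewrites the bits of matching headers.

```python
from typing import Optional


def intersect(a: str, b: str) -> Optional[str]:
    """Intersection of two wildcard expressions, or None if it is empty."""
    if len(a) != len(b):
        raise ValueError("headers must have the same length L")
    out = []
    for x, y in zip(a, b):
        if x == "x":
            out.append(y)
        elif y == "x" or x == y:
            out.append(x)
        else:
            return None  # one side requires 0, the other requires 1
    return "".join(out)


def transfer(header: str, match: str, rewrite: str) -> Optional[str]:
    """Apply a box: if `header` meets `match`, overwrite the bits that
    `rewrite` specifies (an 'x' in `rewrite` leaves the bit unchanged)."""
    hit = intersect(header, match)
    if hit is None:
        return None  # the box does not forward this header
    return "".join(r if r != "x" else h for h, r in zip(hit, rewrite))


print(intersect("xxx0100xxx", "1xxxxxxxx0"))
print(intersect("0100", "1xxx"))
print(transfer("0100", "01xx", "1010"))
```

Questions such as reachability between two hosts reduce to composing `transfer` along a path and checking whether the result is non-empty; a loop shows up when a header region returns to a port it already visited.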
Netplumber • A system built on header space analysis • Creates a dependency graph of all forwarding rules in the network and uses it to verify policy – Nodes in the graph: forwarding rules in the network
– Directed edges: next-hop dependency between the forwarding rules
Using Netplumber Represent forwarding policy as a dependency graph • Flexible policy expression – Probe and source nodes are flexible to place and configure
• Incremental update – Only have to trace through dependency sub-graph affected by an update to the forwarding policy
• Parallelization – Can partition dependency graph into clusters to minimize inter-cluster dependences
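The dependency-graph idea and the benefit of incremental updates can be sketched as follows. This is an illustrative toy, not Netplumber's data structure: rules are assumed to be `(switch, match, next_switch)` triples with ternary match strings, and an edge is drawn from rule a to rule b when a forwards to b's switch and the two matches overlap. After a rule changes, only the rules reachable from it need to be rechecked.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Rule = Tuple[str, str, Optional[str]]  # (switch, match, next_switch)


def overlaps(a: str, b: str) -> bool:
    """Two ternary matches overlap unless some bit is 0 in one and 1 in the other."""
    return all(x == y or "x" in (x, y) for x, y in zip(a, b))


def build_graph(rules: Dict[str, Rule]) -> Dict[str, List[str]]:
    edges: Dict[str, List[str]] = {name: [] for name in rules}
    for a, (_, match_a, next_switch) in rules.items():
        for b, (switch_b, match_b, _) in rules.items():
            if a != b and next_switch == switch_b and overlaps(match_a, match_b):
                edges[a].append(b)
    return edges


def affected(edges: Dict[str, List[str]], changed: str) -> Set[str]:
    """Rules downstream of `changed`: the sub-graph to re-verify."""
    seen, queue = {changed}, deque([changed])
    while queue:
        for nxt in edges[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


rules = {
    "r1": ("s1", "10xx", "s2"),
    "r2": ("s2", "1xxx", "s3"),
    "r3": ("s2", "0xxx", "s3"),
    "r4": ("s3", "xxxx", None),
}
graph = build_graph(rules)
print(graph)
print(sorted(affected(graph, "r3")))
```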
Veriflow • Tackles the problem of network-wide verification of traffic flows • Goal – Detect routing loops, black holes, access control violations
• Approach – Interpose verification layer between SDN controller and the network elements – Formulate network invariants from the SDN controller – Construct a model of network behavior – Monitor the network for violation of invariants
[Figure: Veriflow sits between the SDN controller and the network elements.]
System Issues in Cloud Computing DC Networking: Testing, Debugging, and Traffic Analysis: Part 3 KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Video title: Network Traffic Characteristics • External resources used
– Network Traffic Characteristics of Data Centers in the Wild by Benson et al. (https://www.microsoft.com/en-us/research/wp-content/uploads/2010/11/DC-Network-Characterization-imc2010.pdf)
– A First Look at Inter-Data Center Traffic Characteristics via Yahoo! Datasets by Chen et al. (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.8209&rep=rep1&type=pdf)
– Inter-Datacenter WAN with centralized TE using SDN and OpenFlow (https://www.opennetworking.org/images/stories/downloads/sdnresources/customer-case-studies/cs-googlesdn.pdf)
DC Traffic Analysis • A general study • Classification of DC traffic by Yahoo • Structure of Google WAN
DC Network Traffic Study
(Benson, et al.)
• Are links over-subscribed? • Is there sufficient bisection bandwidth? • Is centralization (via SDN controller) feasible?
Setup for the Study • Classic model of DC network – Core (L3), aggregation (L2), edge (TOR-L2) layers
• 10 data centers from three classes – University, Private enterprise, Cloud
• User community – Internal (university, private) and external (Cloud)
• Methodology – Analyze running apps using packet traces – Quantify network traffic from apps
Results Summary • Significant amount of small packets (~50% less than 200 bytes) – TCP acks, “I am alive” msgs
• Importance of Connection persistence • Traffic distribution – Clouds • Most traffic (75%) within a rack => good colocation of application components
– Other DCs • 50% inter-rack => un-optimized placement
• Link Utilization – Core > Aggregation > Edge – Bisection bandwidth sufficient (only 30% of the bisection used)
Insights from the Study • Are links over-subscribed? No – 75% traffic within a rack – Core links utilization < 25% – Need better load balancing, VM placement, and VM migration
• Is there sufficient bisection bandwidth? Yes – Small packet sizes – Utilization < 30%
• Is centralization feasible? Yes
Classification of Traffic • D2C Traffic – The traffic exchanged between Yahoo! servers and clients
• D2D Traffic – The traffic exchanged between different Yahoo! servers at different locations
• Client – A non-Yahoo! host that connects to a Yahoo! server
Methodology for collecting traffic data • Anonymized NetFlow datasets collected at the border routers of five major Yahoo! data centers – Dallas (DAX), Washington DC (DCP), Palo Alto (PAO), Hong Kong (HK) and United Kingdom (UK)
• Meta data collected – timestamp, source and destination IP address, transport layer port number, source and destination interface on the router, IP protocol, number of bytes and packets exchanged
Key findings of the study • Yahoo! data centers are hierarchically structured • D2D traffic patterns – D2C triggered traffic • Smaller with higher variance commensurate with user dynamics
– Background traffic • Dominant and not much variance
• Highly correlated traffic at data centers – Replicated services at different data centers – Implications for distributing services at multiple data centers
Google WAN • Characteristics – Global user base – QoS needs • High availability and quick response
• Implications – Rapid movement of large data across WAN
• Organization of the network – I-scale: Internet facing – G-scale: Inter-data center
G-Scale Network • OpenFlow powered SDN • Proprietary switches from merchant silicon and open source routing stacks with OpenFlow support. • Each site – Multiple switch chassis – Scalability (multiple terabits of bandwidth) – Fault tolerance • Sites inter-connected by G-scale network • Multiple OpenFlow controllers • No single point of failure
Google’s SDN and Traffic Engineering
• Collects real-time utilization metrics and topology data from the underlying network and bandwidth demand from applications/services • Computes path assignments for traffic flows and programs the paths into the switches using OpenFlow • Manages end-to-end paths resulting in high utilization of the links
Concluding Headshot In this lecture, we presented a smorgasbord of tools and techniques for testing, debugging, verification, and performance analysis of DC networks. The intent was to give a broad overview and not an in-depth study of any specific tool or technique. We have provided a lot of resources for the curious viewer to do further exploration. Also, in implementing, testing, debugging, and analyzing the comprehensive hands-on project we have designed for this module on SDN, these tools and techniques will come in handy.
College of Computing, Georgia Institute of Technology
SDN Workshop #1: Mininet and Topologies
In this module of the course, you will deploy an interconnection topology using Mininet.
1 Expected outcome
The student will learn about network topologies, the basics of Mininet, and how to take measurements on networks.
2 Specification
Using the framework given, implement the topology shown in figure 2.1. This topology has two levels: the core and the edge. Every link in the system has the same bandwidth. Figure 2.1 includes the parameters that can be modified when running the system. The last level contains the hosts, which are the emulated computers connected to the network. The topology has to be configurable through the following parameters: numCores (number of core switches), numEdgeSwitches (number of edge switches), hostsPerEdge (number of host computers connected to each edge switch), bw (bandwidth of all the links) and delay (delay of the links). For this assignment, test and characterize the system using the following tools: wireshark, iperf and ping.
3 Assumptions
The student needs to know the basics of using Linux terminals.
Figure 2.1: Assignment Topology
3.1 Useful references
• Introduction to the Linux terminals
• The cp command
• Connecting via ssh to your server
4 TODO list
The students need to verify that the following expectations are met:
• The system correctly implements the topology: complete the mininet_topo.py file in the git repository.
• Evaluate connectivity using ping between each pair of hosts: use pingall on the mininet CLI.
• Evaluate bandwidth using iperf between each pair of hosts: change the test function in the mininet_topo.py file to include a test between all possible pairs.
• Understand the effect of rules and flows: use Wireshark with the of filter to see the messages being sent using the OpenFlow protocol.
• Find the bottlenecks that may appear in the system by creating tests that force a worst-case scenario. The bottlenecks can be caused by the topology, by the rules, or by a combination of the two. Simple bottlenecks can be discovered using simple iperf instructions, while others need to be forced by installing flow rules on the Open vSwitch software switches using the ovs-ofctl command (more information can be found here).
5 Environment
To complete this workshop, you need the following:
1. Open vSwitch
2. mininet
3. Wireshark
4. Ryu SDN Controller
The base code for this workshop can be found in the workshop1 github repository. We suggest that you use the latest Ubuntu 20.04 LTS VM hosted on the freely available VirtualBox software. The workshop2 repo contains a setup script which you may use to prepare the VM for this workshop. Reboot the VM after setup.
$ sudo ./setup.sh
6 Connecting through ssh to the VM
To connect to your virtual machine through ssh instead of using the GUI, follow these two steps:
• Add a host-only network: open File menu/Host Network Manager and press the "Create host-only network" button with default settings.
• Select your VM and go to the Settings tab. Go to Network->Adapter 2. Select the "Enable Adapter" box and attach it to the "host-only network" created in the previous step.
7 Workshop
For the workshop, we are first going to follow these two tutorials: learn the tools and Ryu tutorial. Once we have completed the tutorials and learned the basics of mininet and Ryu, we are going to start working on the assignment explained earlier.
7.1 How do we create a topology?
Mininet gives us the ability to create custom topologies using Python. You are going to use the file mininet_topo.py as the starting point for implementing the topology shown in figure 2.1.
7.1.1 Add switches
Adding a switch to the topology is as simple as using the following API:
switch = self.addSwitch(<name>, protocols='OpenFlow13')
where name is the name of the newly added switch. For more information go here. HINT: The switch naming convention is important. The names given in the diagram may not work as valid switch names. Consider using canonical names for switches.
7.1.2 Add hosts
Adding a host to the topology is as simple as using the following API:
host = self.addHost(<host name>)
where host name is the name of the newly added host. For more information go here.
7.1.3 Add links
To add links between switches or hosts we use the following function:
configuration = dict(bw=<bandwidth>, delay=<delay>, max_queue_size=<queue size>, loss=0, use_htb=True)
addedLink = self.addLink(<node 1>, <node 2>, **configuration)
The configuration dictionary contains the information associated with the link to be created, including bandwidth, delay and maximum queue size. In the call to addLink, we pass the names of the two elements to be connected and the configuration dictionary.
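Before writing the Mininet calls, it can help to plan the node names and links in plain Python. The sketch below is one possible plan, under two assumptions that are not stated in the handout: every edge switch is linked to every core switch, and all switches use canonical names s1, s2, ... (core switches first) so that each switch name contains a distinct number. The function `plan_topology` is hypothetical; each returned switch, host, and link would map to one addSwitch, addHost, or addLink call in mininet_topo.py.

```python
from typing import List, Tuple


def plan_topology(num_cores: int, num_edge_switches: int,
                  hosts_per_edge: int) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Return (switches, hosts, links) for a two-level core/edge topology."""
    cores = [f"s{i}" for i in range(1, num_cores + 1)]
    # Edge switches continue the numbering so that no two switches share a number.
    edges = [f"s{num_cores + i}" for i in range(1, num_edge_switches + 1)]

    hosts: List[str] = []
    links: List[Tuple[str, str]] = []

    # Assumption: full mesh between the core level and the edge level.
    for edge in edges:
        for core in cores:
            links.append((core, edge))

    # Attach hosts_per_edge hosts to every edge switch.
    for index, edge in enumerate(edges):
        for j in range(hosts_per_edge):
            host = f"h{index * hosts_per_edge + j + 1}"
            hosts.append(host)
            links.append((edge, host))

    return cores + edges, hosts, links


switches, hosts, links = plan_topology(num_cores=2, num_edge_switches=3, hosts_per_edge=2)
print(switches)
print(hosts)
print(len(links))
```

If figure 2.1 wires the core and edge levels differently, only the first loop needs to change.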
8 Running the code
Copy the file mininet_topo.py to the virtual machine folder called ~/mininet/examples/. Open a new terminal, go to the mininet folder (cd ~/mininet) and run:
$ sudo python3 examples/mininet_topo.py
This command starts executing the custom topology. Copy the file ryu_controller.py to the virtual machine folder called ~/ryu/app. After that, open a second terminal, go to the ryu folder (cd ~/ryu) and run:
$ ryu-manager --verbose app/ryu_controller.py
This command runs the controller and shows all the events as they happen. As you probably noticed, figure 2.1 describes a topology that has loops. To cope with the loops, the controller calculates a spanning tree of the connections, which yields a loop-free graph, so that packets do not end up in a livelock when you run commands. You can read how this works in more detail on the Ryu book webpage here. After the controller has started, press enter on the mininet terminal. Wait a couple of minutes until the setup of all the switches is done; on the Ryu controller terminal, the number of messages flowing through the terminal will drop considerably. After this, you can start running tests on the mininet terminal. To stop the program, type exit in the first terminal and press ctrl + C in the second terminal. Before the next execution, run sudo mn -c to clear the state. This is an important step; otherwise unexpected errors can arise.
9 Verify your code
You can verify the configuration of your system using the ovs-ofctl command and other commands from the mininet CLI.
9.1 Hosts
To verify that all the nodes in the mininet topology were created, use the following command:
mininet> nodes
This command shows all the switches, controllers and hosts that were created with the custom topology.
9.2 Switch and port information
To get the information of all the ports created, use the following command on the mininet CLI:
mininet> sh ovs-ofctl show <switch name> --protocols=OpenFlow13
where switch name is the name given to the switch when it was created in section 7.1.1. To obtain the names of the switches and general information, use the following command on the mininet CLI:
mininet> sh sudo ovs-vsctl show
With the two previous commands you can verify that the switches were created correctly.
9.3 Links
To verify that the links were created correctly, use the command:
mininet> links
This shows the interfaces that are connected. You can then use the information from section 9.2 and correlate it with other parameters.
9.4 Test connectivity
mininet> pingall
9.5 Test the latency
mininet> <host1> ping <host2>
where host1 is the initiating host name and host2 is the receiving host.
9.6 Test bandwidth
mininet> iperf <host1> <host2>
where host1 is the initiating host name and host2 is the receiving host for the bandwidth testing.
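Testing bandwidth between every pair of hosts means enumerating all unordered pairs. A minimal sketch, assuming host names h1, h2, ... and a hypothetical helper `iperf_commands` that only builds the mininet CLI strings (it does not run them):

```python
from itertools import combinations
from typing import List


def iperf_commands(hosts: List[str]) -> List[str]:
    """One mininet CLI iperf command per unordered pair of hosts."""
    return [f"iperf {a} {b}" for a, b in combinations(hosts, 2)]


commands = iperf_commands(["h1", "h2", "h3"])
for command in commands:
    print(command)
```

With n hosts this produces n(n-1)/2 tests, so the run time of the test function grows quadratically with the number of hosts.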
10 Demo
The demo for this workshop covers the following. You should ensure that your topology is running before you start your grading session.
• List nodes on the command line to show that the given topology has been implemented.
• Run pingall to demonstrate that all hosts are reachable.
• Run your test function, which should loop through all tests and run iperf between host pairs.
• Show a wireshark trace between a switch and the controller and show the of filter in action. You may be asked a simple question to demonstrate a basic understanding of the OpenFlow messages.
• Discuss with your grader what bottlenecks you observed.
11 Deliverable
Deliverables for this week are as follows:
• Any topologies (mininet_topo.py and others) and setup scripts you have created in order to run your topologies.
• A report containing answers to the 5 bullet points listed under the TODO section of the workshop.
• Where you are asked to verify certain things, this can be as simple as a screenshot of a terminal session or wireshark, or a brief explanation of what you observed. There is no prescribed format here, but a PDF ensures we do not run into compatibility issues. If submitting a git repo, a clearly named markdown file in your repo would also be acceptable.
12 Useful References
If you want to understand each of the steps in this workshop, all the references can be found at the following links:
• Introduction presentation
• Introduction to Mininet
• Mininet Walk-through
• Mininet Documentation
• Mininet Paper
• OpenFlow Tutorial (Ryu)
• Ryu Documentation
• Ryu Tutorial
• Ryu Spanning Tree
College of Computing, Georgia Institute of Technology
SDN Workshop #2: VLAN and Northbound connections on SDN
This workshop is based on the Ryu router example from the Ryu book. The example can be found here.
1 Expected outcome
The student is going to learn about the Northbound connection to OpenFlow controllers: how to externally obtain information about the topology seen by the controller and how to create rules for the flows in the network. Additionally, the student is going to learn how to implement a Virtual Local Area Network (VLAN) using OpenFlow and the corresponding REST API.
2 Assumptions
This workshop assumes that you know how to make REST HTTP requests, plus all the required knowledge from the previous workshop.
2.1 Useful references
• What is REST?
• How to test a REST API from the command line with curl
• Postman - REST Client
3 Suggestions for the workshop
• Create a bash file for each of the steps described below, so that replication is easier. You could create one directory for each element in the network (each switch and host) and place the corresponding bash scripts in it.
4 Mininet
4.1 Create the topology
Using mininet, create the topology shown in figure 4.1 by running the following command:
$ sudo mn --topo linear,3,2 --mac --switch ovsk --controller remote -x
This command creates three switches with two hosts on each. The --mac flag automatically generates MAC addresses with readable values. Each switch is of type Open vSwitch (--switch ovsk) with a remote controller (in our case Ryu).
Figure 4.1: Linear Topology
4.2 Assign the correct OpenFlow version
On each of the switch terminals (s1, s2, and s3), run the following command to assign the correct version of OpenFlow to the switches:
$ ovs-vsctl set Bridge <switch name> protocols=OpenFlow13
where switch name is the corresponding switch name. The ovs-* commands allow us to control the different properties and operations of an Open vSwitch.
5 VLAN
Now we are going to create the virtual network on each of the hosts, using the following commands:
$ ip addr del <IP>/8 dev <host_network_interface>
$ ip link add link <host_network_interface> name <host_network_interface>.<vlan_id> type vlan id <vlan_id>
$ ip addr add <new_ip>/24 dev <host_network_interface>.<vlan_id>
$ ip link set dev <host_network_interface>.<vlan_id> up
where
• <IP>: is the IP obtained using ifconfig on the first network interface, e.g., 10.0.0.1.
• <host_network_interface>: is the name of the first network interface obtained when running ifconfig, e.g., h1s1-eth0.
• <vlan_id>: is the VLAN number shown in figure 4.1.
• <new_ip>: is the IP shown in figure 4.1.
The first command deletes the default IP assigned to the host for the given interface. The second command creates the virtual network, using a new network interface. All traffic routed will go through the main interface, but with a VLAN tag. You can verify the creation using ip link. The third command adds an IP to the new interface. The fourth command brings the interface up.
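Since the same four commands have to be run on every host, a small generator helps produce the per-host scripts suggested in section 3. The function name `vlan_commands` is hypothetical, and the VLAN ID and new IP used in the example call are placeholder values; take the real ones from figure 4.1.

```python
from typing import List


def vlan_commands(ip: str, interface: str, vlan_id: int, new_ip: str) -> List[str]:
    """Build the four ip commands that move a host onto a VLAN interface."""
    vlan_interface = f"{interface}.{vlan_id}"
    return [
        # 1. Remove the default address from the physical interface.
        f"ip addr del {ip}/8 dev {interface}",
        # 2. Create the VLAN sub-interface on top of it.
        f"ip link add link {interface} name {vlan_interface} type vlan id {vlan_id}",
        # 3. Give the sub-interface its new address.
        f"ip addr add {new_ip}/24 dev {vlan_interface}",
        # 4. Bring the sub-interface up.
        f"ip link set dev {vlan_interface} up",
    ]


# Placeholder values for illustration only.
script = vlan_commands("10.0.0.1", "h1s1-eth0", 2, "172.16.10.10")
print("\n".join(script))
```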
5.1 Turn on the controller
We are going to use the REST router application that is included in Ryu. Clone the ryu github repo from here. Start the controller (c0) by running the following commands:
$ cd ~/ryu/
$ ryu-manager ryu/app/rest_router.py
After starting, it should show that there is a WSGI application running at port 8080, along with the information of all the switches that were previously deployed.
5.2 Northbound: Using the REST API
Ryu controllers can expose an API that allows connections from external applications using a REST interface. This connection is known as the Northbound interface, and it can be used for external configuration of the network. In this example, it allows us to define the configuration parameters of the switches and the static routes.
5.2.1 Set Switch Address
We can set the address of a switch using a POST request, described as follows:
--- Setting an Address ---
Method: POST
URL: /router/switch[/{vlan}]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: address:"<xxx.xxx.xxx.xxx/xx>"
We can call this API with the Linux command curl. Run it on the command line of the controller (use xterm c0 on the mininet CLI to open the terminal):
c0> curl -X POST -d '{"address": "<new_ip>/24"}' http://localhost:8080/router/<switch_id>/<vlan_id>
where
• <new_ip>: IP to be assigned to the switch, e.g., 172.16.10.1.
• <switch_id>: Switch ID, assigned by Ryu. For this assignment, s1 is 0000000000000001, s2 is 0000000000000002 and s3 is 0000000000000003.
• <vlan_id>: is the ID of the VLAN, as shown in figure 4.1, e.g., 2 and 110.
Now, we are going to use this API to set the address for each switch. First, set the addresses "172.16.10.1/24" and "10.10.10.1/24" to router s1. They must be set to each VLAN ID (2 and 110). Do the same for s2 ("192.168.30.1/24" and "10.10.10.2/24") and s3 ("172.16.20.1/24" and "10.10.10.3/24"). The 10.10.10.x addresses are used for internal communication between the switches, the 172.16.x.x and 192.168.30.x addresses are used for communication between switches and hosts.
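The twelve address requests described above (three switches, two addresses each, on two VLANs) can be prepared programmatically. The following is a sketch using only the Python standard library; it builds the requests without sending them. The switch IDs assume that Ryu numbers s1, s2, s3 as datapaths 1, 2, 3, and the names `address_request` and `ADDRESSES` are illustrative. Sending would require the rest_router application to be running on localhost:8080.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8080"

# Addresses taken from the text above, keyed by (assumed) switch ID.
ADDRESSES = {
    "0000000000000001": ["172.16.10.1/24", "10.10.10.1/24"],
    "0000000000000002": ["192.168.30.1/24", "10.10.10.2/24"],
    "0000000000000003": ["172.16.20.1/24", "10.10.10.3/24"],
}
VLAN_IDS = (2, 110)


def address_request(switch_id: str, vlan_id: int, address: str) -> Request:
    """Build (but do not send) the POST request that sets one address."""
    body = json.dumps({"address": address}).encode("utf-8")
    url = f"{BASE_URL}/router/{switch_id}/{vlan_id}"
    return Request(url, data=body, method="POST")


requests_to_send = [
    address_request(switch_id, vlan_id, address)
    for switch_id, addresses in ADDRESSES.items()
    for vlan_id in VLAN_IDS
    for address in addresses
]

for request in requests_to_send:
    print(request.get_method(), request.full_url, request.data.decode())
    # To actually send it (controller must be running):
    # from urllib.request import urlopen
    # print(urlopen(request).read().decode())
```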
5.3 Register the gateway
For each of the hosts, we need to register the address of its switch as the default gateway, using the following command:
$ ip route add default via <switch_ip>
where:
• <switch_ip>: is the IP that was previously assigned to the switch.
Assign the following gateways to the hosts shown in figure 4.1:
• h1s1, h2s1: 172.16.10.1
• h1s2, h2s2: 192.168.30.1
• h1s3, h2s3: 172.16.20.1
The command has to be run on the command line of each host (open a host terminal by running xterm <host> on the mininet CLI), or it can be run directly on the mininet CLI:
mininet> <host> ip route add default via <switch_ip>
5.4 Static and Default Routes
5.4.1 Default Routes
Now we need to set up the default routes for each of the switches using the following API:
--- Setting Default Route ---
Method: POST
URL: /router/switch[/vlan]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: gateway:"<xxx.xxx.xxx.xxx>"
We set up the default routes as follows:
• Set router s2 as the default route of router s1.
• Set router s1 as the default route of router s2.
• Set router s2 as the default route of router s3.
Note: Use the 10.10.10.X addresses to define the gateways. You need to use curl as in section 5.2.1 to make the POST request; the operation needs to be done on terminal c0.
5.4.2 Static Routes
Finally, we are going to define the static route between the hosts using the following API:
--- Setting Static Routes ---
Method: POST
URL: /router/switch[/vlan]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: destination:"<xxx.xxx.xxx.xxx/xx>" gateway:"<xxx.xxx.xxx.xxx>"
For the s2 router, set a static route to the host network (172.16.20.0/24) under router s3. Only set it for vlan_id=2.
5.4.3 Test the rules
You can verify that all the rules were correctly installed using the following command:
c0> curl http://localhost:8080/router/all/all
If the previous route was correctly entered, the following ping run from the h1s1 command line should succeed:
h1s1> ping 172.16.20.10
But the following ping run from the h2s1 command line should fail:
h2s1> ping 172.16.20.11
Both hosts are in the same VLAN, but since router s2 does not have a static route to s3 for that VLAN, they cannot communicate.
5.4.4 Additional Static Routes
Create the required route such that h2s1 and h2s3 can communicate with each other.
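The default and static routes from sections 5.4.1 and 5.4.2 can be written down as data and turned into curl commands for terminal c0. This is a sketch under stated assumptions: the switch IDs follow Ryu's datapath numbering (s1 = 1, s2 = 2, s3 = 3), the gateways are the 10.10.10.X addresses assigned in section 5.2.1, and the helper names are hypothetical. The route for section 5.4.4 is intentionally left out.

```python
import json
from typing import Dict, Optional

S1, S2, S3 = "0000000000000001", "0000000000000002", "0000000000000003"


def default_route(gateway: str) -> Dict[str, str]:
    return {"gateway": gateway}


def static_route(destination: str, gateway: str) -> Dict[str, str]:
    return {"destination": destination, "gateway": gateway}


# (switch, payload): s1 -> s2, s2 -> s1, s3 -> s2.
DEFAULT_ROUTES = [
    (S1, default_route("10.10.10.2")),
    (S2, default_route("10.10.10.1")),
    (S3, default_route("10.10.10.2")),
]

# (switch, vlan, payload): on s2, reach 172.16.20.0/24 through s3, VLAN 2 only.
STATIC_ROUTES = [
    (S2, 2, static_route("172.16.20.0/24", "10.10.10.3")),
]


def curl_command(switch_id: str, payload: Dict[str, str],
                 vlan_id: Optional[int] = None) -> str:
    """Format the curl call; the VLAN part of the URL is optional."""
    path = f"/router/{switch_id}" + (f"/{vlan_id}" if vlan_id is not None else "")
    return f"curl -X POST -d '{json.dumps(payload)}' http://localhost:8080{path}"


for switch_id, payload in DEFAULT_ROUTES:
    print(curl_command(switch_id, payload))
for switch_id, vlan_id, payload in STATIC_ROUTES:
    print(curl_command(switch_id, payload, vlan_id))
```

Whether the default routes should be set per VLAN or for the whole switch depends on how the addresses were registered; pass `vlan_id` to `curl_command` if a per-VLAN route is needed.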
6 Optional
Create a Python program that manages the controller through the Northbound connection. You can use the coreapi or requests libraries to make the REST calls. The Python program needs to install all the static routes required for communication inside each VLAN; use the API specification in section 6.1. Additionally, try to answer the following questions:
• Verify that only the hosts in the same VLAN can communicate with each other.
• Can hosts in different VLANs have the same IP?
• How would you create an external application that creates the VLANs autonomously?
• How would you install firewalls inside a given VLAN, using only the APIs that you currently have?
• How would that design scale in a cluster with hundreds of machines?
6.1 Complete Router REST API
--- Acquiring the Setting ---
Method: GET
URL: /router/switch[/vlan]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Remarks: Specification of VLAN ID is optional.
--- Setting an Address ---
Method: POST
URL: /router/switch[/{vlan}]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: address:"<xxx.xxx.xxx.xxx/xx>"
Remarks: Perform address setting before performing route setting. Specification of VLAN ID is optional.
--- Setting Static Routes ---
Method: POST
URL: /router/switch[/vlan]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: destination:"<xxx.xxx.xxx.xxx/xx>" gateway:"<xxx.xxx.xxx.xxx>"
Remarks: Specification of VLAN ID is optional.
--- Setting Default Route ---
Method: POST
URL: /router/switch[/vlan]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: gateway:"<xxx.xxx.xxx.xxx>"
Remarks: Specification of VLAN ID is optional.
--- Deleting an Address ---
Method: DELETE
URL: /router/switch[/vlan]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: address_id:[ 1 - ... ]
Remarks: Specification of VLAN ID is optional.
--- Deleting a Route ---
Method: DELETE
URL: /router/switch[/vlan]
  - switch: [ "all" | Switch ID ]
  - vlan: [ "all" | VLAN ID ]
Data: route_id:[ 1 - ... ]
Remarks: Specification of VLAN ID is optional.
7 Demo
The demo for this workshop covers the following. You should ensure that your topology is running before you start your grading session.
• Show the grader the static rules that have been installed during this workshop. You will make a REST API call using the method described in section 5.4.3.
• Do a ping between hosts that the grader will provide to you. You will be asked to ping between at least two different hosts.
• You may be asked questions on the topology and the routes that have been created, to demonstrate your understanding of the available routes between hosts.
8 Deliverables
The deliverables for this workshop include all scripts created in order to fully set up the topology such that it is possible to ping between hosts. This may include bash scripts, or Python code if you have completed optional section 6. You will submit your repo and commit ID on Canvas.
9 Useful References
This workshop is an adaptation of the following tutorials in the Ryu documentation:
• OFCTL Ryu API
• VLAN in Ryu
Other useful references:
• What is a REST API?
• Django API client
• Incorporate REST API into the controller.
College of Computing, Georgia Institute of Technology
SDN Workshop #3: Using Ryu to handle SDN
In this module of the class, you will create your own set of rules to load balance paths in the presence of multiple paths between two hosts. Additionally for each path you are going to enforce a different type of quality of service (QoS) agreement. This workshop is based on the Multipath Load Balance example from muzixing, which can be found here (here is a translation in English).
1 EXPECTED OUTCOME
The student will learn about:
• The code structure of a Ryu controller application.
• The main messages sent between the controller and the switches.
• How the flow table is populated.
• How the controllers learn the closest port of other switches and hosts for a given switch, to avoid flooding the network each time a new packet appears.
• How to control OpenVSwitch queues to enforce rates.
• How to create group rules using the Ryu controller.
2 ASSUMPTIONS
This workshop assumes that the student knows the basics of Python programming and knows how to use decorators.
2.1 USEFUL REFERENCES
• Python Tutorial
• Python Decorators
3 SPECIFICATION
Using the framework given, implement the set of rules required to load balance paths in the presence of multiple paths between two hosts. The topology shown in figure 5.1 has a loop, which means that there are two possible paths to connect h1 to either h2 or h3. You are going to implement a set of static rules on the Ryu controller to distribute the flow between paths during a communication between the hosts. In this example, the communication path between host 1 and host 2 is going to be different from the communication path between host 1 and host 3.
4 GET CODE
Download the code from the repo: Workshop 3 git
5 IMPLEMENTATION
5.1 TOPOLOGY
Create the topology in figure 5.1 in the file loadbalancer_topo.py. Please make sure to use the same ports as shown on the topology image (ports are shown in parentheses).
Figure 5.1: Topology for the network
Copy the final file to the corresponding location:
$ cp loadbalancer_topo.py ~/mininet/examples
5.2 RYU RULES
Open the file loadbalancer_controller.py.
5.2.1 LEARN THE BASICS OF THE RYU API
Using the links in section 9, complete the variable assignments that have the comment TODO: 1). The important part is the 1).
5.3 COMPLETE THE MULTIPATH CALCULATION
First, you need to create the actions associated with switch one. Go to the comment #TODO: Complete switch one. We first create the actions for the two ports connected to switch 2 and switch 3. Here we select queue 0, and assign the action to the corresponding port. The queues are different for switch 2 and switch 3.

#group 7
port_1 = [out_port_1]
queue_1 = parser.OFPActionSetQueue(0)
actions_1 = [queue_1, parser.OFPActionOutput(port_1)]
port_2 = [out_port_2]
queue_2 = parser.OFPActionSetQueue(0)
actions_2 = [queue_2, parser.OFPActionOutput(port_2)]

The objective of the group rule is to split the packets coming from host 1 between the two available paths depending on the destination host address. For the following subsection of the code, the most relevant points are:
• We create the weights associated with each path. The meaning of these weights depends on the implementation of the OpenFlow protocol. More information can be found in section 10.
• We are analyzing all the ports and all the groups.
• We create two buckets associated with the objects created before. More information in section 10.
• And we send the group modification command to the switch.
weight_1 = 50
weight_2 = 50
watch_port = ofproto_v1_3.OFPP_ANY
watch_group = ofproto_v1_3.OFPQ_ALL
buckets = [
    ofp_parser.OFPBucket(weight_1, watch_port, watch_group, actions_1),
    ofp_parser.OFPBucket(weight_2, watch_port, watch_group, actions_2)]
group_id = [group_id]
req = ofp_parser.OFPGroupMod(datapath, ofproto.OFPFC_ADD,
                             ofproto.OFPGT_SELECT, group_id, buckets)
datapath.send_msg(req)

Try to answer the additional questions tagged with comment TODO: Final Questions. Copy the file to the corresponding location:
$ cp loadbalancer_controller.py ~/ryu/app/
6 DEPLOY AND TEST
NOTE: There is a known problem with the provided controller that severely limits bandwidth between h1 and h3. This is due to h3 never having to send an ARP request, so not all switches will learn its MAC address, resulting in excessive packet flooding. A simple way around this is to make h3 ping an unreachable host (mininet> h3 ping 10.0.0.4) before creating flows in the network.
6.1 RUN THE CODE
Run the topology using mininet. First, go to ~/mininet/examples/ (cd ~/mininet/examples)
$ sudo mn --custom loadbalancer_topo.py --topo topology --controller=remote --mac
Run the code for the controller. Go to ~/ryu/ (cd ~/ryu)
$ ./bin/ryu-manager ryu/app/loadbalancer_controller.py --verbose
6.2 VERIFY RULES
Verify the initial rules for each of the four switches (run this command on the mininet CLI)
mininet> sh ovs-ofctl -O OpenFlow13 dump-flows <switch>
6.3 SET QUEUES
We can control the allowed bandwidth for a given group by using the queue parameters as shown next (run it on the mininet CLI):
mininet> sh ovs-vsctl -- set Port <interface> qos=@newqos \
  -- --id=@newqos create QoS type=linux-htb \
  other-config:max-rate=<qos_max_rate> queues=0=@q0 \
  -- --id=@q0 create Queue other-config:min-rate=<queue_min_rate> \
  other-config:max-rate=<queue_max_rate>
where:
<interface>: the name of the network interface for the given switch in which the queue is to be installed, e.g. s1-eth2, as learned in workshop #2.
<qos_max_rate>: the max rate in bps that the queue is going to allow, based on QoS newqos.
<queue_min_rate>: queue minimum rate.
<queue_max_rate>: queue maximum rate.
Notice that we could use groups with only one bucket to assign a given bandwidth for a group of flow matches. For this workshop we are going to assign these values as shown in equation 6.1.
qos_max_rate = queue_min_rate = queue_max_rate = rate    (6.1)
We assign rate to be 300 000 000 for s1-eth2 (switch 1 port 2) and 150 000 000 for s1-eth3 (switch 1 port 3). These are the ports associated with the group rule.
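The queue setup above can be scripted so that the same rate is substituted in all three places required by equation 6.1. The following is a minimal sketch, not part of the provided framework: the helper name build_queue_command and the RATES table are hypothetical, and the command string simply mirrors the ovs-vsctl invocation from section 6.3.

```python
def build_queue_command(interface, rate_bps):
    """Return the ovs-vsctl command that installs queue 0 on `interface`.

    Follows equation 6.1: the QoS max rate and the queue min/max rates
    are all set to the same value (hypothetical helper, for illustration).
    """
    rate = int(rate_bps)
    return (
        f"ovs-vsctl -- set Port {interface} qos=@newqos "
        f"-- --id=@newqos create QoS type=linux-htb "
        f"other-config:max-rate={rate} queues=0=@q0 "
        f"-- --id=@q0 create Queue other-config:min-rate={rate} "
        f"other-config:max-rate={rate}"
    )


# Rates used in this workshop for the two ports of the group rule.
RATES = {"s1-eth2": 300_000_000, "s1-eth3": 150_000_000}

if __name__ == "__main__":
    for interface, rate in RATES.items():
        # Print the line to paste into the mininet CLI.
        print("mininet> sh " + build_queue_command(interface, rate))
```

Generating the command from one rate value avoids typing a different number by mistake in one of the three rate fields.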
6.4 VERIFY QUEUES
mininet> sh ovs-vsctl list queue
6.5 VERIFY CONNECTIVITY
Use pingall to verify connectivity between hosts as follows:
mininet> pingall
All hosts should be reachable at this point.
6.6 RUN THE TESTS
6.6.1 IPERF BETWEEN H1 AND H2
Run iperf between host 1 and host 2 using the following command:
mininet> iperf h1 h2
You can verify the bandwidth restriction that you added in section 6.3.
6.6.2 CHECK BYTES SENT
You can verify how many bytes are sent for each of the groups in switch 1 using the following command:
mininet> sh ovs-ofctl -O OpenFlow13 dump-group-stats
At this point, most of the packets should be sent through only one of the buckets.
6.6.3 IPERF BETWEEN H1 AND H3
Run iperf between host 1 and host 3 using the following command:
mininet> iperf h1 h3
6.6.4 CHECK BYTES SENT
Use the same commands as in section 6.6.2. There should be plenty of bytes sent through both buckets, which means that the rule is correctly splitting the messages depending on the destination host.
6.7 CHECK THE RULES AGAIN
Now verify again the rules and groups that are on the switches:
mininet> sh ovs-ofctl -O OpenFlow13 dump-flows <switch>
where <switch> is the name of the switch as learned in Workshop #2.
6.8 CHECK THE GROUP CREATION
sudo ovs-ofctl -O OpenFlow13 dump-groups s1
You should be able to verify the weight used for the buckets on the controller with these measurements, and the corresponding groups.
6.9 CHECK QUEUE INFORMATION
sudo ovs-ofctl queue-stats <switch> <port> -O OpenFlow13
where <switch> is the name of the switch and <port> is the numerical id of a given port.
6.10 CLEANING QUEUES
When you finish running the system you need to clean mininet by using sudo mn -c and erase all the associated queues.
$ sudo ovs-vsctl --all destroy qos
$ sudo ovs-vsctl --all destroy queue
We need to erase the QoS objects first; otherwise ovs-vsctl will complain when erasing the queues.
7 DELIVERABLES
By the end of this workshop, you will submit your modified loadbalancer_topo.py file and a PDF report containing screenshots of the following:
1. The topology using the visualization tool (you can learn more about this tool here).
2. The results of the rule dump on the switch showing clearly the packets flowing through both paths, for the corresponding rules.
3. Test runs of iperf verifying the max rate available for each path.
8 DEMONSTRATION
During the synchronous class you must demonstrate to the grader the following:
1. Start with the network topology running, before any iperf sessions have been run between hosts.
2. Run pingall to show all hosts are reachable.
3. Run iperf between hosts h1 and h2, then dump group stats to show that most of the bytes sent are through one bucket.
4. Run iperf between hosts h1 and h3, then dump group stats to show that bytes sent have now been load balanced between the two paths.
9 USEFUL REFERENCES
This workshop is a modified version of the following tutorial:
• Load-balance multipath application on Ryu
These are additional references that could help you better understand how Ryu and OpenFlow work.
• Understanding the Ryu API: Dissecting Simple Switch
• Ryu Description and L2 Switch example
• How to program Ryu Applications?
• Official Ryu Book
• Ryu Tutorials
• Full code description of a simple switch hub. Please take a look at the section Execution of Mininet; it has useful examples of how to use commands to inspect the status of the network.
• How to monitor the status of the network?
10 ADDITIONAL NOTES
10.1 BUCKETS AND GROUPS IN OPENVSWITCH
The OpenFlow specification can be too broad in certain parts (it does not enforce a specific policy). For example, on OpenVSwitch the bucket selection function changes depending on the version. Weight is not a factor when choosing a bucket. The following description was obtained from the mailing list of OpenVSwitch:

In Open vSwitch 2.3 and earlier, Open vSwitch used the destination Ethernet address to choose a bucket in a select group. Open vSwitch 2.4 and later by default hashes the source and destination Ethernet address, VLAN ID, Ethernet type, IPv4/v6 source and destination address and protocol, and for TCP and SCTP only, the source and destination ports. The hash is "symmetric", meaning that exchanging source and destination addresses does not change the bucket selection. Select groups in Open vSwitch 2.4 and later can be configured to use a different hash function, using a Netronome extension to the OpenFlow 1.5+ group_mod message. For more information, see Documentation/group-selection-method-property.txt in the Open vSwitch source tree. (OpenFlow 1.5 support in Open vSwitch is still experimental.)

The specification of OpenFlow 1.3 describes the selection of buckets in a group as follows:

[optional] select: Execute one bucket in the group. Packets are processed by a single bucket in the group, based on a switch-computed selection algorithm (e.g. hash on some user-configured tuple or simple round robin). All configuration and state for the selection algorithm is external to OpenFlow. The selection algorithm should implement equal load sharing and can optionally be based on bucket weights. When a port specified in a bucket in a select group goes down, the switch may restrict bucket selection to the remaining set (those with forwarding actions to live ports) instead of dropping packets destined to that port. This behavior may reduce the disruption of a downed link or switch.

The highlighted section shows how another switch could leverage the use of weights to choose a given path for a specific flow. This could be useful for implementing ECMP rules.
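To make the hash-based selection described above concrete, here is a small sketch of a symmetric, per-flow bucket choice. It is only an illustration of the idea: the function select_bucket and the use of SHA-256 are assumptions of this sketch and are not the hash that Open vSwitch actually implements.

```python
import hashlib


def select_bucket(flow, n_buckets):
    """Pick a bucket index from a symmetric hash of the flow fields.

    `flow` is (src_mac, dst_mac, src_ip, dst_ip, proto, src_port, dst_port).
    Sorting each (source, destination) pair makes the hash symmetric:
    swapping source and destination yields the same bucket.
    Illustrative only; this is not the Open vSwitch hash function.
    """
    src_mac, dst_mac, src_ip, dst_ip, proto, sport, dport = flow
    key = (
        tuple(sorted((src_mac, dst_mac))),
        tuple(sorted((src_ip, dst_ip))),
        proto,
        tuple(sorted((sport, dport))),
    )
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets


if __name__ == "__main__":
    forward = ("00:00:00:00:00:01", "00:00:00:00:00:02",
               "10.0.0.1", "10.0.0.2", 6, 40000, 5001)
    reverse = ("00:00:00:00:00:02", "00:00:00:00:00:01",
               "10.0.0.2", "10.0.0.1", 6, 5001, 40000)
    # Both directions of one connection land in the same bucket.
    print(select_bucket(forward, 2), select_bucket(reverse, 2))
```

Because the choice depends only on header fields, every packet of a single iperf connection maps to the same bucket, which is consistent with the observation in section 6.6.2 that most bytes go through one bucket.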
College of Computing, Georgia Institute of Technology
Project #1: Software-Defined Networks
1 Expected Outcome
The student is going to implement a set of OpenFlow rules that can adapt to network topology changes and to traffic changes. Additionally, to be able to test these rules, you are going to implement a host service in which you can control the traffic going in and out of that service.
2 Sections of the Project
2.1 Monitor
You need to instrument your code to be able to debug and understand what the network is doing. More specifically, your code should print periodically, with period T1, the following information:
• Per port information:
– byte count
– drop count
– error count
• Per flow information:
– byte count
– duration
– timeouts
T1 should be a parameter of the controller.
2.2 Useful References • Ryu Traffic Monitor • Passing parameters (documentation is sparse for this functionality).
2.3 Topology Discovery
Create a graph of the topology that the Ryu application is going to be controlling. This graph should adapt to changes in the topology when switches go up or down. The graph can be created using any graph library (e.g. igraph, NetworkX, etc.). You need to use the flag "--observe-links" when running the Ryu controller to be able to see the changes. Each time a switch is discovered or deleted, a log should be written to the terminal.
Given that the Ryu controller cannot detect the bandwidth and latencies of the connected links, we are going to define a configuration file "link_config" that is going to contain the information related to the links, with each configuration as follows:
{
  "input_port": <p1>,
  "output_port": <p2>,
  "bandwidth": <bw>,
  "latency": <lat>
}
where:
p1: is the port on the source switch.
p2: is the port on the destination switch.
bw: is the bandwidth associated with the link in Mbps, e.g. 100 will mean 100 Mbps bandwidth
lat: is the latency associated with the link in ms, e.g. 2 will mean 2 ms latency
The configuration file should be given as an input to both Mininet and the Ryu controller. This configuration should give you enough flexibility to do interesting tests, without having to write all the links in the system. However, this link file is not definitive: if you choose, you can include the switch names for more fine-grained link development. The only rule is that network discovery (i.e., when your controller discovers the network topology) cannot be done through the link configuration file.
The bandwidth and latencies associated with the link connecting a host to a switch should be:
{"bandwidth": 100, "latency": 2}
2.3.1 Useful Ryu Documentation
• Ryu events (located at /ryu/topology/event.py):
– EventSwitchEnter
– EventSwitchLeave
– EventSwitchReconnected
– EventLinkAdd
– EventLinkDelete
• Functions (located at ryu/topology/api.py):
– get_switch
– get_link
2.3.2 How to test your code?
You can test this code section by using the following mininet commands:
link <node1> <node2> down: takes down the link between node1 and node2.
link <node1> <node2> up: brings up the link between node1 and node2.
<node1> and <node2> could be either switches or hosts.
2.4 Static Rules
2.4.1 Shortest-path
Create a Ryu application that uses the topology graph created in the Topology module to create rules that implement the shortest path. The shortest path is defined as the path with the minimum total latency. The corresponding handler (PacketIn) needs to install entries that match packets based on the destination MAC address and Ethernet type. The switch should send the packet through a port that is connected to a path that is the shortest path to the destination.
Specific details
• If multiple paths have the same length, choose the next hop randomly.
• Your Ryu application should be able to handle loops in the topologies.
• Each time the shortest path is found, install the corresponding rules in all the switches in the path.
• To calculate the shortest path use the functions available in your graph library.
Additionally, you should remember that SDN switches don’t learn MAC addresses by themselves, so:
• Each time an incoming message from a host arrives, verify whether you already know its MAC address and its associated IP address. Notice that the port to which the host is connected can be useful later.
• If you don’t know the MAC address of the destination, then you need to flood the network (ofproto.OFPP_FLOOD). You can find more information about this process here. Be careful: that implementation is not the same as the one you are going to use, because of the additional step of calculating the shortest path.
2.4.2 Widest-path
Similar to the shortest-path, create a Ryu application that uses the topology graph created in the Topology module to create rules that implement the widest path. The corresponding handler (PacketIn) needs to install entries that match packets based on the destination MAC address and Ethernet type. The switch should send the packet through a port that is connected to a path that is the widest path to the destination.
Specific details
• The widest path is determined by the bandwidth of the links.
• If multiple paths have the same bandwidth, choose the next hop randomly.
• Your Ryu application should be able to handle loops in the topologies.
• Each time the widest path is found, install the corresponding rules in all the switches in the path.
• To calculate the widest path you need to implement a small modification of the Dijkstra algorithm. It is not sufficient to invert the bandwidth values of the graph and then run unmodified Dijkstra, as Dijkstra does not work with negative weights. Your modification to Dijkstra’s algorithm should maximize the minimum edge weight along the path. A brute force approach using iGraph’s or NetworkX’s all simple paths algorithm will not be accepted.
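The modification to Dijkstra mentioned above can be sketched as follows: instead of summing edge weights and minimizing, the width of a path is the minimum bandwidth along it, and the algorithm always expands the node with the largest known width. This is a minimal sketch on a plain adjacency dictionary with string node names (both assumptions of the example); it does not do the random tie-breaking the specification asks for and is not tied to any graph library.

```python
import heapq


def widest_path(graph, src, dst):
    """Return (width, path) for the maximum-bottleneck path from src to dst.

    `graph` maps node -> {neighbor: bandwidth}. Returns (0, []) when dst
    is unreachable. Ties are broken arbitrarily, not randomly.
    """
    best = {src: float("inf")}   # widest known bottleneck to each node
    prev = {}                    # predecessor on the widest known path
    heap = [(-float("inf"), src)]  # max-heap via negated widths
    done = set()
    while heap:
        neg_width, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            break
        for nbr, bandwidth in graph.get(node, {}).items():
            # Width of the path through `node` is limited by its narrowest link.
            width = min(-neg_width, bandwidth)
            if width > best.get(nbr, 0):
                best[nbr] = width
                prev[nbr] = node
                heapq.heappush(heap, (-width, nbr))
    if dst not in best:
        return 0, []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return best[dst], path[::-1]


if __name__ == "__main__":
    topo = {
        "s1": {"s2": 100, "s3": 50},
        "s2": {"s1": 100, "s4": 20},
        "s3": {"s1": 50, "s4": 50},
        "s4": {"s2": 20, "s3": 50},
    }
    print(widest_path(topo, "s1", "s4"))
```

In the sample topology the route through s2 starts on a faster link but is limited to 20 by its second hop, so the route through s3 with bottleneck 50 is selected.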
2.5 Proactive Rule
The rules described previously are calculated once and don’t take into account statistics of network usage such as the ones obtained in the Monitor section. You are going to add additional logic to the Widest-path controller so that it adapts to changes in the network. For each flow in the network, you are going to maintain a list that contains the bandwidth that has been used, as captured by the Monitor module. The size of the list, S1, should be a parameter of the controller input, similar to T1 in the Monitor module. S1 defines how many data points will be stored. For example, if T1 is 5 and S1 is 5, your list will contain 5 data points taken at 5 second intervals. This list should act as a FIFO queue, so taking the average of this list gives a rolling average of bandwidth usage for the past 25 seconds in the example given. Students are encouraged to explore statistical measures other than the average, depending on assumptions about the workloads that will be used in a datacenter environment. Consider tradeoffs related to length, bandwidth, and transiency of flows along with the network topology itself.
Each time a new flow needs to be installed (PacketIn event), subtract the average of the list for each link from the total bandwidth available at each link in the original graph, and calculate the next hop using the widest available path. If the host seems unreachable, fall back to the static rules.
2.5.1 Redistribute
Additionally, every T2 seconds you are going to redistribute the flows. The controller should maintain the information related to the bytes sent between two hosts (src,dst,bytes), called comm_list. This data structure can have more information associated with it, as you see fit for your implementation. Using this information the controller should implement the following scheme:
1. Initialize the topology graph with the default bandwidth values.
2. Initialize the list of rules to be installed to empty.
3. Sort the comm_list from more packets to fewer packets sent.
4. For each element in comm_list:
a) Find the widest path and add the required rules to the list of rules to be installed.
b) Reduce the corresponding links of the topology graph with the current average of bytes sent for the given (src,dst,packet) tuple.
5. Apply all the generated rules
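The five steps above can be sketched as a single planning function. This is only one possible reading of the scheme: the dictionary keys "bytes" and "avg_bw", the path-finder callback find_widest_path, and the choice to return None when a pair is unreachable are all assumptions of this sketch, and installing the rules on switches is left to the caller.

```python
import copy


def redistribute(base_graph, comm_list, find_widest_path):
    """Plan new paths for all flows, largest senders first.

    base_graph: node -> {neighbor: bandwidth} with the default bandwidths.
    comm_list: list of dicts with "src", "dst", "bytes" and "avg_bw" keys
        (hypothetical layout).
    find_widest_path(graph, src, dst) -> (width, path); path is [] when
        the destination is unreachable.
    Returns a list of (src, dst, path) rules, or None if any pair is
    unreachable, in which case nothing should be applied.
    """
    graph = copy.deepcopy(base_graph)      # step 1: start from defaults
    rules = []                             # step 2: empty rule list
    ordered = sorted(comm_list, key=lambda c: c["bytes"], reverse=True)  # step 3
    for comm in ordered:                   # step 4
        _, path = find_widest_path(graph, comm["src"], comm["dst"])
        if not path:
            return None                    # keep the currently installed rules
        rules.append((comm["src"], comm["dst"], path))        # step 4a
        for a, b in zip(path, path[1:]):                      # step 4b
            graph[a][b] = max(graph[a][b] - comm["avg_bw"], 0)
            if a in graph.get(b, {}):      # reduce the reverse direction too
                graph[b][a] = max(graph[b][a] - comm["avg_bw"], 0)
    return rules                           # step 5: caller applies these
```

Working on a deep copy keeps the default bandwidths intact for the next redistribution round, and returning None for an unreachable pair matches the requirement that new rules are not applied in that case.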
T2 is also an input parameter of the controller. Specific details • If any host is unreachable after calculating the newer widest paths on step four, then the new rules should not be applied. • S1 and T2 should be parameters of the controller. Suggestions • Be careful in how you clean and update the rules. • Race conditions could arise when updating rules, try to analyze the possible corner scenarios before you start coding.
2.6 Testing Service
Mininet allows us to define the code that is going to be used on each of the hosts. To be able to test the previous code correctly, you are going to implement a role and a host type similar to the ones found in the second workshop (more in Useful References). You should design your own testing plan as you see fit, but the testing service should have at least the following functionality:
• The test service should be able to read a config file containing multiple test cases of host pairs.
• The config file should contain the pair of hosts involved, the bandwidth to be consumed and the time periods in which the communication is to happen.
• Multiple iperf flows should be able to run simultaneously throughout the network (i.e. a flow from h1 to h2 and another flow from h3 to h4 should be able to run at the same time). An easy way to do this is with Python’s threading or multiprocessing libraries.
An example config could be (a single fragment of a more complex test):
{
  server: "h1",
  client: "h2",
  test: [
    {begin: "0", end: "1", bandwidth: 100},
    {begin: "2", end: "3", bandwidth: 100},
    ...
  ]
}
The pair of hosts should perform the test defined in the config file. There is no need to synchronize different hosts such that the timestamps match when running, so you should plan the testing such that tight coupling in the execution time of code running on different hosts is not required.
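As a rough illustration of how such a config could drive overlapping flows, the sketch below flattens the test cases into a schedule and starts one timer per step. It assumes the config is stored as valid JSON with quoted keys and that begin and end are expressed in seconds; neither is fixed by the project text. The launch callback is hypothetical and stands in for whatever starts the actual iperf client on a host.

```python
import json
import threading


def build_schedule(config_text):
    """Flatten the config into (begin, duration, server, client, bandwidth)."""
    cases = json.loads(config_text)
    if isinstance(cases, dict):          # allow a single test case
        cases = [cases]
    schedule = []
    for case in cases:
        for step in case["test"]:
            begin, end = float(step["begin"]), float(step["end"])
            schedule.append((begin, end - begin, case["server"],
                             case["client"], step["bandwidth"]))
    return sorted(schedule)


def run_schedule(schedule, launch, time_unit=1.0):
    """Start one timer per step so that flows can overlap, then wait."""
    timers = [
        threading.Timer(begin * time_unit, launch,
                        args=(server, client, duration * time_unit, bandwidth))
        for begin, duration, server, client, bandwidth in schedule
    ]
    for timer in timers:
        timer.start()
    for timer in timers:
        timer.join()


if __name__ == "__main__":
    text = ('{"server": "h1", "client": "h2", "test": ['
            '{"begin": "0", "end": "1", "bandwidth": 100},'
            '{"begin": "2", "end": "3", "bandwidth": 100}]}')
    # Print instead of launching iperf; shrink time so the demo ends quickly.
    run_schedule(build_schedule(text),
                 lambda *step: print("start flow", step), time_unit=0.1)
```

Each step runs in its own timer thread, so a flow from h1 to h2 and another from h3 to h4 can be active at the same time without any coordination between hosts.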
3 Deliverables 3.1 Code Include the code for each of the previous sections. It should contain a README that explains how to install all the required dependencies in the class Virtual Machine.
3.2 Test Cases Include all the test cases used to test each module and the commands required to run them.
3.3 Written Report
The written report needs to include the following components:
• Compare the latency and bandwidth perceived by using the widest-path and shortest-path rules. Create different topologies that can benefit from each of them. You need to deliver a graphical representation of the topology and the evaluations associated with them.
• Modify T1, S1, and T2 and evaluate the effects of the proactive rules. Test it with the same topologies used for the widest-path static rules, and explain how the rules affect the perceived latency and bandwidths for the hosts. Use your testing service to perform the tests. Additionally, show with graphs the loads perceived by the switches.
• Explain the race condition scenarios that can arise when applying proactive rules.
• Explain the benefits and drawbacks of the proactive approach.
• Explain how you can improve the proactive approach.
Requirements
Tests and evaluations performed should be meaningful, e.g. they should show the capabilities or drawbacks of a specific solution.
Suggestions
• Topologies should use non-uniform bandwidths and/or latencies.
• Many test cases are not required. Focus on a few of them, but explain the implication of the parameters selected.
• Log the monitoring information in an easy to parse format, such that you can auto generate all the required graphs using python or a bash script.
3.4 Demo • Show an in-class demo to the TA of your four different controller designs (shortest path, widest path, proactive, proactive with redistribution). For each design: – Show the TA a graph of the topology that you are going to use to demonstrate your controller. The topology you present to the TA should clearly show its links and weights for latency and bandwidth. Using the same topology for different controller designs is fine, but the topology must be complicated enough to demonstrate your controller’s functionality. – Explain what flows your testing service is going to generate and where they should be installed on the topology. Explain how these flows will demonstrate what your controller design is supposed to do. – Run your testing service with your controller design to show that the flows are installed where you expect them to be. • The order of controller designs to be presented should be: 1. Shortest Path 2. Widest Path 3. Proactive 4. Proactive with Redistribution • Your Proactive and Proactive with Redistribution demos should show when rules are changed based on new incoming flows and after the end of flows.
4 Frequently Asked Questions
1. Do I have to use igraph?
For creating graphs to track your network, we recommended using python igraph in the documentation, but using python’s networkx works fine too: Igraph guide, networkx tutorial
2. When I do a link up or link down command, the ryu event handlers cannot capture them, what’s wrong?
Be sure to add the --observe-links flag when running your ryu controller, otherwise you will not be able to capture link related events in ryu.
3. How do I discover host information in ryu?
You may need to get information about hosts in ryu. This can be done using ARP and the mac_to_port table, but this can also be done in other ways:
a) Using Ryu Topology’s EventHostAdd event: https://github.com/faucetsdn/ryu/blob/master/ryu/
b) Using Ryu Topology’s get_host API: https://github.com/faucetsdn/ryu/blob/master/ryu/topology/ap
4. How do I handle threads for redistribution every T2 seconds or monitoring every T1 seconds?
Ryu provides functions for creating threads that work well with the ryu controller in their hub module, specifically the spawn and spawn_after functions: https://github.com/faucetsdn/ryu/blob/master/ryu/lib/hub.py Be careful of race conditions on shared data structures.
5. What does “shortest path” mean between two hosts, the path with lowest latency or the path with least hops?
We are looking for lowest latency as opposed to the path with least hops. igraph has some good API for this, and networkx does as well. https://igraph.org/r/doc/distances.html
6. For the shortest path implementation, can I run the algorithm to install the path for a particular flow only once?
This depends on your implementation. If you have discovered the entire topology, hosts, links, etc. upon initialization of the network, when a new flow is added, you only need to run the shortest path algorithm once. If your topology discovery is incremental, or lags behind requests between hosts, you may need to re-run your shortest path algorithm on existing flows. Note that we are not looking for any redistribution outlined in section 2.5.1 of the pdf for the shortest path implementation. The flow rules in the shortest path implementation should be mostly static.
7. What is the relationship between comm_list, number of flows and the S1 parameter?
For each flow in the network, you will maintain a list of size S1 to track the number of bytes transmitted in the flow in some data structure, let’s call this flow_monitor.
In your monitoring section, every T1 seconds, you will update flow_monitor for every flow in the network w… flow_monitor tracks the S1 latest bytes transmitted for each flow.
The comm_list data structure tracks the average bandwidth used for each flow, and will use information…
Every T2 seconds, you will need to use the average bandwidth used for each flow information in comm_list to do the redistribution algorithm.
8. What if my host is unreachable when I’m redistributing?
When redistributing, whether your host went down, or there is no available bandwidth left, or a link in the middle went down making it impossible for 2 hosts to communicate, do not update the flows with any new rules. You can choose to delete old flow rules if that fits your implementation. You can also make an assumption that during redistribution, the topology will not change, and any topology changes during this redistribution attempt will be handled the next time redistribution is run. Either way, this depends on your implementation and design choices.
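The flow_monitor structure discussed in question 7 can be sketched with a bounded deque per flow, which gives the FIFO behaviour of the S1-sized list for free. The class below is a minimal sketch under two assumptions: the byte count passed in is the cumulative counter reported for the flow, and each sample is stored as a bandwidth (bytes per second) rather than a raw byte count. The class and method names are hypothetical.

```python
from collections import deque


class FlowMonitor:
    """Keep the S1 most recent bandwidth samples for every flow."""

    def __init__(self, s1, t1):
        self.s1 = s1            # number of samples kept per flow
        self.t1 = t1            # sampling period in seconds
        self.samples = {}       # flow key -> deque of bytes/s samples
        self.last_bytes = {}    # flow key -> last cumulative byte count

    def update(self, flow, byte_count):
        """Record one sample; call once per flow every T1 seconds."""
        previous = self.last_bytes.get(flow, 0)
        self.last_bytes[flow] = byte_count
        # deque(maxlen=S1) drops the oldest sample automatically (FIFO).
        window = self.samples.setdefault(flow, deque(maxlen=self.s1))
        window.append(max(byte_count - previous, 0) / self.t1)

    def average(self, flow):
        """Rolling average bandwidth over the last S1 * T1 seconds."""
        window = self.samples.get(flow)
        return sum(window) / len(window) if window else 0.0


if __name__ == "__main__":
    monitor = FlowMonitor(s1=2, t1=5)
    for total in (500, 1500, 1500):
        monitor.update(("h1", "h2"), total)
    print(monitor.average(("h1", "h2")))
```

The value returned by average is the kind of per-flow figure that comm_list would carry into the redistribution step every T2 seconds.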
System Issues in Cloud Computing Mini-course: Network Function Virtualization KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Headshot “Welcome to this mini-course on NFV. You have already been through the first three legs of this journey in which we covered SDN, Systems issues, and Application issues in Cloud computing. Now we take you through the 4th leg of this journey. Simply put, the purpose of Network Function Virtualization is to decouple the network services needed for any enterprise (such as firewall and malware inspection) from proprietary hardware appliances, often referred to as “Middleboxes”, so that such services can be run as software entities on standardized servers using IT virtualization technologies. Thus the network services become virtual network functions. As has been true with the earlier legs of the journey, there are hands-on workshops designed to go with the lectures plus a project to tie
4 Modules Software Defined Networking (SDN)
Cloud Applications (APP)
=>
Organization of the mini course
• Lecture 1: introduction to types of network functions and the pathway to virtualizing them • Lecture 2: A concrete example of a technology enabler for virtualizing network functions • Lecture 3: Synergy between SDN and
Lecture 1: Introduction to Network Functions
Outline
● What are network functions?
● Middleboxes for realizing network functions as standalone services
● Network management and proliferation of middleboxes
● Network services as software entities
● Virtualization technology for hosting network services as software entities
What are network functions?
● Firewall
○ Filters traffic based on pre-defined rules.
○ Rules are simple since filtering is in the critical path of packet flow
● Intrusion detection/prevention
○ Perform more complicated analysis of packet traffic
○ Identify complex patterns of network traffic belonging to an attack/suspicious activity
● Network Address Translation (NAT)
○ Translates private IP address space to public IP address space and vice versa
○ Useful for organizations that have a limited public IP network presence
● WAN Optimizers
○ Reduce WAN bandwidth consumption of an enterprise
○ Perform multiple techniques like caching, traffic compression, etc. for reducing traffic and latency
● Load balancer
○ Distribute traffic to a pool of backend services
● Virtual Private Network (VPN) Gateway
○ Provides abstraction of same IP address space for networks that are physically separate
○ Multiple sites communicate over WAN using tunnels between gateways
Why do enterprises need these network functions?
User’s view of an enterprise
[Figure labels: Google, eBay, Amazon]
Why do enterprises need these network functions?
Internal view of an enterprise computing environment
● Clusters of machines serving many internal functions
○ Sales, marketing, inventory, purchasing, etc.
○ Employees accessing them on-premises, and remotely
● Enterprises may have
[Figure labels: Regional office, Regional office, Head office, internet]
[Figure labels: Microsoft, Intel, Samsung, internet]
Why do enterprises need these network functions?
● Network functions give the necessary safeguards and facilities for enterprises
○ Intrusion Prevention: Performs inspection of packet payload to identify suspicious traffic
○ Firewall: Filters packets based on their src, dst IPs, ports and protocol
○ Load balancer: Evenly distributes incoming connections to one of the backend servers
○ WAN Accelerator: Reduces WAN bandwidth consumption by data deduplication and compression
○ VPN: Provides illusion of same network address space across multiple sites. Provides encryption for inter-site traffic.
Middleboxes
● Standalone hardware boxes (aka network appliances) providing specific network functions (e.g., firewall)
● Example of Middleboxes deployed in an enterprise
Consider the example of a retail organization (like Walmart) that holds inventory information on premises, but uses an enterprise datacenter for long-running batch processing (demand prediction, etc.) End-clients communicate with on-premise application Needs to scale horizontally to handle peak traffic => need for load balancer Limit ports for traffic => need for firewall Detect/prevent suspicious activity => need for Intrusion Prevention Communication with enterprise datacenter Need for VPN for encryption of traffic and illusion of continuous IP address space Need for WAN accelerator to reduce WAN bandwidth usage (reduce $$) Office personnel access content on the Internet
Middleboxes (or network appliances)
● Computer networking devices that analyze/modify packets
  ○ For purposes other than packet forwarding
● Typically implemented as specialized hardware components
An Example: Intrusion Prevention System (IPS)
● Security appliance
● Monitors all open connections to detect and block suspicious traffic
● Sysadmin configures signatures in IPS box to detect suspicious traffic
● Can work in inline mode (can filter out suspicious traffic) or passive mode (analyzes packets outside critical/data path)
[Table screenshot: the traffic signatures that Cisco's IPS is pre-configured with. The system admin can select all or any of these signatures to be searched for in packet traffic. This particular screenshot is a search result for "botnet", showing 10 signatures that characterize botnet attack traffic. Source: https://tools.cisco.com/security/center/ipshome.x]
[Figure: Cisco IPS 4240 Sensor (source: https://www.cisco.com/c/en/us/support/security/ips-4240-sensor/model.html)]
Another Example: HTTP Proxy
● Performance-improving appliance
● Caches web content to reduce page-load time
● Reduces bandwidth consumption
● Can filter out blocked websites
[Figure: Cisco Web Security Appliance S170 (source: https://www.cisco.com/c/en/us/support/security/web-security-appliance-s170/model.html)]
Middleboxes in core cellular networks
1. Serving Gateway (S-GW)
   a. Responsible for routing/forwarding of packets
   b. Executes handoff between neighbouring base stations
2. Packet Gateway (P-GW)
   a. Acts as interface between cellular network and Internet
   b. NAT between internal IP subnet and Internet
   c. Traffic shaping
3. Mobility Management Entity (MME)
   a. Key control node of LTE
   b. Performs selection of S-GW and P-GW
   c. Sets up connection when device is roaming
4. Home Subscriber Server
   a. User identification and addressing using IMSI number
   b. User profile info: service subscription rates and QoS
Source: https://ccronline.sigcomm.org/wp-content/uploads/2019/05/acmdl19-289.pdf
How are middleboxes different from routers/switches?
● Middleboxes are stateful
  ○ Packet processing is dependent on fine-grained state
  ○ State is updated frequently (per packet/per flow/per connection)
● Middleboxes perform complex and varied operations on packets
Network management and proliferation of middleboxes

Function                          Middleboxes
Security                          Firewall, IDS, IPS, …
Performance enhancement           HTTP proxy, WAN accelerator, Load balancer, …
Cross-protocol interoperability   NAT (IPv6 ↔ IPv4), VPN, …
Billing and usage monitoring      …
Network management and proliferation of middleboxes
● Similar challenges that motivated shift of IT to cloud services
● Leads to lock-in to the hardware vendor of each specific middlebox
  ○ Difficult and expensive to migrate to a different solution
● Failures of middleboxes lead to network outages
● High capital and operational expenditure
  ○ Provisioning is done based on peak capacity
  ○ Management/maintenance cost is high
Network functions as software entities on COTS servers
● Replace middleboxes by software entities
● Run such network functions as an “application” on general-purpose servers
● Benefits
  ○ Low cost of deployment
  ○ Better resource utilization
  ○ Scaling is easily possible: lower CapEx
  ○ Can switch between vendors easily
  ○ Failures are easier to deal with
Examples of “software” middleboxes
● Linux iptables: provides NAT and Firewall
● SoftEther VPN
● Squid HTTP proxy
● nginx load balancers
● Bro Intrusion Detection System (circa 1999)
Fundamental components of software middleboxes
● Use Unix sockets → opening a socket creates a file descriptor
● Use the read() and write() system calls into the Linux kernel for reading from and writing to a socket
  ○ Raw Linux sockets enable the developer to read/write raw bytes (MAC layer data) from/to the NIC
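As a minimal sketch of the socket-as-file-descriptor idea, the snippet below creates a connected pair of Unix sockets and moves bytes between them with the generic read()/write() system calls. The socket pair is only a stand-in so the example runs without privileges; a real software middlebox would more likely open a raw socket on a NIC, which typically requires root and is not shown here.

```python
import os
import socket

# A connected pair of Unix stream sockets stands in for a NIC-facing socket.
left, right = socket.socketpair()

# Opening a socket yields an ordinary integer file descriptor.
fd_left, fd_right = left.fileno(), right.fileno()

# The generic write()/read() system calls operate on those descriptors.
os.write(fd_left, b"hello NF")
payload = os.read(fd_right, 1024)

left.close()
right.close()
```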
Architecture of a load balancer network function
● Distribute client connections to a pool of backend service instances
  ○ For example HTTP Server
● Use packet’s 5-tuple to choose backend instance
  ○ Provides connection-level affinity
  ○ Same connection is sent to same backend instance
[Figure: the Load Balancer updates and looks up a table mapping packet 5-tuple → backend service instance, and forwards to a Backend Pool (Backend Instance 0, Backend Instance 1, Backend Instance 2)]
Architecture of a load balancer network function
[Figure: packet-processing flow against the table mapping packet 5-tuple → backend service instance]
1. Read packet (recvfrom() system call for reading packets into buffer)
2. Extract connection info from header
3. Lookup connection info in table
   ○ If match found: use the recorded backend instance
   ○ If match not found: select backend instance and add to table
4. Send packet to backend instance (sendto() call to send packet out)
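The lookup-or-select flow above can be sketched in a few lines. This is an illustration only: the `FiveTuple` and `LoadBalancer` names, the backend names, and the CRC32-based selection policy are assumptions made for the example, not something the lecture prescribes.

```python
import zlib
from typing import Dict, List, NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


class LoadBalancer:
    def __init__(self, backends: List[str]) -> None:
        self.backends = backends
        # Connection table: packet 5-tuple -> backend service instance.
        self.table: Dict[FiveTuple, str] = {}

    def select_backend(self, conn: FiveTuple) -> str:
        # Hypothetical policy: hash the 5-tuple onto the backend pool.
        key = "|".join(map(str, conn)).encode()
        return self.backends[zlib.crc32(key) % len(self.backends)]

    def route(self, conn: FiveTuple) -> str:
        backend = self.table.get(conn)           # lookup
        if backend is None:                      # no match: select and record
            backend = self.select_backend(conn)
            self.table[conn] = backend
        return backend                           # caller would sendto() here


lb = LoadBalancer(["backend-0", "backend-1", "backend-2"])
conn = FiveTuple("10.0.0.5", "10.0.1.1", 40000, 80, "TCP")
first = lb.route(conn)
second = lb.route(conn)   # same connection, same backend (affinity)
```

Recording the choice in the table, rather than re-hashing every packet, is what keeps a connection pinned to its backend even if the pool changes later.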
What happens when a packet arrives?
1. NIC uses Direct Memory Access to write incoming packet to memory
2. NIC generates an interrupt
3. CPU handles the interrupt, allocates kernel buffer and copies DMA’d packet into buffer for IP and TCP processing
4. After protocol processing, packet payload is copied to application buffer (user-space)
Why virtualization for NF?
Using a VM for hosting an NF (instead of running the NF on bare metal servers)
● Better portability because entire environment can be deployed
  ○ All dependencies are inside VM image
● Network management becomes easier
● Each NF instance is shielded from software faults from other network services
How to virtualize?
● Traditionally two approaches
  ○ Full virtualization
  ○ Paravirtualization
● Full virtualization is attractive since the VM on top of the hypervisor can run unmodified
  ○ “Trap-and-emulate” technique in the hypervisor to carry out privileged operations of the VM, which is running in user mode
  ○ Unfortunately, for network functions that are in the critical path of packet processing this is bad news…
How “Trap-and-emulate” works
● I/O is performed via system calls
● When guest VM performs I/O operation
  ○ Executes system call
  ○ Guest kernel is context switched in
  ○ Privileged instructions are invoked for reading/writing to I/O device
● But Guest kernel is actually running in userspace!
  ○ Guest VM is a user-space program from the host’s perspective
  ○ Execution of privileged instruction by user-space program results in a trap
● Trap is caught by the hypervisor
  ○ Performs the I/O on behalf of the Guest VM
  ○ Notifies the Guest VM after I/O operation finishes
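The control flow of trap-and-emulate can be mimicked with a toy model in which a Python exception plays the role of the hardware trap. Everything here (`PrivilegedInstructionTrap`, `Hypervisor`, `guest_kernel_write`) is a hypothetical illustration of the sequence described above, not how any real hypervisor is written.

```python
class PrivilegedInstructionTrap(Exception):
    """Stands in for the trap raised when user-mode code runs a privileged op."""

    def __init__(self, op, data):
        super().__init__(op)
        self.op = op
        self.data = data


class Hypervisor:
    def __init__(self):
        self.device_log = []   # I/O actually performed on the "device"

    def run(self, guest_code):
        try:
            guest_code()
        except PrivilegedInstructionTrap as trap:
            # Trap caught: perform the I/O on behalf of the guest...
            self.device_log.append((trap.op, trap.data))
            # ...then notify the guest that the operation finished.
            return "io-complete"
        return "no-io"


def guest_kernel_write():
    # The guest kernel runs in user mode, so its privileged I/O traps.
    raise PrivilegedInstructionTrap("nic_write", b"pkt")


hv = Hypervisor()
status = hv.run(guest_kernel_write)
```

Every guest I/O operation takes this detour through the hypervisor, which is why the cost matters for a per-packet workload.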
Downsides of “Trap-and-emulate” for NF
● Host kernel (e.g., Dom-0 in Xen) has to be context switched in by the hypervisor to activate the network device driver and access the hardware NIC
● Duplication of work by the virtual device driver in the Guest and the actual device driver in the Host
● NF incurs the above overheads
  ○ For each packet that is sent to the NIC
  ○ For each packet received from the NIC
● NF is in the critical path of network processing and such overheads are untenable
Eliminating the Overhead of Virtualization for NF
● Fortunately, hardware vendors have been paying attention
● We will mention two approaches to eliminate I/O virtualization overheads
  ○ Intel VT-d
  ○ SR-IOV (a PCI-SIG extension to PCIe, supported by Intel NICs)
Enabling technologies for virtualized NFs
1) Intel® Virtualization Technology for Directed I/O (VT-d)
● Allows efficient access to host I/O devices (e.g., NIC)
● Avoids overheads of trap-and-emulate for every I/O access
● Allows remapping of DMA regions to guest physical memory
● Allows interrupt remapping to guest’s interrupt handlers
● Effectively direct access for guest machine to I/O
Benefits of VT-d
● Avoids overheads of trap-and-emulate
● DMA by NIC is performed to/from the Guest VM’s own buffers
● Interrupts are handled directly by the Guest instead of the hypervisor
● Effectively, the NIC is owned by the Guest VM
Enabling technologies for virtualized NFs (contd.)
2) Single Root I/O Virtualization (SR-IOV) interface
● An extension to the PCIe specification
● Each PCIe device (Physical Function) is presented as a collection of Virtual Functions
● Practical deployments have 64 VFs per PF
● Each Virtual Function can be assigned to a VM
● Allows higher multi-tenancy and performance isolation
Benefits of SR-IOV
● Allows same physical NIC to be shared by multiple Guest VMs without conflicts
Putting it all together
● Virtual Network Function implementation
  ○ Host machine (NICs, etc.)
  ○ SR-IOV
  ○ VT-d direct access
  ○ Virtual Machine with DPDK driver
    ■ DPDK will be covered in the next lecture
  ○ NFs implemented as a user-space application running inside VM
  ○ DMA from SR-IOV VF directly into VM buffers
Closing headshot “In this lecture, we saw the important role played by network functions in dealing with the vagaries of the wide-area Internet and the dynamics of the evolving needs of an enterprise. Naturally, businesses saw an opportunity to wrap such functions in special-purpose hardware boxes, which came to be referred to as middleboxes since they sat between the enterprise computing and the wide-area Internet. The proliferation of middleboxes and the ensuing network management nightmare has rightly turned the attention towards realizing these network functions as software entities. To make sure that the software entities can run in a platform-agnostic manner, it makes sense to have the network functions execute on top of a virtualization layer. We ended the lecture with a look at example technologies from vendors
Credits for Figures Used in this Presentation
● https://www.howtogeek.com/144269/htg-explains-what-firewalls-actually-do/
● http://ecomputernotes.com/computernetworkingnotes/security/virtual-private-network
● https://avinetworks.com/glossary/hardware-load-balancer/
● https://blogs.it.ox.ac.uk/networks/2014/06/05/linuxs-role-in-the-new-eduroam-infrastructure/
● https://slideplayer.com/slide/10419575/
● https://www.cisco.com/c/en/us/support/security/ips-4240-sensor/model.html
● https://www.cisco.com/c/en/us/support/security/web-security-appliance-s170/model.html
● http://www.artizanetworks.com/resources/tutorials/sae_tec.html
● https://portal.etsi.org/NFV/NFV_White_Paper.pdf
● https://myaut.github.io/dtrace-stap-book/kernel/net.html
● https://software.intel.com/en-us/articles/intel-virtualization-technology-for-directed-io-vt-d-enhancing-intel-platforms-for-efficient-virtualization-of-io-devices
● https://www.intel.sg/content/dam/doc/application-note/pci-sig-sr-iov-primer-sr-iov-technology-paper.pdf
● https://upload.wikimedia.org/wikipedia/commons/thumb/6/65/CPT-NAT-1.svg/660px-CPT-NAT-1.svg.png
Resources
● Network function virtualization: through the looking-glass https://link.springer.com/article/10.1007/s12243-016-0540-9
● http://www.cs.princeton.edu/courses/archive/spr11/cos461/docs/lec11-middleboxes.pdf
● https://portal.etsi.org/NFV/NFV_White_Paper.pdf
● Comparison of Frameworks for High-Performance Packet IO https://www.net.in.tum.de/publications/papers/gallenmueller_ancs2015.pdf
● VT-d: https://software.intel.com/en-us/articles/intel-virtualization-technology-for-directed-io-vt-d-enhancing-intel-platforms-for-efficient-virtualization-of-io-devices
● VT-d: https://www.net.in.tum.de/fileadmin/bibtex/publications/papers/ixy_paper_short_draft1.pdf
● SRIOV: https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/sr-iov-nfv-tech-brief.pdf
● SRIOV: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/sr-iov-architecture
● http://yuba.stanford.edu/~huangty/sigcomm15_preview/mbpreview.pdf
● https://www.iab.org/wp-content/IAB-uploads/2014/12/semi2015_edeline.pdf
System Issues in Cloud Computing Mini-course: Network Function Virtualization KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Lecture 2 - Developing Virtual Network Functions
Opening headshot In the first lecture, we introduced the role played by network functions in the enterprise computing ecosystem. We also identified the need for liberating network functions from vendor-locked middleboxes and implementing them as software entities running on commodity servers. Further, to make such functions portable across platforms, we discussed that virtualizing the network functions so that they can be run on hypervisors is the right approach. In this lecture, we will take an in-depth look at virtual network functions. In particular, we will discuss the issues in developing virtual network functions and the emerging technologies for aiding the performance-conscious development of virtual network functions.
Outline
● Virtual network functions (VNF): revisiting “load balancer” as a concrete example
● Performance issues in implementing virtual network functions
● Performance-conscious implementation of virtual network functions
● Data Plane Development Kit (DPDK) - an exemplar for user-space packet processing
● Implementation of VNF on commodity hardware using DPDK
● Putting it together: load balancer example using DPDK
Virtual Network Functions
● Network functions implemented in user-space on top of hypervisor
● Load balancer as a concrete example
  ○ Keeps a pool of backend service instances (e.g., HTTP Server)
  ○ Distributes incoming packet flows to a specific instance to exploit inherent parallelism in the hardware platform and balance the load across all the service instances
Architecture of a load balancer network function
● Distribute client connections to a pool of backend service instances
  ○ For example HTTP Server
● Use packet’s 5-tuple to choose backend instance
  ○ Provides connection-level affinity
  ○ Same connection is sent to same backend instance
[Figure: the Load Balancer updates and looks up a table mapping packet 5-tuple → backend service instance, and forwards to a Backend Pool (Backend Instance 0, Backend Instance 1, Backend Instance 2)]
Architecture of a load balancer network function
[Figure: packet-processing flow in userspace, with recvfrom() and sendto() crossing into kernel space]
1. Read packet (recvfrom() system call for reading packets into buffer)
2. Extract connection info from header
3. Lookup connection info in table
   ○ If match found: use the recorded backend instance
   ○ If match not found: select backend instance and add to table
4. Send packet to backend instance (sendto() call to send packet out)
Eliminating the overhead of virtualization
● Network function is in the critical path of packet processing
● Need to eliminate the overhead of virtualization
  ○ Intel VT-d allows the NIC to bypass the VMM (i.e., the hypervisor) by directly mapping user-space buffers for DMA and passing the device interrupt directly to the VM above the VMM
● Is that enough?
  ○ Unfortunately no…
  ○ To fully appreciate why
    ■ let’s look at the path of packet processing in an OS like Linux...
Packet Processing in Linux
● NIC uses DMA to write incoming packet to a receive ring buffer allocated to the NIC
● NIC generates an interrupt which is delivered to the OS by the CPU
● OS handles the interrupt, allocates kernel buffer and copies DMA’d packet into the kernel buffer for IP and TCP processing
● After protocol processing, packet payload is copied to application buffer (user-space) for processing by the application
An example networking app on Linux Kernel
● A web server on Linux
● 83% of the CPU time spent in the kernel
● This is not good news even for a networking app such as a web server
● This is REALLY BAD news if the app is a network function...
Source: mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems https://www.usenix.org/node/179774
Network Functions on Linux kernel
● Performance hits
  ○ One interrupt for each incoming packet
  ○ Dynamic memory allocation (packet buffer) on a per packet basis
  ○ Interrupt service time
  ○ Context switch to kernel and then to the application implementing the NF
  ○ Copying packets multiple times
    ■ From DMA buffer to kernel buffer
    ■ Kernel buffer to user-space application buffer
    ■ Note that an NF may or may not need TCP/IP protocol stack traversal in the kernel, depending on its functionality
Circling back to Virtualizing Network Functions...
● Intel VT-d provides the means to bypass the hypervisor and go directly to the VM (i.e., the Guest Kernel, which is usually Linux)
  ○ NF is the app on top of Linux
● Guest kernel presents a new bottleneck
  ○ Specifically for NF applications
  ○ Sole purpose of NF applications is to read/write packets from/to NICs
● Slowdown due to kernel is prohibitive
  ○ Dennard scaling broke down in 2006 => CPU clock frequencies are not increasing significantly from generation to generation => CPU speeds are not keeping up with network speeds
  ○ NICs can handle more packets per second => increasing pressure on CPU
● So it is not sufficient to bypass the VMM for NF virtualization
  ○ We have to bypass the kernel as well...
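To put a number on the pressure NICs place on the CPU, the back-of-the-envelope calculation below estimates the time budget per packet. The 10 Gb/s link rate and the 3 GHz core are assumptions chosen for illustration; the 64-byte minimum frame and the 20 bytes of preamble plus inter-frame gap are standard Ethernet figures.

```python
LINK_BPS = 10e9        # assumed link rate: 10 Gb/s
FRAME_BYTES = 64       # minimum Ethernet frame
OVERHEAD_BYTES = 20    # preamble (8 bytes) + inter-frame gap (12 bytes)
CPU_HZ = 3e9           # assumed core clock: 3 GHz

bits_per_packet = (FRAME_BYTES + OVERHEAD_BYTES) * 8
packets_per_second = LINK_BPS / bits_per_packet
ns_per_packet = 1e9 / packets_per_second
cycles_per_packet = CPU_HZ / packets_per_second

print(f"{packets_per_second / 1e6:.2f} Mpps, "
      f"{ns_per_packet:.1f} ns and {cycles_per_packet:.0f} cycles per packet")
```

Under these assumptions a single core has roughly 67 ns per minimum-size packet, which leaves little room for an interrupt, a context switch, and two copies on every packet.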
Performance-Conscious Packet Processing Alternatives
By-passing the Linux kernel
● Netmap, PF_RING ZC, and Linux Foundation DPDK
These alternatives possess common features
● Rely on polling to read packets instead of interrupts
● Pre-allocate buffers for packets
● Zero-copy packet processing
  ○ NIC uses DMA to write packets into pre-allocated application buffers
● Process packets in batches
Data Plane Development Kit
● Developed by Intel in 2010
  ○ Now an open source project under Linux Foundation (https://www.dpdk.org/)
● Libraries to accelerate packet processing
● Targets wide variety of CPU architectures
● User-space packet processing to avoid overheads of Linux Kernel
Features of DPDK
● Buffers for storing incoming and outgoing packets in userspace memory
  ○ Directly accessed by the NIC DMA
● NIC configuration registers are mapped in user-space memory
  ○ PCIe configuration space https://en.wikipedia.org/wiki/PCI_configuration_space
  ○ Can be modified directly by userspace application
● Effectively bypasses the kernel for interacting with NIC
[Figure: the N/w App with its Transmit buffer, Receive buffer, and Configuration Registers, all located in userspace memory]
DPDK is a user-space library
● Very small component in the kernel
  ○ Used for initialization of userspace packet processing
  ○ https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
● Needed to initialize the NIC to DMA to appropriate memory locations
● Setup memory mapping for configuration registers on the NIC
  ○ PCI Configuration space
  ○ Updating those registers is then
Image source: https://www.slideshare.net/LiorBetzalel/introduction-to-dpdk-and-exploration-of-acceleration-techniques
Poll Mode Driver
● Allows accessing Receive (RX) and Transmit (TX) queues
● Interrupts on packet arrival are disabled
● CPU is always busy polling for packets even if there are no packets to be received
● Receive and Transmit in batches for efficiency
while (true) {
    buff ← bulk_receive(in_port)
    for pkt in buff:
        out_port ← look_up(pkt.header)
        # Handle failed lookup somehow
        out_buffs[out_port].append(pkt)
    for out_port in out_ports:
        bulk_transmit(out_buffs[out_port])
}
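The polling loop above can be made runnable as a small simulation. This is a sketch only: `bulk_receive` and `bulk_transmit` here are plain Python stand-ins written for the example (they are not DPDK calls), packets are dictionaries, a failed lookup is handled by dropping the packet, and the loop stops when the simulated queue drains, whereas a real poll-mode loop spins forever.

```python
from collections import defaultdict, deque

BATCH = 4                                   # packets fetched per poll
rx_queue = deque({"dst": d} for d in ["a", "b", "a", "c", "b"])
forwarding_table = {"a": 0, "b": 1}         # header field -> output port
transmitted = defaultdict(list)             # what each port has sent


def bulk_receive(queue, max_pkts):
    """Take up to max_pkts packets off the receive queue in one call."""
    batch = []
    while queue and len(batch) < max_pkts:
        batch.append(queue.popleft())
    return batch


def bulk_transmit(port, pkts):
    """Hand a whole batch to the output port at once."""
    transmitted[port].extend(pkts)


def poll_once():
    out_buffs = defaultdict(list)
    for pkt in bulk_receive(rx_queue, BATCH):
        port = forwarding_table.get(pkt["dst"])
        if port is None:
            continue                        # failed lookup: drop the packet
        out_buffs[port].append(pkt)
    for port, pkts in out_buffs.items():
        bulk_transmit(port, pkts)


while rx_queue:                             # a real PMD loop never exits
    poll_once()
```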
NIC Ring Buffer
● Each NIC queue is implemented as a ring buffer [Not specific to DPDK, https://www.linuxjournal.com/content/queueing-linux-network-stack]
● Each slot in the ring buffer holds a “descriptor” for a packet
  ○ Descriptor contains a pointer to the actual packet data
    ■ And other metadata
  ○ Actual packet is stored in another buffer data structure
[Figure: ring buffer with a read pointer (advanced when CPU reads packets) and a write pointer (advanced when NIC receives packets)]
NIC Ring Buffer
● Upon packet arrival, NIC populates the next vacant slot with packet’s descriptor
● CPU core running NF polls ring for unread slots
● When new descriptors are found
  ○ CPU reads the packet data for those descriptors
  ○ Returns packets to application
● No need for locking: producer and consumer are decoupled in ring buffer
● If no vacant descriptor slots in ring buffer, NIC drops packets
[Figure: ring buffer with a read pointer (advanced when CPU reads packets) and a write pointer (advanced when NIC receives packets)]
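A single-producer, single-consumer ring of this kind can be sketched as follows. The `DescriptorRing` class and its method names are hypothetical; integers stand in for packet descriptors. The point of the sketch is that the write pointer is touched only by the producer (the NIC) and the read pointer only by the consumer (the polling core), which is why no lock is needed.

```python
class DescriptorRing:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.read_ptr = 0     # advanced only by the CPU (consumer)
        self.write_ptr = 0    # advanced only by the NIC (producer)
        self.dropped = 0

    def nic_receive(self, descriptor):
        """NIC side: fill the next vacant slot, or drop if the ring is full."""
        if self.write_ptr - self.read_ptr == self.capacity:
            self.dropped += 1
            return False
        self.slots[self.write_ptr % self.capacity] = descriptor
        self.write_ptr += 1
        return True

    def cpu_poll(self, max_batch):
        """CPU side: return up to max_batch unread descriptors."""
        batch = []
        while self.read_ptr < self.write_ptr and len(batch) < max_batch:
            batch.append(self.slots[self.read_ptr % self.capacity])
            self.read_ptr += 1
        return batch


ring = DescriptorRing(capacity=4)
accepted = [ring.nic_receive(i) for i in range(6)]   # last two find no slot
first_batch = ring.cpu_poll(max_batch=8)             # frees all four slots
```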
Pre-allocated buffers for storing packets
● Instead of allocating a buffer for each incoming packet, DPDK preallocates multiple buffers on initialization
● Each Rx queue in the NIC can hold no more packets than the capacity of the ring buffer
  ○ Total size of packet buffers is thereby known = capacity of ring
[Figure: each ring slot points to a preallocated buffer for holding incoming packets]
Pre-allocated buffers for storing packets
● Incoming packet is DMA’d into the buffer along with adding new packet descriptor to ring buffer
● DPDK uses hugepages to maintain large pools of memory
  ○ Each page is 2 MB in size (compared to traditional 4 KB pages)
  ○ Fewer pages ⇒ Fewer TLB misses ⇒ improved performance
[Figure: NIC DMAs packet data into a pre-allocated buffer and advances the ring write pointer]
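The "fewer pages" claim is easy to quantify. The snippet below counts how many pages (and hence how many distinct address translations) are needed to cover a buffer pool; the 1 GiB pool size is an assumption picked for illustration, while the 4 KB and 2 MB page sizes are the ones quoted above.

```python
POOL_BYTES = 1 << 30               # assumed packet-buffer pool: 1 GiB
SMALL_PAGE = 4 * 1024              # traditional 4 KB page
HUGE_PAGE = 2 * 1024 * 1024        # 2 MB hugepage

small_pages = POOL_BYTES // SMALL_PAGE
huge_pages = POOL_BYTES // HUGE_PAGE
reduction = small_pages // huge_pages

print(f"4 KB pages: {small_pages}, 2 MB pages: {huge_pages}, "
      f"{reduction}x fewer translations to cache")
```

With hugepages the whole pool is covered by a few hundred translations instead of hundreds of thousands, so far more of it can stay resident in the TLB.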
No overhead of copying packet data
● NIC DMA transfers packets directly to userspace buffers
● Protocol processing (TCP/IP) is done using those buffered packets in place
  ○ …if needed by the network function [Note: not all NFs require TCP/IP processing]
Upshot of NF using DPDK and Intelligent NICs
● All the kernel overheads in packet processing (alluded to earlier) mitigated/eliminated
● Results in performance-conscious implementation of the VNF
● Developer of NF can concentrate on just the functionality of the NF
  ○ DPDK alleviates all the packet processing overheads for any NF
DPDK optimizations
● Various optimization opportunities are available in DPDK to improve packet processing
● Each optimization attempts to eliminate a particular source of performance loss in the Linux kernel and/or to exploit hardware features in NICs and modern CPUs
● Now let’s look at using DPDK to implement NFs on commodity hardware
Implementing NFs using DPDK on commodity H/W
● Modern commodity servers contain multi-core CPUs
● Using multiple cores for packet processing can allow us to match the increasing capacities of NICs
● NUMA servers
  ○ Multiple sockets, each with a given number of cores and local RAM
  ○ Accessing remote RAM is much more expensive than local RAM
Upshot
● Need to carefully design the packet processing path from NIC to NF, taking these hardware trends into account
● Partnership between system software and hardware
DPDK Application Model
1. Run-to-completion model
   ● Polling for incoming packets, processing of the packet, and transmission of the output packet are all done by the same core
   ● Each packet is handled by a unique core
2. Pipelined model
   ● Dedicated cores for polling and processing packets
   ● Inter-core packet transfer using ring buffers
Run-to-completion model
● All cores responsible for both I/O and packet processing
● Simplest model
● Each packet sees only one core
  ○ Works for monolithic packet processing code
  ○ When all the packet processing logic is contained inside a single thread
  ○ Simple to implement but less expressive
[Figure: each core polls the RX queues, processes packets, and inserts them in the TX queues]
Pipelined execution model
● Dedicate cores for processing NF logic
● Some cores are dedicated for reading packets
● Each packet sees multiple cores
  ○ Can be used to chain multiple packet processing logics (within an NF!)
  ○ E.g., IN → Firewall → Router → OUT
● Inter-core communication done using queue buffers in memory
● Also useful when packet processing is CPU bound, so having number of polling cores < number of processing cores is a good choice
  ○ E.g., Intrusion Detection System
[Figure: polling cores read the RX queues and hand packets to separate cores that process packets and write to the TX queues]
Multi-core implementation challenges
● How to ensure that processing done by two distinct cores doesn’t interfere with each other?
  ○ If different packets of the same connection are sent to different cores, sharing NF state will be a nightmare
● How to ensure that cores participating in inter-core communication are on the same NUMA node?
● How to ensure that the cores processing packets are on the same NUMA socket as the NIC?
Receive side scaling: Useful hardware technology
● Enabler of multi-core processing
● Use hashing to distribute incoming packets to individual cores
  ○ Hash function takes 5-tuple of packet as input
  ○ src_ip, dst_ip, src_port, dst_port, proto
● Each core is assigned a unique ring buffer to poll
  ○ No contention among threads
● Different connection ⇒ Different queue (ring) ⇒ Different core
  ○ Per-connection state is accessed only by a single core, so state management is easy
[Figure: a hash function maps incoming packets onto ring buffers, each polled by its own CPU core]
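The essential property of receive side scaling, that every packet of a connection lands on the same queue, can be shown with a short sketch. The `rss_queue` function and the queue count are hypothetical, and CRC32 is used purely as a stand-in hash; NIC hardware typically uses a keyed Toeplitz hash instead.

```python
import zlib

NUM_QUEUES = 4   # assumed: one RX ring (and one polling core) per queue


def rss_queue(src_ip, dst_ip, src_port, dst_port, proto,
              num_queues=NUM_QUEUES):
    """Map a packet's 5-tuple to an RX queue index (stand-in for NIC RSS)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_queues


flow = ("10.0.0.5", "10.0.1.1", 40000, 80, "TCP")
queues_seen = {rss_queue(*flow) for _ in range(100)}   # 100 packets, one flow
```

Because the mapping depends only on the 5-tuple, `queues_seen` holds a single queue index, so the per-connection state for that flow is only ever touched by one core.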
Multi-core support in DPDK
Allows admin to specify the following (hardware/software partnership):
● Allows mapping of specific RX queue to specific CPU core
  ○ Port 0 - Rx queue 1 → CPU core 6
  ○ CPU core 6 → Port 1 - Tx queue 2
  ○ Flexible to create as many queues as admin wants
● Each thread is pinned to a specific core
  ○ To avoid contention
● Each thread/core runs the same code
NUMA awareness in DPDK
● DPDK creates memory pools for inter-core communication on the same NUMA socket as the cores involved
● Ring buffers are allocated on the same socket as the NIC and cores selected for processing
● Remote memory access is minimized
Putting it all together: Load balancer application
● Multi-threaded load balancer
  ○ Run-to-completion model
  ○ Each thread performing identical processing
● Dedicated Rx and Tx queues for each core
Scalable implementation of Load Balancer on Multi-core CPU using DPDK
[Figure: RSS distributes incoming packets across ring buffers, one per CPU core; each core runs a Load Balancer thread with its own connection table (packet 5-tuple → backend service instance)]
Each thread within a core
1. Polling for new packet descriptors (DMA from NIC, which advances the write pointer)
2. Reading packet data for new descriptors (read packet)
3. Extract connection info from header
4. Lookup connection info in table
   ○ If match found: use the recorded backend instance
   ○ If match not found: select backend instance and add to table
5. Send packet to backend instance: write output packet, advance write pointer, DMA to NIC
[Figure labels: Userspace, Kernel space]
Closing headshot In this lecture we saw how to build performance-conscious virtual network functions. The state-of-the-art is to implement network functions on top of Linux Kernel. We identified the source of performance bottleneck with this approach for network function implementation. We discussed the general trend towards accelerating packet processing via mechanisms to bypass the kernel. We used Intel’s DPDK as an exemplar to understand how the kernel overhead could be mitigated. Further, we also saw the optimization opportunities offered by DPDK for exploiting the inherent multi-core capabilities of modern processors. Due to the hardware/software partnership offered by technologies such as VT-d and DPDK, it is now possible to have very performance-conscious user-space implementation of virtual network functions.
Resources
1. Comparison of Frameworks for High-Performance Packet IO https://www.net.in.tum.de/publications/papers/gallenmueller_ancs2015.pdf
2. mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems https://www.usenix.org/node/179774
3. On Kernel-Bypass Networking and Programmable Packet Processing https://medium.com/@penberg/on-kernel-bypass-networking-and-programmable-packet-processing-799609b06898
4. Introduction to DPDK: Architecture and Principles https://blog.selectel.com/introduction-dpdk-architecture-principles/
5. DPDK architecture https://doc.dpdk.org/guides/prog_guide/overview.html
6. A Look at Intel’s Dataplane Development Kit https://www.net.in.tum.de/fileadmin/TUM/NET/NET-2014-08-1/NET-2014-08-1_15.pdf
7. Linux drivers https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
Credits for figures used in this presentation
1. mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems https://www.usenix.org/node/179774
2. Introduction to DPDK and exploration of acceleration techniques https://www.slideshare.net/LiorBetzalel/introduction-to-dpdk-and-exploration-of-acceleration-techniques
3. On Kernel-Bypass Networking and Programmable Packet Processing https://medium.com/@penberg/on-kernel-bypass-networking-and-programmable-packet-processing-799609b06898
4. Dynamic Tracing with DTrace & SystemTap: Network stack https://myaut.github.io/dtrace-stap-book/kernel/net.html
5. Introduction to DPDK: Architecture and Principles https://blog.selectel.com/introduction-dpdk-architecture-principles/
System Issues in Cloud Computing Mini-course: Network Function Virtualization KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Lesson 3 - System architecture for NFV ecosystem
Opening headshot “In the previous two lectures we discussed technologies used to implement an efficient virtual network function (VNF) by removing the overheads of the guest kernel and the hypervisor. However, very rarely does an enterprise use a single NF in isolation (as we discussed in Lecture 1: proliferation of middleboxes). The technologies we have discussed thus far provide no support for multiple NFs to work in tandem. In this lecture, we will see how provisioning computational resources for multiple VNFs via Cloud technologies and orchestrating their deployment using SDN addresses this issue.”
Outline ● Network Functions (NFs) need appropriate management tools (control plane) for fulfilling realistic user scenarios ● Basic components required for building a control plane for NFV ○ Dynamically allocating computational resources for hosting the NFs ■ Elastically scaling the instances of NFs to meet the SLAs for packet processing ○ Dynamically programming the network fabric for packet processing with the NFs
Transition slide Title: Limitations of monolithic software middleboxes
Monolithic software middleboxes are not enough ● NFV products and prototypes tend to be merely virtualized software implementations of products that were previously offered as dedicated hardware appliances. ● Simply virtualizing networking middleboxes using NFV replaces monolithic hardware with monolithic software. ○ Using previously discussed techniques like DPDK, VT-d and SR-IOV ● While this is a valuable first step – as it is expected to lower capital costs and deployment barriers – it fails to
Multiple network functions (NFs) need to be chained together
● Administrator wants to route all HTTP traffic through Firewall → IDS → Proxy
● All remaining traffic should be routed through Firewall → IDS
● A common way of achieving this chain is by manually configuring the routing policy
● Manually configuring this routing policy is complicated
● Upon node failure, routing needs to be manually updated
[Figure: two chains between ingress port and egress port. 10.1.0.0/16, HTTP → *: Firewall → IDS → Proxy. 10.1.0.0/16, rest → *: Firewall → IDS. Deployment shows Firewall, IDS 1, IDS 2 and Proxy for traffic with Src : 10.1.0.0/16, Dst : *]
Dynamic scaling of network functions (NFs)
● Workload on each NF can change dynamically
● Need additional instances to balance the load
 ○ Make use of elastic nature of virtualized infrastructure
 ○ Launch new instances on the fly based on demand
● Stateful NFs require attention to affinity
 ○ Packets of a given flow should be processed by the same NF instance after scaling
 ○ Network functions like Stateful Intrusion Detection System require both directions of traffic => Session affinity
 ■ Both directions of load balancing should work in coordination
 ■ Makes the problem even more
[Figure: before scaling, Flows 1, 2 and 3 go to a single IDS instance. After scaling, old connections “stick” to the old instance (Flows 2 and 3), while Flows 4 and 5 go to the new NF instance]
Dynamic scaling of network functions (contd.) ● Network operators often purchase a customized load balancer for each type of middlebox ○ Requires manual configuration of load balancers when scaling happens ○ Adds another middlebox (load balancer) to the mix, increasing complexity ● Each middlebox (e.g. Firewall) has its own vendor-specific scaling policy ○ When to trigger scale-in/out? ⇒ every vendor re-invents the wheel ○ How to make customized load balancer aware of scale-in/out events? ⇒ done manually
Transition slide title: Need for a NF control plane
NF misconfiguration is a major problem
Examples of failures due to misconfiguration
• Administrators need to train employees on new hardware in cases of hardware upgrades
• Misconfiguring software after an upgrade
• Misconfiguration can be due to incorrect IP addresses, or incorrect routing/load-balancing configuration after scaling NF instances
All of these issues are faced by all middlebox vendors => Need a unified way of configuring and scaling middleboxes
Need a control plane for middlebox management Requirements of system admins: ● Deploy chains of NFs ● Dynamically scale them based on workload Infrastructure: Cluster of servers connected by switches
[Figure: NF chain between ingress port and egress port: Firewall → IDS → Proxy]
Need a control plane for middlebox management Requirements of system admins: ● Deploy chains of NFs ● Dynamically scale them based on workload Infrastructure: Cluster of servers connected by switches NFV Infrastructure Manager
[Figure: the same chain between ingress port and egress port, now with two instances each of Firewall, IDS and Proxy]
Need a control plane for middlebox management
Requirements of system admins:
● Deploy chains of NFs
● Dynamically scale them based on workload
Infrastructure: Cluster of servers connected by switches
● NFV Manager for deployment of NFs on server cluster: e.g. OpenStack
● SDN to program switches for correct traffic forwarding to realize
[Figure: Firewall → IDS → Proxy chain between ingress port and egress port, deployed on the server cluster, with an SDN Controller (e.g. Ryu) programming the switches]
Transition slide Title: Elements of NFV control plane
OpenStack
● Cluster of servers converted to a Cloud-like IaaS service using the following services:
 ○ Nova (compute): mgmt (creation/deletion/migration) of virtual machines by communicating with the hypervisor on each host
 ○ Cinder (block storage): persistent storage for VMs. Handles lifecycle of each block device: creation, attachment to VM, release
 ○ Neutron (network): IP address mgmt, DNS, DHCP, load balancing
 ○ Glance (image): manages VM images
● Allows configuration of the cloud environment through API calls to the specific service
 ○ Create and scale up/down virtual machine instances
 ○ Exposes monitoring metrics of running
[Figure: layered stack, top to bottom: Applications; APIs; OpenStack Services: Nova (Compute), Cinder (Block), Neutron (Network), Glance (Image); Host OS + Hypervisor (KVM/VMware ESXi); Hardware]
NFV + SDN synergy ● SDN allows programming switches to implement custom traffic forwarding paths ○ Helps implement NF chaining with NFs deployed on an arbitrary set of servers ○ Taking topology info as input, SDN can setup forwarding rules for packets from upstream NF to downstream NF ● Switches are programmed using south-bound API of SDN controllers ○ Switches need to be compliant with OpenFlow ○ SDN programs a switch by specifying which port a packet should be forwarded to based on its packet headers ○ Switch can also be programmed to modify packet headers, e.g., change destination MAC address or add VLAN Tag
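The match-then-forward behaviour described above can be illustrated with a small simulation. This is only a sketch of the idea, not the OpenFlow protocol or any controller's API: the `Rule` class, the `apply_flow_table` helper, the port numbers and the MAC address are all hypothetical, and the header names (`in_port`, `tcp_dst`, `eth_dst`) are used as plain dictionary keys.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One flow-table entry: header fields to match and actions to apply."""
    match: dict                     # e.g. {"in_port": 1, "tcp_dst": 80}
    out_port: int                   # port that matching packets are sent to
    set_fields: dict = field(default_factory=dict)  # header rewrites
    priority: int = 0               # higher priority wins on multiple matches


def apply_flow_table(rules, packet):
    """Return (out_port, rewritten_packet), or None on a table miss."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if all(packet.get(k) == v for k, v in rule.match.items()):
            return rule.out_port, {**packet, **rule.set_fields}
    return None


# Hypothetical switch: HTTP traffic arriving on port 1 is steered to port 3
# and its destination MAC is rewritten; all other port-1 traffic uses port 2.
table = [
    Rule({"in_port": 1, "tcp_dst": 80}, out_port=3,
         set_fields={"eth_dst": "aa:aa:aa:aa:aa:03"}, priority=10),
    Rule({"in_port": 1}, out_port=2, priority=1),
]

http_pkt = {"in_port": 1, "tcp_dst": 80, "eth_dst": "ff:ff:ff:ff:ff:ff"}
ssh_pkt = {"in_port": 1, "tcp_dst": 22, "eth_dst": "ff:ff:ff:ff:ff:ff"}
print(apply_flow_table(table, http_pkt))
print(apply_flow_table(table, ssh_pkt))
```

In a real deployment the controller would push equivalent entries to each switch over its south-bound API; the simulation only shows how header matching and header rewriting combine to steer a packet toward the next NF.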
Transition slide Title: NFV control plane architecture
NFV Control Plane ● NFV Infrastructure Manager handles scaling of functions ● SDN controller manages traffic forwarding ● Scaling and forwarding decisions are made by cross-layer orchestrator ● Topology information is needed for deciding traffic forwarding rules
[Figure: NFV control plane. Labels: user policies, cross-layer orchestrator, scaling decisions, NFV Infrastructure Manager, north-bound SDN API, Ryu SDN Controller, monitoring traffic, commands]
NFV Control Plane
[Figure: detailed control plane. Labels: cross-layer orchestrator; NF instantiation; NFV Infrastructure Manager; VM creation; overload event; topology monitoring; traffic forwarding rules; SDN Controller; switch status; Server Agent collecting NF-specific utilization (hooks) and queue sizes from NF1 … NFn on each host (Hypervisor, Host OS, Rx queue, Tx queue)]
NFV Control Plane • NFV Infrastructure Manager and SDN controller are the basic components needed to meet requirements of NFV control plane orchestration • These components are run in coordination by the cross-layer orchestrator. • Now we will discuss the main tasks of the orchestrator and how
Main concerns of cross-layer orchestrator
Virtualization platform tasks
● How to place NFs across the compute cluster?
● How to adapt placement decisions to changing workloads? (Scaling)
Network programming tasks
● How to set up complex forwarding between NFs?
● How to ensure affinity constraints of NFs?
 ○ All packets of a given flow must be processed by the same NF instance
 ○ Packets in both directions of a connection should be processed by
[Figure: chain between ingress port and egress port with two instances each of Firewall, IDS and Proxy]
Virtualization platform tasks
Placement of network functions
1. System admin specifies an NF chain
 ○ Nodes represent NFs or physical ports (network interfaces)
 ○ Edges represent data transfer between nodes. Each edge can be annotated with expected traffic rate
[Figure: NF chain for 10.1.0.0/16, HTTP → *: Ingress port → Firewall → IDS → Proxy → Egress port]
Placement of network functions
1. System admin specifies an NF chain
2. Right Sizing
 Use initial estimates of expected traffic rate to determine the load on an NF, and per-core capacity to determine the number of instances of each NF
[Figure: NF chain provided by administrator (Ingress port → Firewall → IDS → Proxy → Egress port) and the graph after right sizing, with two Firewall instances, three IDS instances and one Proxy instance]
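The right-sizing step can be sketched as a ceiling division of expected load by per-instance capacity. This is a minimal sketch assuming one core per NF instance; the traffic rates and per-core capacities below are made-up numbers chosen only so the result matches the instance counts in the figure.

```python
import math


def right_size(expected_rate_mbps, per_core_capacity_mbps):
    """Number of single-core instances each NF needs for its expected load."""
    return {
        nf: max(1, math.ceil(expected_rate_mbps[nf] / per_core_capacity_mbps[nf]))
        for nf in expected_rate_mbps
    }


# Hypothetical estimates (Mbps) for the Firewall -> IDS -> Proxy chain.
rates = {"Firewall": 1800, "IDS": 1800, "Proxy": 600}
capacity = {"Firewall": 1000, "IDS": 700, "Proxy": 900}

instances = right_size(rates, capacity)
print(instances)
```

Because the initial rates are only estimates, these counts are a starting point; the dynamic scaling mechanism adjusts them once real load is observed.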
Placement of network functions
1. System admin specifies an NF chain
2. Right Sizing
3. Optimizing the NF graph
 Number of edges between stages can be minimized by taking affinity constraints into account. For example: if both Firewall and IDS need flow affinity, we can simplify the graph
[Figure: right-sized graph (two Firewall, three IDS, one Proxy instance) before and after removing unnecessary Firewall-to-IDS edges]
Placement of network functions
1. System admin specifies an NF chain
2. Right Sizing
3. Optimizing the NF graph
4. Placement of instances on compute cluster
 Objective: Minimize inter-server traffic
 i. Intra-server software forwarding is faster
 ii. Inter-server link bandwidth is a limited resource
 Problem can be reduced to graph partitioning problem ⇒ NP-hard
 Iterative local search (graph partitioning heuristic)
[Figure: optimized NF graph partitioned across Server 1 and Server 2]
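One simple way to realize an iterative local search for this partitioning problem is to repeatedly swap pairs of instances between servers and keep a swap only if it lowers inter-server traffic. The sketch below is an illustrative heuristic under that assumption, not the algorithm of any particular system; the instance names, edge weights and initial placement are hypothetical. Swapping (rather than moving) instances keeps the number of instances per server fixed, which stands in for a per-server core limit.

```python
from itertools import combinations


def cut_traffic(edges, placement):
    """Total traffic on edges whose endpoints sit on different servers."""
    return sum(rate for (u, v), rate in edges.items()
               if placement[u] != placement[v])


def local_search(edges, placement):
    """Swap instance pairs across servers while that reduces cut traffic."""
    placement = dict(placement)
    improved = True
    while improved:
        improved = False
        best = cut_traffic(edges, placement)
        for u, v in combinations(sorted(placement), 2):
            if placement[u] == placement[v]:
                continue
            placement[u], placement[v] = placement[v], placement[u]
            cost = cut_traffic(edges, placement)
            if cost < best:
                best, improved = cost, True       # keep the swap
            else:
                placement[u], placement[v] = placement[v], placement[u]  # undo
    return placement


# Hypothetical NF graph: edge weights are expected traffic rates.
edges = {("fw1", "ids1"): 10, ("fw2", "ids2"): 10,
         ("ids1", "proxy"): 2, ("ids2", "proxy"): 2}
initial = {"fw1": "s1", "ids2": "s1", "proxy": "s1", "fw2": "s2", "ids1": "s2"}

final = local_search(edges, initial)
print(cut_traffic(edges, initial), "->", cut_traffic(edges, final))
```

The search stops at a local optimum, which is not guaranteed to be the global one; restarting from several initial placements is a common way to reduce that risk.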
Dynamic scaling of NF instances
● Hypervisor maintains a Server Agent to monitor system metrics
 ○ Queue sizes
 ○ Provide hooks to NFs to report instantaneous load-related metrics, e.g. number of active flows (load balancer), cache hits/misses (proxy)
● Cross-layer orchestrator uses these metrics to determine overload of a particular network function
● Cross-layer orchestrator uses NFV Infrastructure Manager to instantiate a new instance of the NF (i.e., “scale-out”)
● SDN Controller is used to redirect traffic to the new NF instance
[Figure: Server Agent on each host (Hypervisor, Host OS, Rx queue, Tx queue) collecting NF-specific utilization (hooks) and queue sizes from NF1 … NFn]
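A minimal sketch of how reported metrics might be turned into an overload signal: an NF is flagged only when its queue occupancy stays above a threshold for several consecutive samples, so a single burst does not trigger scale-out. The threshold, window length and sample values are assumptions made for illustration; the slides do not specify a detection rule.

```python
def overloaded_nfs(samples, queue_threshold=0.8, window=3):
    """Return NFs whose last `window` queue-occupancy samples all exceed
    `queue_threshold`. `samples` maps NF name -> list of occupancy
    fractions in [0, 1], oldest first."""
    return [
        nf for nf, occupancy in samples.items()
        if len(occupancy) >= window
        and all(x > queue_threshold for x in occupancy[-window:])
    ]


# Hypothetical Rx-queue occupancy reported by a Server Agent.
samples = {
    "IDS": [0.50, 0.85, 0.90, 0.95],       # sustained overload
    "Firewall": [0.90, 0.40, 0.90, 0.90],  # recent dip, not flagged yet
}
print(overloaded_nfs(samples))
```

An orchestrator could react to each flagged NF by requesting a new instance and then asking the SDN controller to steer new flows to it.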
Network programming tasks
How to perform unambiguous forwarding ? ● How to identify whether a packet has traversed only Firewall, or both Firewall and IDS ? ● This determines the destination of that packet
[Figure: the chain 10.1.0.0/16, HTTP → * (Ingress port → Firewall → IDS → Proxy → Egress port) and its physical deployment across Switches S1, S2 and S3, with Ports 1, 2 and 3 marked]
Identify stage of NF chain that a packet belongs to
● Use TCP/IP headers to identify the NF chain that the packet is part of
● To determine which segment of the chain this packet is part of
 ○ Use topology information: e.g., a packet arriving on S1 on Port 2 has traversed Firewall and is destined to IDS
 ○ Tag packets with flow state information: e.g., packets exiting Firewall and those exiting Proxy both enter S3 through port 1
 ■ Can embed chain segment information using dest MAC addr, VLAN tags, MPLS tags or unused fields in IP header
 ■ Tag each packet of a flow on very
[Figure: the chain 10.1.0.0/16, HTTP → * (Ingress port → Firewall → IDS → Proxy → Egress port) deployed across Switches S1, S2 and S3; packets tagged FW and PROXY arrive at S3]
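The two identification methods can be combined in a single lookup: use the arrival port when it determines the chain segment on its own, and fall back to a tag when several segments share a port. The tables and tag names below are hypothetical and loosely follow the S1/S3 examples above; in practice the tag would be carried in a header field such as a VLAN tag.

```python
# (switch, in_port) -> next hop, when the arrival port alone is unambiguous.
BY_PORT = {("S1", 2): "IDS"}

# (switch, in_port, tag) -> next hop, when packets from different chain
# segments arrive on the same port and only the tag tells them apart.
BY_TAG = {
    ("S3", 1, "after-firewall"): "IDS",
    ("S3", 1, "after-proxy"): "egress",
}


def next_hop(switch, in_port, tag=None):
    """Return the next hop for a packet, or None if no rule applies."""
    if (switch, in_port) in BY_PORT:
        return BY_PORT[(switch, in_port)]
    return BY_TAG.get((switch, in_port, tag))


print(next_hop("S1", 2))
print(next_hop("S3", 1, "after-firewall"))
print(next_hop("S3", 1, "after-proxy"))
```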
Forward packets to the correct downstream instance
● Multiple instances of same network function exist for same chain
● Use a high-level policy to determine which NF instance should process this flow/connection (based on affinity)
 ○ Example 1: Round-robin (uniformly distribute flows/connections between NF instances)
 ○ Example 2: Weighted round-robin (distribute
[Figure: chain between ingress port and egress port deployed across Switches S1, S2 and S3, with multiple Firewall, IDS and Proxy instances]
Satisfying affinity requirements of NFs ● Flow-level affinity ○ Requirement: Direct all packets of a particular flow to a particular NF instance ○ Solution: Use OpenFlow packet header matching to identify flow and forward packets ● Connection-level affinity ○ Requirement: Direct all packets of traffic in both directions to a particular NF instance ○ Solution: Maintain a flow to NF instance mapping in the Orchestrator. For the backward direction flow, use the mapping to select the same instance that was used for the forward direction flow
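Connection-level affinity with a round-robin policy can be sketched by normalizing the 5-tuple so that both directions of a connection map to the same key, and remembering the instance chosen for that key. The class and method names are hypothetical, and the table lives in a plain dictionary rather than in an orchestrator.

```python
from itertools import cycle


class AffinityBalancer:
    """Round-robin over NF instances, with both directions of a
    connection pinned to the instance chosen for its first packet."""

    def __init__(self, instances):
        self._next = cycle(instances)
        self._table = {}  # connection key -> NF instance

    @staticmethod
    def connection_key(src_ip, src_port, dst_ip, dst_port, proto):
        # Order the two endpoints so a->b and b->a produce the same key.
        a, b = (src_ip, src_port), (dst_ip, dst_port)
        return (min(a, b), max(a, b), proto)

    def pick(self, src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        key = self.connection_key(src_ip, src_port, dst_ip, dst_port, proto)
        if key not in self._table:
            self._table[key] = next(self._next)
        return self._table[key]


lb = AffinityBalancer(["ids-1", "ids-2"])
forward = lb.pick("10.1.0.5", 40000, "93.184.216.34", 80)
backward = lb.pick("93.184.216.34", 80, "10.1.0.5", 40000)
other = lb.pick("10.1.0.6", 40001, "93.184.216.34", 80)
print(forward, backward, other)
```

Entries are never remapped once created, so a connection keeps its instance for its lifetime; a weighted variant would only change how the next instance is chosen for new connections.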
Transition slide Title: NFs that modify packets
Handle NFs that modify packet headers and payload Many NFs actively modify traffic headers and contents. For example, NATs rewrite the IP addresses of individual packets to map internal and public IPs. Other NFs such as WAN optimizers may spawn new connections and tunnel traffic over persistent connections.
[Figure: Hosts H1, H2 and H3 reach the Internet through a NAT and an IDS]
Goal: Traffic to & from H1 and H3 (green) needs to be processed by IDS
Possible soln: Instrument NFs to expose header/payload transformations Ideally, we would like fine-grained visibility into the processing logic and internal state of each NF to account for such transformations. Long-term solution is to have NFs expose such transformation information to cross-layer orchestrator (e.g., mapping between internal and external IP addresses for NAT) Limitation: Vast array of middleboxes and middlebox vendor + proprietary nature of middlebox functionality ⇒ achieving standardized APIs and requiring vendors to expose internal states does not appear to be a viable near-term solution
Types of packet modification by middleboxes 1. No packet modification a. Firewall, IDS, IPS b. TCP/IP headers match exactly
2. Header modification a. NAT, Load balancer b. Packet payload matches exactly c. Packet transformation occurs at the timescale of each flow
3. Payload modification a. WAN optimization, HTTP proxy (change HTTP header) b. High correlation in packet payload c. Packet transformation occurs at the timescale of each session
4. Complex modifications a. Encrypted VPN, compression
Flow correlation
● Packet collection at SDN Controller
 ○ For each new flow coming into the middlebox, collect first P packets
 ○ Collect first P packets for a time window W for all flows going out of the middlebox
● Calculate payload similarity
 ○ For each (ingress, egress) pair of flows compute a similarity score
 ○ Use Rabin fingerprint to hash chunks of packet stream to detect overlapping content
● Identify most similar flow
 ○ Identify egress flow with highest similarity score with the ingress flow
● No modification to middleboxes required
● But flow correlation is not 100% accurate
● Does not support complex packet modifications
[Figure: packets p1, p2, p3 from H1 and q1, q2, q3 from H2 enter the NAT and leave as p1’, p2’, p3’ and q1’, q2’, q3’. The SDN Controller performs packet collection over T milliseconds, flow correlation, and rule installation of the forwarding rule]
In the example above, P=2 and W=T
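The similarity step can be sketched as follows. For brevity this uses a Rabin-Karp style rolling polynomial hash over fixed-size byte windows as a stand-in for true Rabin fingerprints, and Jaccard overlap of the fingerprint sets as the similarity score; both choices, as well as the window size and the sample payloads, are assumptions for illustration. Each payload stands for the concatenated payloads of the first P packets of a flow.

```python
def fingerprints(payload, window=8, base=257, mod=(1 << 61) - 1):
    """Set of rolling hashes, one per `window`-byte chunk of the stream."""
    if len(payload) < window:
        return set()
    high = pow(base, window - 1, mod)   # weight of the byte leaving the window
    h = 0
    for byte in payload[:window]:
        h = (h * base + byte) % mod
    result = {h}
    for i in range(window, len(payload)):
        h = ((h - payload[i - window] * high) * base + payload[i]) % mod
        result.add(h)
    return result


def similarity(a, b):
    """Jaccard overlap of the two streams' fingerprint sets, in [0, 1]."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def correlate(ingress, egress):
    """Map each ingress flow to the egress flow with the highest score."""
    return {i: max(egress, key=lambda e: similarity(ingress[i], egress[e]))
            for i in ingress}


# Hypothetical payloads: a NAT rewrites headers but leaves payloads intact.
ingress = {"H1": b"GET /index.html HTTP/1.1 host a",
           "H2": b"POST /upload HTTP/1.1 host b data"}
egress = {"e1": b"POST /upload HTTP/1.1 host b data",
          "e2": b"GET /index.html HTTP/1.1 host a"}
print(correlate(ingress, egress))
```

The result is a best guess rather than a guarantee, which is why two flows with near-identical payloads can be confused with each other.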
FlowTags
Allow middleboxes to “tag” packets based on the internal context
● Middleboxes generate tags for each flow based on the processing context
● Use the SDN controller to keep mapping from Tag to original IP header (in figure on right, only src_ip is assumed to be part of header)
● Requires middlebox modification to follow FlowTags API
[Figure: packets p1, p2, p3 from H1 and q1, q2, q3 from H2 enter the NAT, which reports GENERATE_TAG 1 ⇒ {src_ip=H1} and GENERATE_TAG 2 ⇒ {src_ip=H2} to the NFV Orchestrator. Outgoing packets p1’, p2’, p3’ carry tag 1 and q1’, q2’, q3’ carry tag 2. Forwarding rules (TAG → FWD to): 1 → IDS, 2 → Internet]
FlowTags (contd.)
[Figure: same setup as the previous slide. The IDS receives the packets tagged 1 (p1’, p2’, p3’) and issues CONSUME_TAG 1 ⇒ {src_ip=H1} to the NFV Orchestrator to recover the original header, then forwards them to the Internet. Forwarding rules (TAG → FWD to): 1 → IDS, 2 → Internet]
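The tag bookkeeping in the figure can be sketched as a small table kept by the controller. The method names mirror the GENERATE_TAG and CONSUME_TAG labels on the slide, but the class, its signatures and the forwarding dictionary are hypothetical and are not the actual FlowTags API.

```python
class TagController:
    """Keeps the mapping between tags and the original header fields."""

    def __init__(self):
        self._tag_to_header = {}
        self._header_to_tag = {}

    def generate_tag(self, header):
        """Called by an upstream middlebox (e.g. NAT) before rewriting."""
        key = tuple(sorted(header.items()))
        if key not in self._header_to_tag:
            tag = len(self._tag_to_header) + 1
            self._header_to_tag[key] = tag
            self._tag_to_header[tag] = dict(header)
        return self._header_to_tag[key]

    def consume_tag(self, tag):
        """Called by a downstream middlebox (e.g. IDS) to recover context."""
        return self._tag_to_header[tag]


controller = TagController()
tag_h1 = controller.generate_tag({"src_ip": "H1"})
tag_h2 = controller.generate_tag({"src_ip": "H2"})

# Switch forwarding table keyed on the tag, as in the figure.
forward_to = {tag_h1: "IDS", tag_h2: "Internet"}

print(tag_h1, forward_to[tag_h1], controller.consume_tag(tag_h1))
print(tag_h2, forward_to[tag_h2])
```

The switch never needs to understand what the NAT did to the header; it forwards on the tag alone, and the IDS looks up the original source when it needs it.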
Transition slide Title: Putting them all together
NFV Control Plane: Putting it all together
[Figure: complete control plane. Labels: cross-layer orchestrator; NF instantiation; NFV Infrastructure Manager; VM creation; overload event; topology monitoring; FlowTags API; traffic forwarding rules; SDN Controller; switch status / new flows; Server Agent collecting NF-specific utilization (hooks) and queue sizes from NF1 … NFn on each host (Hypervisor, Host OS, Rx queue, Tx queue)]
NFV Control Plane: Putting it all together ● End-to-end system providing scalable NF chains has 2 components ○ Virtual Infrastructure Manager: E.g. OpenStack ■ Virtualizes a cluster of servers ■ Provides horizontal (scale-in/out) elasticity ○ SDN-controlled network ■ Allows programming network datapath for complex routing policies ● The cross-layer orchestrator combines the functionalities of both ○ Efficient placement of NF instances ○ Trigger scaling in/out of instances based on observed workload ○ Set up complex NF chains
Resources
1. Sherry, Justine, et al. "Making middleboxes someone else's problem: network processing as a cloud service." ACM SIGCOMM Computer Communication Review 42.4 (2012): 13-24.
2. Palkar, Shoumik, et al. "E2: a framework for NFV applications." Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 2015.
3. Cao, Lianjie, et al. "Data-driven resource flexing for network functions virtualization." Proceedings of the 2018 Symposium on Architectures for Networking and Communications Systems. 2018.
4. Qazi, Zafar Ayyub, et al. "SIMPLE-fying middlebox policy enforcement using SDN." ACM SIGCOMM Computer Communication Review. Vol. 43. No. 4. ACM, 2013.
5. Fayazbakhsh, Seyed Kaveh, et al. "Enforcing network-wide policies in the presence of dynamic middlebox actions using FlowTags." 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). 2014.
6. Metron: NFV Service Chains at the True Speed of the Underlying Hardware
7. Embark: Securely Outsourcing Middleboxes to the Cloud
8. Dynamic Service Chaining with Dysco
9. U-HAUL: Efficient State Migration in NFV
10. HyperNF: building a high performance, high utilization and fair NFV platform
11. Virtual Network Functions Instantiation on SDN Switches for Policy-Aware Traffic Steering
12. ORCA: an ORChestration automata for configuring VNFs
13. FERO: Fast and Efficient Resource Orchestrator for a Data Plane Built on Docker and DPDK
14. Merlin: A Language for Managing Network Resources
15. Elastic Scaling of Stateful Network Functions
16. Scaling Up Clustered Network Appliances with ScaleBricks
17. Stratos: A Network-Aware Orchestration Layer for Virtual Middleboxes in Clouds
Credits for figures
1. https://www.unixarena.com/2015/08/openstack-architecture-and-componentsoverview.html/
2. https://dl.acm.org/citation.cfm?id=2486022
System Issues in Cloud Computing Mini-course: Network Function Virtualization KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Lesson 4 - Deploying Virtualized Network Functions in managed Cloud infrastructures
Headshot Middleboxes were typically deployed on premises, which led to the practice of deploying NFV applications on on-premise clusters too. However, the growth of the public cloud makes migrating these applications a way for enterprises to reduce expenditure as well as minimize manual management of the NFV applications. This is similar to the transition that enterprises went through when they replaced their on-premise IT infrastructure with managed cloud services. In this lecture we will discuss techniques that enable offloading NFV workloads to a managed cloud, as well as other developments in the telecommunications industry that make offloading NFV workloads viable.
Outline
● Benefits of using managed cloud services for hosting enterprise’s NFs
● Techniques for offloading network functions to managed cloud
● Observed performance of NF offloading
● Mobile edge computing (MEC) for enabling efficient offloading of NFs
 ○ Related initiative: OpenCORD
● Cloud-RAN (Radio Access Network) as another use-case for NFV on MEC
Offloading middlebox processing to the cloud
[Figure: left, traditional on-premise NFs between the enterprise network and the Internet; right, NF processing offloaded to cloud, with traffic redirection between the enterprise network, the cloud datacenter and the Internet]
Why offload NF processing to the cloud? ● Leverage economy of scale to cut costs ● Simplify management ○ No need for training personnel ○ Upgrades are handled by cloud provider ○ Low-level configuration of NFs is replaced by policy configurations ■ Avoid failures due to misconfiguration ● Elastic scaling ○ Scale in/out works much better on cloud than on premises ■ Avoid failures due to overload
Transition
Important questions to answer ● How is the redirection implemented ? ○ Functional equivalence needs to be maintained ○ Latency should not be inflated ● How to choose cloud provider to offload to ? ○ Dependent on cloud provider’s geographical resource footprint
Bounce redirection ● Simplest form of redirection ● Tunnel ingress and egress traffic to the cloud service ● Benefit: Does not require any modification to the enterprise or the client applications ● Drawback: Extra round trip to the cloud ○ Can be feasible if cloud Point-of-Presence is located close to enterprise
[Figure: bounce redirection through the Enterprise Gateway]
Average latency between Georgia Tech campus and Microsoft Azure regions
Region | Average latency (ms)
East US (Virginia) | 24
East US 2 (Virginia) | 28
North Central US (Illinois) | 37
Source : https://www.azurespeed.com/Azure/Latency
IP Redirection
● Save extra round-trip by sending client traffic directly to the cloud service
● Cloud service announces IP prefix on behalf of the enterprise
● Drawback: multiple Points of Presence (PoPs)
 ○ Cannot ensure that the same PoP receives both flows a→b and b→a (there is no guarantee about which PoP ends up receiving a client’s packets since all PoPs advertise the same IP address range)
 ○ Since traffic is directed using
[Figure: IP redirection via the Enterprise Gateway]
DNS-based redirection ● Cloud provider runs DNS resolution on behalf of enterprise ● Enterprise can send reverse traffic through the same cloud PoP as forward traffic ○ Gateway looks up destination Cloud PoP’s IP address in DNS Service ● Drawback : loss of backwards compatibility ○ Legacy enterprise applications expose IP
Enterprise GW
Smart redirection
For each client c and enterprise site e, choose the cloud PoP P*(c, e) such that
P*(c, e) = arg min_P [Latency(P, c) + Latency(P, e)]
- Requires the enterprise gateway to maintain multiple tunnels, one to each participating PoP
- Cloud service computes estimated latencies between PoPs and clients/enterprises using IP address information
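The PoP selection rule is a direct arg min over candidate PoPs. The sketch below assumes latency estimates are already available in a dictionary keyed by (PoP, endpoint); the PoP names, endpoints and latency values are invented for illustration.

```python
def best_pop(latency_ms, client, enterprise, pops):
    """PoP minimizing Latency(P, client) + Latency(P, enterprise)."""
    return min(pops, key=lambda p: latency_ms[(p, client)]
                                   + latency_ms[(p, enterprise)])


# Hypothetical latency estimates in milliseconds.
pops = ["pop-east", "pop-central"]
latency_ms = {
    ("pop-east", "c1"): 10, ("pop-central", "c1"): 30,
    ("pop-east", "c2"): 45, ("pop-central", "c2"): 5,
    ("pop-east", "e1"): 24, ("pop-central", "e1"): 37,
}

print(best_pop(latency_ms, "c1", "e1", pops))  # 10 + 24 vs 30 + 37
print(best_pop(latency_ms, "c2", "e1", pops))  # 45 + 24 vs 5 + 37
```

Because the best PoP depends on the client as well as the enterprise site, different clients of the same site can end up on different PoPs, which is why the gateway has to keep a tunnel open to each of them.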
Transition
Latency inflation due to redirection ● Original latency = Host 1 → Host 2 ● Inflated latency = Host 1 → Cloud PoP → Host 2 ● More than 30% of host pairs have Inflated latency < original latency ○ Triangle-inequality is violated in inter-domain routing ○ Cloud providers are well connected to tier-1/tier-2 ISPs
What about bandwidth savings?
● Middleboxes like Web Proxy and WAN accelerator are used to limit the WAN bandwidth used by the enterprise
 ○ HTTP Proxy limits WAN bandwidth usage by caching web pages
● If we move them to the cloud, the enterprise’s WAN bandwidth usage becomes high
● Safest solution is to not migrate those types of middleboxes
[Figure: HTTP Proxy, NAT and FW inside the Enterprise Network, compared with the same NFs in the Public Cloud behind the Enterprise Gateway, which causes high WAN bandwidth consumption]
What about bandwidth savings?
Solution: Use general-purpose traffic compression in Cloud-NFV gateway
● Protocol-agnostic compression technique achieves bandwidth savings similar to those of the original middlebox
● WAN bandwidth usage reduced by compressing outgoing traffic
[Figure: HTTP Proxy, NAT and FW inside the Enterprise Network, compared with the same NFs in the Public Cloud, with traffic compression/decompression at both the cloud side and the Enterprise Gateway]
Transition
Which cloud provider to select? ● Amazon-like footprint ○ Few large Points-of-Presence (PoPs) ● Akamai-like footprint ○ Large number of small PoPs ● Emerging “edge-computing providers”
Telecom providers are ideal for edge computing
● Telecommunication providers like AT&T and Verizon possess a geographical footprint much denser than AWS or Akamai ● Residential Broadband service providers use functions like virtual Broadband Network Gateway (vBNG) ○ To provide residential broadband users with services like subscriber management, policy/QoS management, DNS, routing ○ Service providers also offer services like Video-on-Demand CDN, virtual Set Top Box https://wiki.onap.org/pages/viewpage.action?pageId=3246168 ● Such services are deployed close to the subscribers ○ These compute resources are potential candidates for offloading
Transition
OpenCORD initiative
● Telecommunication providers own Central Offices
 ○ Contain switching equipment
● OpenCORD : Central Office Re-architected as a Datacenter
● Setting up central offices with general purpose servers
● Provides infrastructure services
 ○ Deploy their own network functions
 ○ For 3rd parties to deploy NFV functions
● Allow enterprises to host network functions on virtualized hardware
 ○ Colocated with telecom provider’s network functions
● This becomes a candidate realization of mobile edge computing
[Figure: Location of Central Offices around Atlanta (potential fog location candidates)]
Remote sites require illusion of homogeneous network ● Organizations like Chick Fil A or Honeywell have geo-distributed sites ○ Each site needs multiple network services ○ Firewalls, IDS, Deep Packet Inspection, HTTP Proxy, WAN optimizer ● Used to be implemented on custom hardware on-premise ○ Can be offloaded to a managed service
Virtualized customer premise equipment Virtual CPE (Customer Premise Equipment) ● Serves as a gateway for multiple parts of an Enterprise Network to connect to each other ● Placed in Edge PoP or Centralized datacenter ● An industry solution for migrating NFV to a cloud service
Transition
NFs in Cellular Networks Another ongoing evolution that is moving NFV to managed infrastructure ● Converting raw cellular packets to IP-ready packets ● Different from earlier middleboxes (which were meant for IP packets)
Building blocks of a cellular network
● Access Network
 ○ Consists of base stations (evolved NodeB - eNodeB)
 ○ Acts as interface between end-users (User Equipment/UE) and Core Network
 ○ MAC scheduling for Uplink and Downlink traffic
 ○ Header compression and user-data encryption
 ○ Inter-cell Radio Resource Management
● Core network
 ○ Mobility control → making cell-tower handoff decisions for each user (Mobility Management Entity - MME)
 ○ Internet access → IP address assignment and QoS enforcement (Packet Data Network Gateway - P-
[Figure: Access Network connected to the Core Network (MME, S-GW, P-GW), which connects to the Internet; control plane traffic and data plane traffic are drawn separately]
Traditional Radio-Access Networks ● Packet processing in the access network performs 2 types of tasks ○ Analog radio function processing (RF processing) : ■ Digital-to-analog / analog-to-digital conversion ■ Filtering and amplification of signal ○ Digital signal processing (Baseband processing) : ■ L1, L2 and L3 functionality
Evolution of base stations so far
● 1G and 2G networks: RF and Baseband processing are co-located in one unit (inside a base station).
● 3G and 4G networks: Functions split between Remote Radio Head (RRH) and BaseBand Unit (BBU). BBU is typically located within 20-40 kms away from
[Figure: 1G and 2G base station and 3G and 4G base station side by side. Labels: Antenna, Coaxial cable, RF, Baseband, RRH, Fiber; each design connects to Core Network (S1 interface) and to other base stations (X2 interface)]
Benefits/Limitations of 3G/4G design
Benefits:
● Lower power consumption since RF functionality can be placed on poles/rooftops ⇒ efficient cooling
● Multiple BBUs can be placed together in a convenient location ⇒ cheaper maintenance
● One BBU can serve multiple RRHs
Limitations:
● Static RRH-to-BBU assignment ⇒ Resource underutilization
● BBUs are implemented as specialized hardware ⇒ Poor scalability and failure handling
[Figure: 3G and 4G networks. Antenna with RF in the RRH, connected to the Baseband unit, which connects to Core Network (S1 interface) and to other base stations (X2 interface)]
Transition
Cloud Radio Access Network
● Virtualizes the BBUs in a BBU Pool
● Base-band Unit now implemented as software running on general purpose servers
● Allows elastic scaling of BBUs based on current workload
● BBU-RRH assignment is dynamic, leading to higher resource utilization
[Figure: four RRHs (RF) connected to a Virtualized BBU Pool containing multiple Baseband instances]
Location of virtual BBU Pool? ● Splitting Radio Function and Base Band processing poses stringent requirements on connecting links ○ Low latency ○ Low jitter ○ High throughput ● Need compute capacity in physical proximity of deployed base stations ○ Geo-distributed computing infrastructure ○ Virtualization support required for scalable network processing
The complete picture
1. Cellular network providers set up geo-distributed MEC capacity
2. C-RAN functions are deployed on MEC servers
3. MEC capacity is made available for enterprises to offload their NFs
Coming together of IP level network functions and the RAN level NFs
[Figure: a virtualization layer hosting vBNG and Baseband functions alongside HTTP Proxy, NAT and FW from an enterprise offloading middleboxes, connected through the Core Network to the Internet]
Resources
1. Sherry, Justine, et al. "Making middleboxes someone else's problem: network processing as a cloud service." ACM SIGCOMM Computer Communication Review 42.4 (2012): 13-24.
2. MEC Deployments in 4G and Evolution Towards 5G https://www.etsi.org/images/files/etsiwhitepapers/etsi_wp24_mec_deployment_in_4g_5g_final.pdf
3. Checko, Aleksandra, et al. "Cloud RAN for mobile networks—A technology overview." IEEE Communications surveys & tutorials 17.1 (2014): 405-426.
4. CORD - Wiki Home https://wiki.opencord.org/display/CORD/Documentation
5. Virtualizing Customer Premises With Service Function Chaining https://www.opnfv.org/wp-content/uploads/sites/12/2016/11/opnfv_odl_vcpe_sfc_brief.pdf
Credits for figures
1. Sherry, Justine, et al. "Making middleboxes someone else's problem: network processing as a cloud service." ACM SIGCOMM Computer Communication Review 42.4 (2012): 13-24.
2. Virtualizing Customer Premises With Service Function Chaining (Accessed : 03/11/2020) https://www.opnfv.org/wp-content/uploads/sites/12/2016/11/opnfv_odl_vcpe_sfc_brief.pdf
Closing Headshot That concludes our mini course on NFV. NFV evolved as a cousin of SDN and now occupies its own position due to the ubiquity and necessity of NFs in the enterprise IT landscape.
NFV Workshop 1: Introduction to Virtual Network Functions
Abstract In this workshop, you will get introduced to Docker and Open vSwitch as the fundamental technologies for creating virtual network topologies. Next, you will get introduced to creating Firewall and Network Address Translator (NAT) network functions in software. You will learn how network functions are typically deployed in an enterprise network. Sections 2 and 4 provide relevant background and usage respectively of Docker and Open vSwitch. An experienced student may skip those and jump straight to Section 5 which describes the tasks to be completed in the workshop.
1
Expected outcome
In this workshop, the student will learn about:
Part A
1. Using Docker as a virtualization tool to create virtual hosts
2. Using Open vSwitch to create a virtual network topology
3. Connecting the virtual hosts to virtual switches and testing connectivity
Part B
4. Using the iptables and route Linux utilities to implement a Firewall and a NAT
This workshop and all subsequent ones need to be implemented on a Linux host. All workshops have been tested on an Ubuntu 18.04 virtual machine on VirtualBox.
2
Background
This section provides a basic background about Docker and Open vSwitch.
2.1
Docker
Docker provides operating-system-level virtualization for developing and deploying software in packages called containers. Docker containers are isolated from each other and bundle their own software, libraries, and configuration files, and are therefore platform independent. Containers use the kernel of the host machine and are therefore much more lightweight than virtual machines. A container is deployed from an image that specifies the details of the container. An image is usually built by extending a base image (e.g. ubuntu-trusty) with specific configurations. Install Docker on the Linux host by following the instructions at this link: How To Install and Use Docker on Ubuntu 18.04 (see footnote 1). After installing Docker, start the Docker service with the following command.
$ sudo systemctl start docker
2.2
Open vSwitch
Open vSwitch is a multilayer software switch licensed under the open source Apache 2 license. The software aims to provide a production-quality switch platform that supports standard management interfaces and opens the forwarding functions to programmatic extension and control. Install Open vSwitch on the Linux host using the following command.
$ sudo apt-get install bridge-utils openvswitch-switch
After installing Open vSwitch, start the Open vSwitch service with the following command.
$ sudo /usr/share/openvswitch/scripts/ovs-ctl start
3
Download Repo
The workshop repo contains Dockerfiles for this workshop.
$ git clone https://github.gatech.edu/cs8803-SIC/workshop10.git
1 https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-18-04
4
Basic functionalities
This section describes the basic command-line tools for using Docker and Open vSwitch for the purpose of this workshop.
4.1
Running a vanilla Docker container
Execute the following command to launch a Docker container.
$ docker run -it ubuntu:trusty sh
With this command, we instruct the Docker runtime to
1. Launch a container based on the "ubuntu:trusty" image
2. Launch the "sh" shell in the container
3. Attach to an interactive tty in the container using the "-it" flag
Exiting from the terminal kills the "sh" shell, which is the main process of the container. The Docker runtime then stops the container.
4.2
Creating a custom Docker image
Now we will extend the base "ubuntu" image to include additional tools. The recipe of a Docker image is contained in a Dockerfile, whose contents are shown below. The file's name should just be "Dockerfile" with no filename extension.
# Specify the base image that this image will be based on
FROM ubuntu:trusty
# Use the RUN keyword to specify changes to be made to the base image
RUN apt-get update && apt-get install -y iperf3
RUN apt-get install -y tcpdump && mv /usr/sbin/tcpdump /usr/bin/tcpdump
RUN apt-get install -y hping3
Now create an image named "endpoint" using the created Dockerfile. Navigate to the directory containing the Dockerfile before executing the following command, which by default uses the file named "Dockerfile" in that directory. Note that you can specify a Dockerfile with a different name using the -f flag.
$ docker build -t endpoint:latest .
4.3
Executing a command on running container
The following command shows how to generate 3 ICMP requests to the host 192.168.1.2 from the container "h1".
$ docker exec -it h1 ping -c 3 192.168.1.2
The "-it" options make the command execute interactively instead of in the background. This form can be used to run a single command on a running Docker container. For more information check out this page: docker exec (see footnote 2). You can also launch a shell on a container as follows.
$ docker exec -it h1 bash
The above command opens a new bash shell in the container, which can be used to execute commands on the container just as you would on a bare-metal machine.
4.4
Creating a virtual switch
In the following command we use Open vSwitch's configuration utility to create a new virtual switch (also called a bridge).
$ sudo ovs-vsctl add-br ovs-br1
4.5
Connecting a Docker container to virtual switch
In the following command we add a port "eth0" on container "cnt" with the IP address "192.168.1.2/24" that connects to the virtual switch "ovs-br1". When adding interfaces to containers, take note of the interface names, as they will be used later in the workshop.
$ sudo ovs-docker add-port ovs-br1 eth0 cnt --ipaddress="192.168.1.2/24"
Figure 1: Schematic of network topology for this exercise.
5
Workshop Specification
5.1
Part A : Building a simple virtual network topology
In this workshop you are first required to implement the network topology shown in Figure 1. You can refer to Section 4 for hints on the commands used to complete the required tasks. The target topology consists of two Docker containers connected to each other through an Open vSwitch bridge. First, create a Docker image using the Dockerfile at this link (see footnote 3). Use the tag "endpoint:latest" for the created image.
$ docker build -f Dockerfile.endpoint -t endpoint:latest .
Then create containers using that image. The following command shows how to run a container with the arguments suited to this workshop.
$ docker run -d --privileged --name=int --net=none endpoint:latest tail -f /dev/null
1. The default networking needs to be disabled in Docker containers using the "--net=none" argument.
2. The container should be run with the "--privileged" flag.
3. The "-d" flag makes the container run as a daemon.
4. The container should stay alive for the duration of the experiment. Hence the container is instructed to follow the contents of the file /dev/null; following an empty file never terminates, so the container stays alive.
After setting up the network topology, verify that the Docker containers are able to communicate with each other through the switch.
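The Part A steps can be collected into a small command generator before turning them into a shell script. The following is only a sketch under stated assumptions: the container names h1 and h2 and their addresses are hypothetical placeholders (the real ones come from Figure 1), and the function only builds the command lines used earlier in this handout without running them.

```python
import shlex

BRIDGE = "ovs-br1"
# Hypothetical container names and addresses; take the real ones from Figure 1.
HOSTS = {"h1": "192.168.1.2/24", "h2": "192.168.1.3/24"}


def part_a_commands(bridge=BRIDGE, hosts=HOSTS):
    """Return the Part A setup commands, in execution order, as argument lists."""
    cmds = [["sudo", "ovs-vsctl", "add-br", bridge]]
    # Start every container first, with Docker's default networking disabled.
    for name in hosts:
        cmds.append(["docker", "run", "-d", "--privileged", f"--name={name}",
                     "--net=none", "endpoint:latest", "tail", "-f", "/dev/null"])
    # Then attach each container to the bridge with its IP address.
    for name, cidr in hosts.items():
        cmds.append(["sudo", "ovs-docker", "add-port", bridge, "eth0", name,
                     f"--ipaddress={cidr}"])
    return cmds


def as_shell_line(cmd):
    """Quote one argument list so it can be pasted into a shell script."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    for cmd in part_a_commands():
        print(as_shell_line(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run on any machine; the printed lines can be reviewed and then placed in the setup script.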
5.2
Part B : Creating network functions in software
The next part of this workshop is subdivided into two parts: (1) Firewall and (2) NAT. In both subparts, the topology has two end hosts connected through a network function (also called a middlebox). In the first subpart the middlebox is a Firewall, while in the second it is a NAT. The end hosts and the middlebox perform different functions, so we create two different images for them.
5.2.1
Create Docker image of end hosts
The endpoint hosts will be implemented as Docker containers. The Dockerfile for their image can be found here: Endpoint Docker (see footnote 4). Note that this is the same Dockerfile used in Part A. Use the following command to create the Docker image named "endpoint".
$ docker build -f Dockerfile.endpoint -t endpoint:latest .
5.2.2
Create Docker image for the middlebox
The network function will also be implemented as a Docker container. For this workshop, we will use the same Docker image for both network functions and selectively configure the container to behave either as a NAT or as a firewall. The Dockerfile for the image can be found here: Middlebox Docker (see footnote 5). Use the following command to create the Docker image named "nf".
$ docker build -f Dockerfile.nf -t nf:latest .
2 https://docs.docker.com/engine/reference/commandline/exec/
3 https://github.gatech.edu/cs8803-SIC/workshop10/blob/master/Dockerfile.endpoint
4 https://github.gatech.edu/cs8803-SIC/workshop10/blob/master/Dockerfile.endpoint
5 https://github.gatech.edu/cs8803-SIC/workshop10/blob/master/Dockerfile.nf
5.2.3
Firewall
Implement the network topology shown in Figure 2.
1. Create Open vSwitch virtual switches (bridges).
$ sudo ovs-vsctl add-br ovs-br1
Create the second bridge, ovs-br2, in the same way.
2. Create Docker containers for the endpoints.
$ docker run -d --privileged --name=int --net=none endpoint:latest tail -f /dev/null
$ docker run -d --privileged --name=ext --net=none endpoint:latest tail -f /dev/null
3. Create a Docker container for the network function.
$ docker run -d --privileged --name=fw --net=none nf:latest tail -f /dev/null
4. Connect the endpoints to the OVS bridges.
$ sudo ovs-docker add-port ovs-br1 eth0 int --ipaddress=192.168.1.2/24
Similarly, connect the container ext (which represents the external host) to the switch ovs-br2 and assign it the IP address shown in Figure 2. Likewise, create two ports on the fw container with the appropriate IP addresses shown in Figure 2.
5. Now set up the routes on the endpoint hosts.
$ docker exec int route add -net 145.12.131.0/24 gw 192.168.1.1 dev eth0
$ docker exec ext route add -net 192.168.1.0/24 gw 145.12.131.74 dev eth0
These routes tell each endpoint host how to send packets to the other endpoint.
6. Enable network forwarding on the network function container.
$ docker exec fw sysctl net.ipv4.ip_forward=1
7. Configure the Firewall to block incoming packets with destination port 22 (incoming SSH connections).
$ docker exec fw iptables -A FORWARD -i eth1 -p tcp --destination-port 22 -j DROP
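To reason about which test flows the rule in step 7 should block, it can help to model the FORWARD chain as a first-match rule list. The sketch below is a toy model, not iptables itself: the Packet and Rule classes are hypothetical, and it assumes the chain policy inside the container is left at ACCEPT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    in_iface: str      # interface on which the middlebox received the packet
    proto: str         # "tcp", "udp", "icmp", ...
    dst_port: int = 0  # 0 for protocols without ports


@dataclass(frozen=True)
class Rule:
    in_iface: str
    proto: str
    dst_port: int
    target: str        # "DROP" or "ACCEPT"

    def matches(self, pkt):
        return (pkt.in_iface == self.in_iface
                and pkt.proto == self.proto
                and pkt.dst_port == self.dst_port)


def forward_verdict(pkt, rules, policy="ACCEPT"):
    """First matching rule decides; otherwise the (assumed) chain policy applies."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.target
    return policy


# Mirrors: iptables -A FORWARD -i eth1 -p tcp --destination-port 22 -j DROP
FORWARD = [Rule("eth1", "tcp", 22, "DROP")]
```

Under this model, `forward_verdict(Packet("eth1", "tcp", 22), FORWARD)` yields "DROP", while TCP traffic to port 80 and ICMP pings fall through to the policy and are accepted, which matches the behaviour the workshop asks you to observe with tcpdump.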
Figure 2: Schematic of network topology for the Firewall exercise.
5.2.4
Network Address Translation
Implement the network topology shown in Figure 3. Perform the following steps after steps 1-5 from the Firewall exercise. Note that in this topology, the middlebox container is named nat.
6. Enable network forwarding on the network function container.
$ docker exec nat sysctl net.ipv4.ip_forward=1
7. Use iptables to set up NAT.
$ docker exec nat iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE
$ docker exec nat iptables -A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT
$ docker exec nat iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
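The address transformation you are asked to record for the NAT exercise can be pictured as a translation table. The following is a deliberately simplified sketch: the class name, the sequential port allocation starting at 40000, and the external address 145.12.131.74 are assumptions for illustration, and the real MASQUERADE target relies on the kernel's connection tracking and chooses ports differently.

```python
class MasqueradeTable:
    """Toy source-NAT table mapping internal (ip, port) pairs to ports on one external IP."""

    def __init__(self, external_ip, first_port=40000):
        self.external_ip = external_ip
        self._next_port = first_port
        self._out = {}    # (internal_ip, internal_port) -> external_port
        self._back = {}   # external_port -> (internal_ip, internal_port)

    def translate_outbound(self, src_ip, src_port):
        """Rewrite the source of a packet leaving through the external interface."""
        key = (src_ip, src_port)
        if key not in self._out:
            self._out[key] = self._next_port
            self._back[self._next_port] = key
            self._next_port += 1
        return self.external_ip, self._out[key]

    def translate_inbound(self, dst_port):
        """Return the internal (ip, port) for a reply, or None if there is no mapping."""
        return self._back.get(dst_port)
```

In a packet trace this corresponds to seeing the internal host's address on the internal interface of the middlebox and the middlebox's own address on the external interface for the same connection.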
Figure 3: Schematic of network topology for the NAT exercise.
6
Requirements
6.1
Requirements for Part A
For Part A of the workshop, verify that the following requirements are met.
• The topology in Figure 1 needs to be correctly implemented. Both Docker containers should be able to ping (ICMP) each other.
• Use the nc command-line tool (usage of nc can be found in Section 8) to generate TCP traffic to a specific port. Use tcpdump on the network interfaces of both Docker containers to record packets sent from and received on each interface. Note that you need to launch nc and tcpdump in multiple terminals to listen on multiple interfaces. For example, one terminal runs tcpdump to record packets sent/received on interface eth0 of container int, and another terminal runs tcpdump to record packets sent/received on interface eth0 of container ext.
6.2
Requirements for Part B
For Part B of the workshop (involving virtual network functions), verify that the following requirements are met.
• The topologies in Figures 2 and 3 need to be correctly implemented. Both endpoints (internal and external hosts) should be able to ping (ICMP) each other. You can use the Linux ping utility to test this.
• Use tcpdump on all network interfaces of all Docker containers to record packets sent from and received on each interface. To generate traffic from and to specific ports, use the nc command-line tool (usage of nc can be found in Section 8).
For the Firewall exercise, record traffic for two types of activity.
1. Traffic originating from the External Host to the Internal Host for port 22. This flow will be blocked by the Firewall.
2. Traffic originating from the External Host to the Internal Host for port 80. This flow will not be blocked by the Firewall.
For the NAT exercise, record traffic for one activity.
1. Traffic originating from the Internal Host to the External Host for port 80. Make sure to record the transformation of IP addresses by the NAT middlebox.
7
Deliverables
For Part A:
• An executable script "setup_topology.sh" that performs all actions needed to set up the topology for Part A of this workshop. The script should be complete: after executing it, the containers should be reachable from each other.
• A demo of the working topology to the TA.
• A document containing the packet traces from tcpdump for the TCP traffic test explained in Section 6.1. The document should mention the various types of protocols seen in the packet trace and the role of each protocol. Special attention should be given to ARP packets, as ARP will be used in the next workshops.
For Part B:
• Two executable scripts, setup_fw_topology.sh and setup_nat_topology.sh, that set up the topology for each workshop part. Each script should also perform all necessary configuration of the middlebox, so that no further commands are needed before starting reachability tests.
• A demo of the working topology for both Firewall and NAT.
• A document containing the packet traces from tcpdump for each scenario explained in Section 6.2. The document should mention the various types of protocols seen in the packet trace and the role of each protocol. Special attention should be given to ARP packets, as ARP will be used in the next workshops.
8
Useful References
8.1
Usage of nc
1. Start listening on the destination port (port 80 in this example)
$ nc -l 80
2. Initiate a connection to the destination port (the destination host IP is 192.168.1.2)
$ nc 192.168.1.2 80
3. Send data on the TCP connection by typing plaintext
4. Terminate the TCP connection by hitting Ctrl+C in either terminal
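The four nc steps above correspond to a listen, connect, send, and close sequence on a TCP socket. As a rough illustration of what nc does, the sketch below performs the same sequence with Python's socket module on the loopback interface; it uses an OS-assigned port instead of port 80 so that it runs without root privileges, and the function names are chosen here for illustration.

```python
import socket
import threading


def listen_once(server_sock, received):
    """Accept one connection and collect everything the peer sends (like `nc -l`)."""
    conn, _addr = server_sock.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:          # peer closed the connection
                break
            chunks.append(data)
    received.append(b"".join(chunks))


def demo(message=b"hello over tcp\n"):
    # Port 0 asks the OS for a free port; the workshop listens on port 80 instead.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    received = []
    listener = threading.Thread(target=listen_once, args=(server, received))
    listener.start()

    # Client side: connect, send the text, then close (like typing into `nc host port`).
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(message)

    listener.join()
    server.close()
    return received[0]
```

Calling `demo()` returns the bytes that the listening side received. In a packet trace, the connect step shows up as the TCP three-way handshake and the close step as the FIN exchange.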
8.2
Other useful references
• How to create Docker Images with a Dockerfile (footnote 6)
• A tcpdump Tutorial with Examples — 50 Ways to Isolate Traffic (footnote 7)
• Introduction to Firewall (footnote 8)
• How NAT works (footnote 9)
6 https://www.howtoforge.com/tutorial/how-to-create-docker-images-with-dockerfile/
7 https://danielmiessler.com/study/tcpdump/
8 https://www.geeksforgeeks.org/introduction-to-firewall/
9 https://computer.howstuffworks.com/nat.htm
NFV Workshop 2: SDN for basic traffic forwarding to a network function
Abstract
In this module you will learn how to use an SDN controller to configure OpenFlow switches to forward packets to and from a network function instance.
1
Context
In the previous workshop we learned how a middlebox/network function can be implemented in software and deployed in an enterprise network. However, the network function was placed at the egress point of the enterprise network. Therefore, by design, all traffic going in and out of the enterprise had to traverse the network function. Such a design constrains the flexibility of the NFV infrastructure. We would want the placement of network functions to be no different from the placement of regular applications in an enterprise's on-premises compute cluster. SDN makes it possible to keep deployment flexible while still forwarding traffic correctly, as you will learn in this workshop.
2
Expected outcome
The student will learn how to set up packet forwarding rules on OpenFlow switches such that traffic traverses the network function instance. Forwarding will be done using the Ryu SDN Controller without requiring any changes to the endpoint hosts.
3
Download Repo
The workshop repo contains Dockerfiles and boilerplate code for the Ryu app that the student needs to implement.
$ git clone https://github.gatech.edu/cs8803-SIC/workshop11.git
4
Specification
Implement the topology shown in Figure 1. The switch shown in the figure is an OpenFlow-enabled switch and communicates with an SDN controller (not shown in the figure). For this workshop we will use the Ryu SDN controller, which can be installed by following the instructions on this webpage: Ryu: Getting Started (see footnote 1). The Ryu SDN controller will be used with OpenFlow version 1.3.
1 https://ryu.readthedocs.io/en/latest/getting_started.html
Figure 1: Schematic of network topology. The container name of each host is shown in bold italics.
All packets flowing between the end hosts have to traverse the network function. The end hosts are supposed to be unaware of the added network function, so their routing tables are unchanged. Therefore the internal host issues ARP requests for the MAC address of the external host, and vice versa. These ARP requests are received on the switch and sent to the controller. The controller needs to create an ARP response for the request and send it to the requesting device. The Firewall should be configured to block incoming traffic to port 22.
4.1
Create Docker image of end hosts
The endpoint hosts will be implemented as Docker containers. The Dockerfile for their image is the same as in the previous workshop, and can be found here: Endpoint Docker (see footnote 2). Use the following command to create the Docker image named "endpoint".
$ docker build -f Dockerfile.endpoint -t endpoint:latest .
4.2
Create Docker image for the middlebox
The Firewall network function will also be implemented as a Docker container. For this workshop, we will use the same Docker image for the network function as in the previous workshop and configure the container to behave as a firewall. The Dockerfile for the image can be found here: Middlebox Docker (see footnote 3). Use the following command to create the Docker image named "nf".
$ docker build -f Dockerfile.nf -t nf:latest .
5
Implementation
5.1
Topology
Implement the topology shown in Figure 1. Use the same port names (e.g. eth0) as shown in the topology. For simplicity, hard-code the MAC addresses of the ports when creating network ports on Docker containers, as the same values will be used in the Python code for the Ryu controller.
1. Creating the Open vSwitch bridge
$ sudo ovs-vsctl add-br ovs-br1
# Specifying the URL of the SDN controller (in this example 192.168.56.2 is
# the IP address of the host running the Ryu controller)
$ sudo ovs-vsctl set-controller ovs-br1 tcp:192.168.56.2:6633
# Preventing the OVS switch from entering standalone mode when the controller is down
$ sudo ovs-vsctl set-fail-mode ovs-br1 secure
2. Specifying the MAC address when creating a port on the network function container
$ sudo ovs-docker add-port ovs-br1 eth0 nf --macaddress="00:00:00:00:02:01"
$ sudo ovs-docker add-port ovs-br1 eth1 nf --macaddress="00:00:00:00:02:02"
3. Enabling packet forwarding and adding routing entries on the network function container
$ docker exec nf sysctl net.ipv4.ip_forward=1
$ docker exec nf ip route add 192.168.1.0/24 dev eth0
$ docker exec nf ip route add 145.12.131.0/24 dev eth1
4. Adding routing entries on the end hosts
$ docker exec int ip route add 145.12.131.0/24 dev eth0
$ docker exec ext ip route add 192.168.1.0/24 dev eth0
5.2
Creating and sending ARP Reply
In this workshop, the Internal Host generates ARP requests for the External Host and vice versa. The IP addresses of both hosts are known to the SDN controller application, which creates an ARP reply and sends it to the requesting entity. The following code listing shows how to create and send an ARP reply.
# Imports this method relies on (assumed to sit at the top of the module).
from ryu.lib.packet.arp import arp
from ryu.lib.packet.ethernet import ethernet
from ryu.lib.packet.packet import Packet
from ryu.ofproto import ether

def send_arp_reply(self, datapath, src_mac, src_ip, dst_mac, dst_ip, out_port):
    opcode = 2  # ARP reply
    e = ethernet(dst_mac, src_mac, ether.ETH_TYPE_ARP)
    # hwtype=1 (Ethernet), proto=0x0800 (IPv4), hlen=6, plen=4
    a = arp(1, 0x0800, 6, 4, opcode, src_mac, src_ip, dst_mac, dst_ip)
    p = Packet()
    p.add_protocol(e)
    p.add_protocol(a)
    p.serialize()
    # Send the serialized frame out of the port facing the requester.
    actions = [datapath.ofproto_parser.OFPActionOutput(out_port, 0)]
    out = datapath.ofproto_parser.OFPPacketOut(
        datapath=datapath,
        buffer_id=0xffffffff,
        in_port=datapath.ofproto.OFPP_CONTROLLER,
        actions=actions,
        data=p.data)
    datapath.send_msg(out)
2 https://github.gatech.edu/cs8803-SIC/workshop11/blob/master/Dockerfile.endpoint
3 https://github.gatech.edu/cs8803-SIC/workshop11/blob/master/Dockerfile.nf
The "send_arp_reply" function listed above has been implemented for you in the "workshop_parent.py" file provided with the workshop. You need to implement the function "handle_arp", in which you look up the right MAC address for a requested IP address and call the "send_arp_reply" function with it.
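To see what the serialized ARP reply contains, the same frame can be assembled by hand with the standard library. This is an illustrative sketch and not part of the workshop code: the helper names are hypothetical, and the field values mirror the arguments passed to arp(...) in the send_arp_reply listing (hardware type 1, protocol type 0x0800, address lengths 6 and 4, opcode 2).

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_arp_reply(src_mac, src_ip, dst_mac, dst_ip):
    """Return the raw Ethernet frame carrying an ARP reply (42 bytes before padding)."""
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    eth_header = (mac_to_bytes(dst_mac) + mac_to_bytes(src_mac)
                  + struct.pack("!H", 0x0806))
    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6,        # hardware address length
        4,        # protocol address length
        2,        # opcode: reply
        mac_to_bytes(src_mac), socket.inet_aton(src_ip),   # sender (the answer)
        mac_to_bytes(dst_mac), socket.inet_aton(dst_ip),   # target (the requester)
    )
    return eth_header + arp_body
```

The sender fields carry the answer to the request: the MAC address the requester should use for the IP address it asked about. Which MAC address the controller places there is part of the handle_arp design left to the student.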
5.3
Packet forwarding
Packet forwarding needs to be implemented based on whether a packet has been processed by the firewall or not. In the network topology for this workshop, the switch port that receives a packet determines whether the packet came from the Internal Host, the Firewall, or the External Host; the port also tells whether the packet is travelling from the Internal Host to the External Host or in the other direction. Once the stage of packet processing is known, forwarding takes the following two steps:
1. Update the destination MAC address of the packet to that of the port on the container that the packet should be sent to. For instance, if the packet is received on Port 1 of the switch, it is coming from the Internal Host and needs to be sent to the Firewall. Hence, the destination MAC address of this packet is changed to that of interface eth0 of the Firewall container.
2. Specify the output port on the switch that the packet should be sent out from. For instance, in the previous point, the packet would be sent out of Port 2 on the switch, so that it reaches interface eth0 of the Firewall.
Once the forwarding decision is made, install the flow on the switch. Note that for installing the flow, use only the Ethernet source and destination as the match fields for now.
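The two-step decision described above can be written as a lookup keyed by the ingress port. The sketch below rests on assumptions: only Ports 1 and 2 and the two Firewall MAC addresses come from this handout; the numbering of Ports 3 and 4 and the host MAC addresses are hypothetical and must be replaced with the values from your own topology.

```python
FW_ETH0_MAC = "00:00:00:00:02:01"   # from the topology setup commands
FW_ETH1_MAC = "00:00:00:00:02:02"   # from the topology setup commands
INT_MAC = "00:00:00:00:01:01"       # hypothetical MAC of the Internal Host
EXT_MAC = "00:00:00:00:03:01"       # hypothetical MAC of the External Host

# ingress switch port -> (new destination MAC, output switch port)
# Ports 3 and 4 are assumed numbers for Firewall eth1 and the External Host.
NEXT_HOP = {
    1: (FW_ETH0_MAC, 2),   # from Internal Host: send to Firewall eth0
    3: (EXT_MAC, 4),       # from Firewall eth1: filtered, deliver to External Host
    4: (FW_ETH1_MAC, 3),   # from External Host: send to Firewall eth1
    2: (INT_MAC, 1),       # from Firewall eth0: filtered, deliver to Internal Host
}


def forwarding_decision(in_port):
    """Return (destination MAC to set, switch port to output on) for a packet."""
    try:
        return NEXT_HOP[in_port]
    except KeyError:
        raise ValueError(f"no forwarding rule for switch port {in_port}") from None
```

In a Ryu application the returned pair would feed a set-field action on the Ethernet destination followed by an output action; the table itself captures the whole forwarding policy for this fixed topology.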
6
Deploy and Test
Create a file named setup_topology.sh that creates the network topology. Use the Ryu Python API to create a Python script ryu_app.py containing the SDN controller logic.
Create the network topology using the command
$ ./setup_topology.sh
Start the SDN controller using the ryu-manager command
$ ryu-manager ryu_app.py
Launch the iperf3 server on the External Host
$ docker exec -it ext iperf3 -s
Launch the iperf3 client on the Internal Host
$ docker exec -it int iperf3 -c 145.12.131.92
Verify that the number of packets sent from the Internal Host matches the number of packets received by the External Host. The ifconfig tool can be used for this. The number of Rx/Tx packets on the end hosts should also equal the number of Rx packets on interface eth0 and Tx packets on interface eth1 of the Firewall. Matching packet counts indicate that all packets going out of the enterprise passed through the Firewall. The same test should be done for traffic in the opposite direction. Now check the rules installed on the switch
$ ovs-ofctl -O OpenFlow13 dump-flows <switch_name>
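Comparing packet counters by eye across several containers is error-prone, so a small parser can help. The sketch below is an alternative to reading ifconfig output: it assumes you capture the text of /proc/net/dev from each container (for example with `docker exec int cat /proc/net/dev`), and the function name is chosen here for illustration.

```python
def parse_packet_counts(proc_net_dev_text):
    """Return {interface: (rx_packets, tx_packets)} from the text of /proc/net/dev."""
    counts = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        # Receive columns come first (bytes, packets, ...: 8 columns),
        # followed by the transmit columns (bytes, packets, ...).
        counts[name.strip()] = (int(fields[1]), int(fields[9]))
    return counts
```

With the counts from int, ext, and the Firewall container in hand, the check described above reduces to comparing the Tx count of one endpoint with the Rx count on the Firewall interface facing it, and so on along the path. Expect small differences from background traffic such as ARP.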
7
Deliverables
At the end of the class, the student needs to show the TA:
• Successful iperf3 communication between the Internal and External Hosts.
• Rx/Tx packet counts on the network interfaces of the Internal and External Hosts and the Firewall NF host.
• Flows installed on the switch, using the following command
$ ovs-ofctl -O OpenFlow13 dump-flows <switch_name>
• Verification that ext to int communication on port 22 is blocked.
The following artifacts need to be submitted.
• Script for setting up the topology.
• Python ryu_app.py code containing the Ryu controller logic. You can also submit the workshop_parent.py file in case you made changes to it.
• Document containing a description of your design.
• Flows installed on the switch, using the following command
$ ovs-ofctl -O OpenFlow13 dump-flows <switch_name>
NFV Workshop 3: Connection-affinity with multiple NF instances
Abstract
In this module you will learn how to use an SDN controller to configure traffic forwarding to multiple instances of a Firewall network function.
1
Context
Due to the elastic nature of NFV infrastructure, scaling up and down is commonplace. Hence, traffic forwarding should take into account that there can be multiple instances of an NF. In such a scenario, the forwarding logic should maintain flow affinity, i.e. all packets belonging to the same flow should be sent to the same NF instance. Furthermore, in some cases the packets of flows in both directions must be sent to the same instance, which is called connection affinity.
2
Expected outcome
The student will learn how to set up packet forwarding rules among multiple instances of a Firewall NF such that connection affinity is maintained.
3
Download Repo
This workshop repo is largely the same as the previous workshop's repo, and contains boilerplate code and Dockerfiles, with minor changes to the structure of ryu_app.py. This workshop can be thought of as an extension of the previous workshop, so feel free to use any code from the previous workshop as a starting point. Code from this workshop can in turn be used as a starting point for the NFV project; the minor changes in ryu_app.py are intended to better fit the needs of this workshop and the project.
$ git clone https://github.gatech.edu/cs8803-SIC/workshop12.git
4
Specification
Implement the topology shown in Figure 1. Remember that, for simplicity, you should hard-code the MAC addresses of the ports on Docker containers, as you will need to use them in the SDN controller code (see footnote 1). Configure the Firewall instances to block incoming traffic on port 22.
1 You can assume that MAC address information is provided by an external NFV management service in a real-world system.
Figure 1: Schematic of network topology.
The SDN controller application needs to multiplex flows between the two Firewall instances. In this exercise all packets that have the same values for the following fields are part of the same flow:
• Source IP address
• Destination IP address
• Source TCP port
• Destination TCP port
Once an NF instance has been chosen for a new flow, the forwarding rule should be installed in the OpenFlow switch, so that subsequent packets in the same flow don't trigger the SDN controller. The NF instance should be selected using a round-robin policy. Another requirement of this workshop is connection affinity in packet forwarding. In other words, packets belonging to flows in both directions of a given connection need to traverse the same NF instance.
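Round-robin selection and connection affinity can be combined by keying the controller's state on a direction-independent form of the 4-tuple. The following is one possible sketch, not the required design: the class name and the instance names fw1 and fw2 are hypothetical.

```python
from itertools import cycle


class ConnectionAffinityBalancer:
    """Round-robin over NF instances, remembering the choice per connection."""

    def __init__(self, instances):
        self._next_instance = cycle(instances)
        self._assigned = {}   # canonical connection key -> chosen instance

    @staticmethod
    def connection_key(src_ip, src_port, dst_ip, dst_port):
        """Build a key that is identical for both directions of a connection."""
        a, b = (src_ip, src_port), (dst_ip, dst_port)
        return (a, b) if a <= b else (b, a)

    def pick(self, src_ip, src_port, dst_ip, dst_port):
        key = self.connection_key(src_ip, src_port, dst_ip, dst_port)
        if key not in self._assigned:
            # First packet of a new connection: take the next instance in turn.
            self._assigned[key] = next(self._next_instance)
        return self._assigned[key]
```

For example, after `lb = ConnectionAffinityBalancer(["fw1", "fw2"])`, the call `lb.pick("192.168.1.2", 5050, "145.12.131.92", 80)` and the reverse-direction call `lb.pick("145.12.131.92", 80, "192.168.1.2", 5050)` return the same instance, while a connection from a different source port gets the next instance in the rotation.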
5
Implementation
The implementation of the topology is similar to that of Workshop 2. One important thing to note is that when installing a flow entry that matches against transport-layer ports (TCP ports) on an OpenFlow switch, the packet match predicate needs to contain all the fields in the lower layers as well, namely
• Ethernet type
• Source Ethernet address
• Destination Ethernet address
• IP protocol
• Source IP address
• Destination IP address
• Source TCP port
• Destination TCP port
To implement connection affinity, you need to maintain a mapping between a flow and its chosen NF instance in the controller. When a new flow is seen, check against this state to determine whether the reverse flow has been installed.
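The eight match fields listed above can be gathered into one dictionary before building the flow entry. The sketch below is a plain-Python illustration; the helper name is hypothetical, and the dictionary keys follow the OpenFlow 1.3 match field names used by Ryu (eth_type, eth_src, eth_dst, ip_proto, ipv4_src, ipv4_dst, tcp_src, tcp_dst), so the result is intended to be passed as keyword arguments to the parser's OFPMatch.

```python
ETH_TYPE_IPV4 = 0x0800   # EtherType for IPv4
IP_PROTO_TCP = 6         # IP protocol number for TCP


def tcp_flow_match_fields(eth_src, eth_dst, ipv4_src, ipv4_dst, tcp_src, tcp_dst):
    """Return every field needed to match one direction of a TCP flow."""
    return {
        # Layer 2: required before any IP field may be matched.
        "eth_type": ETH_TYPE_IPV4,
        "eth_src": eth_src,
        "eth_dst": eth_dst,
        # Layer 3: required before any TCP field may be matched.
        "ip_proto": IP_PROTO_TCP,
        "ipv4_src": ipv4_src,
        "ipv4_dst": ipv4_dst,
        # Layer 4.
        "tcp_src": tcp_src,
        "tcp_dst": tcp_dst,
    }
```

Inside a Ryu handler, usage would look like `match = datapath.ofproto_parser.OFPMatch(**fields)`. Leaving out eth_type or ip_proto is a common cause of the switch rejecting a flow entry that matches on TCP ports.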
6
Deploy and Test
Follow the steps mentioned in the previous workshop to implement the traffic forwarding rules. Test the NF chain by generating traffic originating from the Internal Host to the External Host. To generate TCP flows, use the hping3 tool. Make sure that multiple connections are created. The following command uses hping3 to send multiple SYN packets from the int host to the ext host on destination port 80. It uses multiple source ports, therefore generating multiple flows.
$ docker exec -it int hping3 -V -S -p 80 -s 5050 145.12.131.92
You should also test communication using the iperf3 tool, as it generates a large number of packets, which makes checking Receive (Rx) and Transmit (Tx) packet counts easier. Launch the iperf3 server on the External Host
$ docker exec -it ext iperf3 -s
Launch the iperf3 client on the Internal Host and have it generate multiple parallel connections to the server using the "-P" flag.
$ docker exec -it int iperf3 -c 145.12.131.92 -P 16
The above command generates 16 parallel connections to the server. Check the flow tables to ensure that multiple flows are forwarded to multiple network functions.
7
Deliverables
At the end of the class the student needs to show the TA:
• The OpenFlow rules installed on the switches.
• Rx/Tx packet counts on the network interfaces of the Internal and External Hosts, which should equal the total Tx/Rx packet counts on the interfaces of both Firewall hosts that face the given endpoint.
• Flows installed on the switch, using the following command
$ ovs-ofctl -O OpenFlow13 dump-flows <switch_name>
The following artifacts need to be submitted.
• Script for setting up the topology.
• Python ryu_app.py code containing the Ryu controller logic. You can also submit the workshop_parent.py file in case you made changes to it.
• Document containing a description of your design.
• Flows installed on the switch, using the following command
$ ovs-ofctl -O OpenFlow13 dump-flows <switch_name>
Module Project: Network Function Virtualization
1
Background
1.1
Network Function (NF) Chains
Network functions are seldom deployed in isolation. Oftentimes multiple network functions are chained together, such that the output of one NF forms the input of another NF. One example is an enterprise that deploys a Firewall and a NAT NF. Traffic leaving the enterprise first traverses the Firewall and then the NAT before emerging out of the enterprise. The reverse order of NFs is traversed by traffic entering the enterprise.
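The ordering rule in the Firewall and NAT example can be stated in a few lines. This is a minimal sketch; the function name and the direction labels "outbound" and "inbound" are chosen here for illustration.

```python
def traversal_order(nf_chain, direction):
    """Return the NFs in the order a packet traverses them.

    nf_chain lists the NFs in the order seen by traffic leaving the enterprise.
    """
    if direction == "outbound":
        return list(nf_chain)
    if direction == "inbound":
        # Traffic entering the enterprise traverses the chain in reverse.
        return list(reversed(nf_chain))
    raise ValueError(f"unknown direction: {direction!r}")
```

For the chain `["fw", "nat"]`, outbound traffic traverses fw and then nat, while inbound traffic traverses nat and then fw.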
1.2
NFV Orchestrator
A typical NFV control plane consists of an NFV Infrastructure (NFVI) Manager and SDN Controller (Figure 1). The functions of NFVI Manager and SDN Controller are described below.
Figure 1: High level architecture of a typical NFV orchestrator.
NFVI Manager. The NFVI Manager is responsible for managing the computational resources (servers) in the NFV cluster. It deploys network function (NF) instances on the servers and monitors them for failures and resource utilization. System administrators communicate with the NFVI Manager to register NF chains for a specific tenant's traffic; this communication can be done using a high-level API like REST. Information about registered NF chains and deployed NF instances is stored in the "cluster state".
SDN Controller. The SDN Controller is responsible for managing the network resources and configuring packet forwarding so that end-to-end NF chains can be realized. It queries the "cluster state" to determine where specific NF instances are deployed. It also discovers the network topology and records that information in the cluster state. Communication between the SDN Controller and the switches uses the SDN southbound API (OpenFlow).
2
Expected outcome
The student is going to learn how to build an orchestration layer for NFV. This involves creating a web service for deploying virtual network functions which communicates with an SDN control application for configuring traffic forwarding to the created NF instances. The student will use the created setup for dynamically scaling the NF chain deployment while ensuring connection-affinity in packet forwarding.
3
Download Repo
The repo for the NFV project provides scripts to set up the topology as specified in Section 4.1, as well as various scripts and JSON files to test your NFV orchestrator. Please use the Dockerfiles from previous workshops to build Docker images for hosts and switches. Please use ryu_app.py and workshop_parent.py from the previous workshop as a starting point for this project.
$ git clone https://github.gatech.edu/cs8803-SIC/project3.git
The test code, traffic profiles, and other JSON files serve as a guideline to start with; students are expected to modify the test profiles for the demo.
4
Sections of the project
4.1
Network topology
The network topology of the infrastructure is as shown in Figure 2. For the sake of simplicity, this is fixed for the project. Use this assumption to simplify your implementation.
Figure 2: Schematic of network topology. Network functions that are deployed dynamically will be connected to one of the two switches (ovs-br1 and ovs-br2 ).
4.2
Running NFV Control Plane on a single-node
This project is designed to be deployed on a single machine, as in the previous workshops. The entire NFV control plane, i.e. both the NFVI Manager and the SDN Controller, is implemented as a single Python process (a Ryu application running with ryu-manager). The "cluster state" is maintained as in-memory data structures. The NFVI Manager should use the "subprocess" module in Python to execute commands for creating Docker containers on the host machine.
4.3
NFVI Manager : Web service for launching network functions
Use the REST linkage feature of the Ryu controller (as described in the tutorial in footnote 1) to listen for deployment and scale-up requests coming from the system administrator. The requests are sent using the HTTP PUT method, and the body of each request is a JSON document. HTTP requests containing a JSON body can be sent using the following bash command.
$ curl --header "Content-Type: application/json" --request PUT \
    --data @sfc_1.json http://localhost:8080/register_sfc
You need to support two types of requests: (1) registering a chain of network functions, and (2) launching instances of network functions belonging to a service chain.
4.3.1
Registering a network function chain
The JSON document containing the request body for registering a network function chain should look like the following and contain all the mentioned fields.
{
  "nf_chain": ["fw", "nat"],             # custom identifiers for NFs (referenced below)
  "chain_id": 1,
  "nat": {                               # reference to the 2nd NF in the nf_chain field
    "image": "nat",                      # name of Docker image
    "interfaces": ["eth0", "eth1"],      # interfaces to be created on container
    "init_script": "/init_nat.sh"        # script to be run on container startup
  },
  "fw": {                                # reference to the 1st NF in the nf_chain field
    "image": "fw",                       # name of Docker image
    "interfaces": ["eth0", "eth1"],      # interfaces to be created on container
    "init_script": "/init_fw.sh"
  },
  "SRC": {                               # specifies where the src host is connected in the nw topology
    "MAC": "00:00:00:00:00:01",
    "IP": "192.168.1.2",
    "SWITCH_DPID": 1,                    # switch DPID that SRC connects to
    "PORT": 1                            # switch port that SRC connects to
  },
  "DST": {                               # specifies where the dest host is connected in the nw topology
    "MAC": "00:00:00:00:00:02",
    "IP": "145.12.131.92",
    "SWITCH_DPID": 2,                    # switch DPID that DST connects to
    "PORT": 1                            # switch port that DST connects to
  }
}
Footnote 1: https://osrg.github.io/ryu-book/en/html/rest_api.html
Each chain will be associated with a single SRC-DST host pair. A single SRC-DST host pair can only have a single chain associated with it.
IMPORTANT. For this project, since NF instances are created on the fly, all configuration has to be done on the fly. You will therefore need to maintain two Docker images, one for each NF (fw and nat in this case). Each image should contain an executable script that will be called upon container creation with the appropriate arguments for configuring the container. For instance, in the above NF chain registration request, the NF "nat" has an initialization script in the file "/init_nat.sh", which contains all the commands for configuring the container to start performing address translation, given that it is called with the right arguments. The arguments for this script are described in the next section (Section 4.3.2).
4.3.2
Launching instances of an NF chain
The JSON body of a request for launching NF instances should look like the following.
{
  "chain_id": 1,
  "nat": [                               # array with each element being a nat instance
    {                                    # instance 1 of nat
      "args": ["192.168.1.1/24", "145.12.131.74/24"],   # arguments for init script
      "ip": {                            # IPs for interfaces (optional)
        "eth0": "192.168.1.1",
        "eth1": "145.12.131.74"
      }
    },
    {                                    # instance 2 of nat
      "args": ["192.168.1.11/24", "145.12.131.75/24"],
      "ip": {
        "eth0": "192.168.1.11",
        "eth1": "145.12.131.75"
      }
    }
  ],
  "fw": [                                # array with each element being a fw instance
    { "args": ["192.168.1.0/24", "145.12.131.0/24"] }   # instance 1 of fw
  ]
}
Note that this API is not restricted to launching the initial instances of the network functions. The student should implement the API in such a way that it can also be used to add NF instances to an existing/running NF chain. In this project, students are not required to implement the scaling down primitive, as that requires complicated monitoring of flow termination.
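The requirement that one launch API serves both the initial deployment and later scale-up can be met by keeping the instance list per chain and per NF and appending to it on every launch request. The following is a minimal sketch of such an in-memory cluster state; the class name `ClusterState`, its method names, and the instance naming scheme are illustrative assumptions, not part of the handout.

```python
from collections import defaultdict


class ClusterState:
    """Hypothetical in-memory record of registered chains and NF instances."""

    def __init__(self):
        self.chains = {}  # chain_id -> registration request body
        # chain_id -> NF name -> list of instance records
        self.instances = defaultdict(lambda: defaultdict(list))

    def register_chain(self, request):
        # Every NF named in nf_chain must have its own definition block.
        missing = [nf for nf in request["nf_chain"] if nf not in request]
        if missing:
            raise ValueError(f"no definition for NFs: {missing}")
        self.chains[request["chain_id"]] = request

    def launch_instances(self, request):
        """Append new instances; works for initial launch and for scale-up."""
        chain_id = request["chain_id"]
        if chain_id not in self.chains:
            raise KeyError(f"chain {chain_id} is not registered")
        added = []
        for nf in self.chains[chain_id]["nf_chain"]:
            for spec in request.get(nf, []):
                index = len(self.instances[chain_id][nf]) + 1
                record = {
                    "name": f"chain{chain_id}-{nf}-{index}",  # assumed naming scheme
                    "args": spec["args"],
                    "ip": spec.get("ip", {}),  # "ip" is optional in the request
                }
                self.instances[chain_id][nf].append(record)
                added.append(record["name"])
        return added
```

Because instance indices are derived from the current list length, a second launch request for the same chain simply continues the numbering, which is what a scale-up needs.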
4.4
Launching network functions
Upon receiving the request to launch NF instances for a given NF chain, you need to launch Docker containers. When launching Docker containers you need to take the following steps:
1. Select the switch to which the container will be connected.
2. Run the container using the Docker CLI.
3. Add ports to the container and the chosen switch using the ovs-docker CLI.
4. Run the initialization script in the container with the arguments provided in the JSON body of the launch request (Section 4.3.2).
5. Extract relevant information from the container (e.g. MAC addresses) and inform the NF chaining application (running in the SDN controller) of the new NF instance.
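Steps 2 to 5 above map naturally onto a list of CLI invocations handed to the "subprocess" module. The sketch below only builds the command lists, so it can be inspected without Docker installed. The function names are hypothetical, and it assumes that the image keeps running when started detached, that the container name doubles as the ovs-docker container argument, and that reading `/sys/class/net/<iface>/address` inside the container is an acceptable way to obtain the MAC address.

```python
import subprocess


def build_launch_commands(name, image, bridge, interfaces, init_script, args):
    """Return the CLI commands for steps 2-4 as argument lists."""
    commands = [
        # Step 2: start the container detached and without Docker networking.
        ["docker", "run", "-d", "--name", name,
         "--network", "none", "--privileged", image],
    ]
    # Step 3: one OVS port per requested interface on the chosen bridge.
    for iface in interfaces:
        commands.append(["ovs-docker", "add-port", bridge, iface, name])
    # Step 4: run the init script with the arguments from the launch request.
    commands.append(["docker", "exec", name, init_script, *args])
    return commands


def mac_query_command(name, iface):
    """Step 5: command that prints the MAC address of one interface."""
    return ["docker", "exec", name, "cat", f"/sys/class/net/{iface}/address"]


def run_all(commands):
    """Execute the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Step 1 (choosing between ovs-br1 and ovs-br2) is left to the caller through the `bridge` parameter.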
4.5
Load balancing between network function instances
The load balancing policy to be used in this project is basic round-robin. Previously unseen flows need to be balanced between the instances of a given network function in the chain using the round-robin policy.
4.5.1
Connection-affinity for NAT NF
As you have seen in previous workshops, the NAT NF modifies packet headers. Therefore, maintaining connection affinity for a chain containing such an NF is more complex than simply looking up previously installed flows. One hack that you can use is to maintain another table in the SDN controller that keeps track of which NF instance a particular flow emerged from. For example, when NAT instance 2 changes the header of the packets in a flow and sends them to the connecting switch, you know that the modified flow was processed by NAT instance 2, so packets of the flow in the opposite direction need to be sent to NAT instance 2 as well.
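The round-robin policy and the extra affinity table can be combined in one small structure. The sketch below is one possible shape, not the required design: the class `NfLoadBalancer`, the 5-tuple flow key, and the `record_translated` hook (assumed to be called when a packet with a rewritten header arrives from a NAT instance) are all illustrative names.

```python
def reverse(flow):
    """Swap source and destination of a (src_ip, src_port, dst_ip, dst_port, proto) key."""
    src_ip, src_port, dst_ip, dst_port, proto = flow
    return (dst_ip, dst_port, src_ip, src_port, proto)


class NfLoadBalancer:
    """Round-robin over the instances of one NF, with per-flow affinity."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0       # counts how many new flows have been assigned
        self.affinity = {}   # flow key -> instance that handles it

    def add_instance(self, instance):
        # Scale-up: existing flows keep their instance, new flows may use this one.
        self.instances.append(instance)

    def pick(self, flow):
        if flow not in self.affinity:
            chosen = self.instances[self._next % len(self.instances)]
            self._next += 1
            self.affinity[flow] = chosen
        return self.affinity[flow]

    def record_translated(self, instance, translated_flow):
        # The return traffic matches the reverse of the rewritten header,
        # so pin that reverse flow to the instance that did the rewrite.
        self.affinity[reverse(translated_flow)] = instance
```

Pinning the reverse of the translated flow means return packets never go through the round-robin choice, which is what keeps both directions on the same NAT instance.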
4.6
Responding to ARP requests
In the JSON specification for registering an NF chain, the admin provides the location in network topology where the endpoints are connected (check the sample JSON listing in Section 4.3.1). In addition to the endpoints, when creating each NF instance you are also supposed to record the location in network topology where the NF instance was deployed. The endpoints and NF instances are the two sources from where ARP requests can originate. Now you know all the points where an ARP request can originate for a given NF chain. So when the SDN controller receives an ARP request, it would know the NF chain that the request corresponds to. You are also supposed to maintain a list of all the IP addresses allotted for that NF chain, including the IP addresses of endpoints as well as the NF instances. Search through these IP addresses to find out the one requested by the ARP request and respond with the corresponding MAC address.
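The lookup described above needs two pieces of per-chain data: the set of attachment points (switch DPID, port) from which ARP requests may originate, and the IP-to-MAC bindings of the endpoints and NF instances. A minimal sketch follows; the function names and the shape of the per-instance `interfaces` records are assumptions made for illustration, while the SRC/DST fields follow the registration JSON of Section 4.3.1.

```python
def build_chain_view(registration, instances):
    """Collect attachment points and IP->MAC bindings for one NF chain."""
    locations, ip_to_mac = set(), {}
    for end in ("SRC", "DST"):
        host = registration[end]
        locations.add((host["SWITCH_DPID"], host["PORT"]))
        ip_to_mac[host["IP"]] = host["MAC"]
    for instance in instances:
        # Assumed record layout: one dict per interface of a deployed NF instance.
        for iface in instance["interfaces"]:
            locations.add((iface["dpid"], iface["port"]))
            ip_to_mac[iface["ip"]] = iface["mac"]
    return {"locations": locations, "ip_to_mac": ip_to_mac}


def answer_arp(chain_views, dpid, in_port, target_ip):
    """Return the MAC for target_ip within the chain the request came from."""
    for view in chain_views:
        if (dpid, in_port) in view["locations"]:
            return view["ip_to_mac"].get(target_ip)
    return None  # request did not originate from a known attachment point
```

The controller would call `answer_arp` from its packet-in handler and build the ARP reply from the returned MAC address.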
5
Testing
For testing the implementation, you need to create a traffic profile in which the number of flows increases with time. For simplicity, you can divide the testing period into multiple periods, each with a certain number of active flows. You can create new flows in each period (flows don't have to live beyond the period). The time-based profile of the traffic is shown in the following.
{
  "profiles": [
    {
      "src_container": "src1",
      "dst_container": "dst1",
      "dst_ip": "145.12.131.92",
      "flows": [
        { "start_time": 0, "end_time": 10, "num_flows": 5 },
        { "start_time": 10, "end_time": 30, "num_flows": 15 }
      ]
    },
    {
      "src_container": "src2",
      "dst_container": "dst2",
      "dst_ip": "145.12.131.92",
      "flows": [
        { "start_time": 0, "end_time": 10, "num_flows": 5 },
        { "start_time": 10, "end_time": 30, "num_flows": 15 }
      ]
    }
    ...
  ]
}
Create a traffic generator that takes a configuration file (like the one above) as input and generates network traffic between the specified containers. While the profile is running, use scripts to increase the number of instances of a particular NF in a particular NF chain. This can be driven by a configuration file too.
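A traffic generator for such a profile can first flatten the configuration into a time-ordered list of events and then start one client per event. The sketch below shows only that flattening plus the construction of a client command. It assumes iperf3 is installed in the source containers and that a server is already listening at the destination; the handout does not prescribe a traffic tool, and the function names are hypothetical.

```python
def expand_profile(config):
    """Flatten a profile into (start, duration, src_container, dst_ip, num_flows) events."""
    events = []
    for profile in config["profiles"]:
        for period in profile["flows"]:
            events.append((
                period["start_time"],
                period["end_time"] - period["start_time"],
                profile["src_container"],
                profile["dst_ip"],
                period["num_flows"],
            ))
    return sorted(events)  # ordered by start time


def iperf_command(event):
    """Build the client command for one event (assumes iperf3 in the container)."""
    _start, duration, src, dst_ip, num_flows = event
    return ["docker", "exec", src, "iperf3", "-c", dst_ip,
            "-P", str(num_flows), "-t", str(duration)]
```

A driver loop would sleep until each event's start time and launch the command with `subprocess.Popen` so that periods belonging to different chains can overlap.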
6
Deliverables
6.1
Code
Include the code for all the previous sections. It should contain a README that explains the dependencies required to be installed for the code to work.
6.2
Test cases
Include the various traffic profiles that you tested the end-to-end system with.
6.3
Demo
For the demo, students are expected to have at least 2 chains deployed on the shared switch infrastructure. Chain 1 will use the src1 and dst1 endpoints; chain 2 will use the src2 and dst2 endpoints. The flow of the demo is as follows.
1. Set up the topology with all the internal and external hosts and switches.
2. Register 2 chains. The first chain should be a FW-NAT chain. You can choose the component functions of the second chain: it can be only FW, only NAT, or identical to the 1st chain.
3. Launch both chains with 1 instance of each component NF in each chain.
4. Execute the test profile script with a given number of flows. You can use the following template. For each chain, create 4 flows from the internal endpoint to the external endpoint of that chain, and make these 4 flows last for 30 seconds.
5. After the first 30 seconds of the above test profile, add an additional instance of each NF in the first chain (FW-NAT). Wait until the containers are created.
6. Then use a new traffic profile in which, for each chain, 8 flows are created from the internal endpoint to the external endpoint of that chain, and make these 8 flows last for 30 seconds. We should be able to see that the new flows pass through the new instances created by the scale command in Step 5 and that they follow connection affinity.
6.4
Written report
Submit a report that outlines the implementation of the NFV Control Plane. The report should cover the following points:
• Description of the data structures used to maintain cluster state
• Technique used for maintaining connection affinity in spite of a packet-modifying NAT NF
System Issues in Cloud Computing Cloud System: Programming Frameworks (Part 1) KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Setting the stage
• How to write large-scale parallel/distributed apps with scalable performance from multicores to clusters with thousands of machines?
  – Make the programming model simple
    • Liberate the developer from fine-grain concurrency control of the application components (e.g., threads, locks, etc.)
    • Dataflow graph model of the application with application components (e.g., subroutines) at the vertices, and edges denoting communication among the components
  – Exploit data parallelism
    • Require the programmer to be explicit about the data dependencies in the computation
  – Let the system worry about distribution and scheduling of the computation respecting the data dependencies
    • Use the developer provided application component as the unit of scheduling and distribution
• How to handle failures transparent to the app?
  – In data centers, it is not an “if”, it is a “when” question
Roadmap for this lecture Map-reduce Dryad Spark Pig Latin Hive Apache Tez
Map-Reduce
• Input and output of each of map and reduce: <key, value> pairs
• Example: emit the number of occurrences of names in DOCs
  – Names of interest: w = “Kishore”, w = “Arun”, w = “Drew”
  – Input to map: key = file name, value = contents
  – Map: look for unique names
  – Reduce: aggregate
  – Output: key = unique name, value = number
[Figure: the input documents feed three Map tasks, whose outputs feed three Reduce tasks]
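The name-counting example can be written down as a pair of functions plus a toy, single-process stand-in for the runtime. This is only a sketch of the programming model: `run_mapreduce` is a hypothetical helper that does the grouping in memory, whereas a real Map-Reduce runtime shards the input and moves data between machines.

```python
from collections import defaultdict

NAMES = {"Kishore", "Arun", "Drew"}  # the names of interest from the slide


def map_fn(filename, contents):
    """Input: key = file name, value = contents. Emit (name, 1) per occurrence."""
    for word in contents.split():
        if word in NAMES:
            yield word, 1


def reduce_fn(name, counts):
    """Aggregate all the counts emitted for one name."""
    return name, sum(counts)


def run_mapreduce(documents):
    """Toy runtime: group map output by key, then call reduce once per key."""
    groups = defaultdict(list)
    for filename, contents in documents.items():
        for key, value in map_fn(filename, contents):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())
```

The developer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the runtime takes over.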
Why Map-Reduce?
• Several processing steps in giant-scale services are expressible as map-reduce
• Domain expert writes
  – map
  – reduce
• Runtime does the rest
  – Instantiating the number of mappers and reducers
  – Data movement
• Example: ranks for pages target1 to targetn
  – Input keyspace: urls
  – Map output keyspace: unique targets
  – Reduce output: (target1, ) ... (targetn, )
[Figure: two Map tasks over the url keyspace feeding two Reduce tasks over the unique targets]
Map-reduce Summary • Developer responsibility – Input data set – Map and reduce functions
• System runtime responsibility – Shard the input data and distribute to mappers – Use distributed file system for communication between mappers and reducers
Heavy lifting by Map-Reduce runtime
• Spawn workers
• Assign mappers
• Assign reducers
• Plumb mappers to reducers
• Manage machines (M + R > N)
• Map phase: read local disk, parse, call map; intermediate files are written on local disks (R by each mapper)
• Reduce phase: remote read, sort, call reduce
[Figure: the user program forks a master and workers; the input files are auto-split into M splits (split 0 to split 4); map workers read the splits and write intermediate files locally; R reduce workers write the output files (file 0, file 1)]
Issues to be handled by the Runtime Master data structures - Location of files created by completed mappers - Scoreboard of mapper/reducer assignment - Fault tolerance * Start new instances if no timely response * Completion message from redundant stragglers - Locality management - Task granularity - Backup tasks
Dryad design principles • Map-reduce aims for simplicity for a large class of applications at the expense of generality and performance – E.g., files for communication among app components, two-level graph (map-reduce with single input/output channel)
• Dryad – General acyclic graph representing the app • Vertices are application components – arbitrary set of inputs and outputs
• Edges are application-specified communication channels (shared memory, TCP sockets, files)
Dryad: primitives
• App developer writes subroutines
• Uses graph composition primitives via C++ library to build the application
  – Cloning, merging, composition, fork-join
• Encapsulate
  – Create a new vertex out of a subgraph
• Specify transport for edges
  – Shared memory, files, TCP
[Figure: examples of graphs built from vertices A, B, C, and D with the composition operators ^ (n copies), >=, >>, and ||]
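Dryad system
[Figure: application graph, job manager, and name server; control plane from the job manager to the compute nodes (C), each executing a subgraph of the application; data plane among the compute nodes over files, TCP, and shared memory (SM)]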
Data center programming challenges
• Need fault tolerance
  – Map-reduce approach
    • Use stable storage for intermediate results
    • Make computations idempotent
  – Cons of this approach
    • Disk I/O expensive
    • Inhibits efficient re-use of intermediate results in different computations
[Figure: a map stage followed by reduce stages]
Spark design principles
• Need performance and fault tolerance
  – Keep intermediate results in memory
  – Provide efficient fault tolerance for in-memory intermediate results
[Figure: map and reduce stages sharing intermediate results through memory (MEM)]
Spark secret sauce
• Resilient Distributed Dataset (RDD)
  – In-memory immutable intermediate results
  – Fault tolerance by logging the lineage of an RDD
    • i.e., the set of transformations that produced the RDD
[Figure: transformation T1 applied to RDD1 to produce RDD2]
incremental scalability
• Replication
  – Overlaid on consistent hashing to increase availability
• Eventual consistency
  – Versioning and vector clocks to achieve consistency
Dynamo: summary of techniques
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
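The first row of the table, consistent hashing for incremental scalability with replication overlaid on it, can be illustrated with a small hash ring. The sketch below is not Dynamo's implementation: the class `Ring`, the number of virtual nodes, and the key names are illustrative assumptions. It shows a key being mapped to a list of distinct successor nodes, and that adding a node only moves keys onto the new node.

```python
import bisect
import hashlib


def _hash(text):
    """Map a string to a position on the ring (MD5 used here only as a hash)."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class Ring:
    """Consistent-hash ring with virtual nodes (a simplified sketch)."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several points on the ring.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))

    def preference_list(self, key, n=3):
        """First n distinct nodes met when walking clockwise from the key."""
        start = bisect.bisect(self._points, (_hash(key),))
        result = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result

    def owner(self, key):
        return self.preference_list(key, 1)[0]
```

The first node in the preference list coordinates the key and the remaining ones hold the replicas; comparing `owner` before and after adding a node shows that every key either stays put or moves to the newcomer.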
Cloud Storage Systems
• Amazon Dynamo
• Facebook Haystack
• Google Bigtable
• Facebook Cassandra
• Google Spanner
Haystack How to build a scalable/available photostore? • Good example of a problem-oriented design • Design point NOT for generality as in Amazon Dynamo but customization
Volume of User Interaction (circa June 2010)
Metric | April 2009 | Current
Total | 15 billion photos, 60 billion images, 1.5 petabytes | 65 billion photos, 260 billion images, 20 petabytes
Upload Rate | 220 million photos/week, 25 terabytes | 1 billion photos/week, 60 terabytes
Serving Rate | 550,000/sec | 1 million images/sec
[Figure: original design with the CDN in front of NFS-based photo storage]
• Metadata bottleneck
• Miss in CDN expensive
[Figure: Haystack-based design]
• Store multiple photos in a file
• Replaces photo server and NAS
Upshot
• Metadata for “long-tail” photo in haystack cache
• Photo itself may be in haystack-store
Haystack store – Log-structured file system
Upload
1. Request photo upload
2. Find volume to write
3. Return load-balanced writable volume
4. Write to replicated physical volumes for availability
   – Append to end of volume
Download
1. Request photo download
2. Find logical volume
3. Assign load-balanced physical volume to read
4. Browser gets the photo handle (needle)
Directory decides the read path depending on long-tail or not
• Direct to Haystack cache and then store, OR
• Via CDN
Haystack Summary • Haystack directory has the mapping given a photo URL – Haystack organized into logical volumes – Each photo has a marker called “needle” in the haystack – URL to needle mapping in the directory – Directory tells the browser whether to go to the CDN or the cache
Haystack Summary
• Haystack cache responds to requests
  – Direct requests from the browser
    • Metadata is most likely in the cache; if not, bring it from disk and cache it
  – Requests from the CDN => do not cache
• Haystack store
  – Log-structured file system
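The idea of storing many photos in one file, appending on write and locating each photo through a small in-memory index, can be sketched as follows. This is an illustration of the append-only volume idea only; the class name `HaystackVolume` and its methods are hypothetical, an in-memory `BytesIO` stands in for the volume file, and replication, the needle format, and deletion are left out.

```python
import io


class HaystackVolume:
    """Append-only volume holding many photos, with an in-memory needle index."""

    def __init__(self):
        self._log = io.BytesIO()  # stand-in for the large volume file on disk
        self._index = {}          # photo_id -> (offset, size)

    def append(self, photo_id, data):
        """Write the photo at the end of the volume and remember where it is."""
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(data)
        self._index[photo_id] = (offset, len(data))

    def read(self, photo_id):
        """One index lookup plus one positioned read, no per-photo file metadata."""
        offset, size = self._index[photo_id]
        self._log.seek(offset)
        return self._log.read(size)
```

Re-uploading a photo appends a new copy and repoints the index, so earlier bytes are never modified in place.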
Bigtable • What is the right abstraction for dealing with the data produced and consumed by a variety of Google projects – Web crawling/indexing, Google earth, Google finance, Google news, Google analytics, … • ALL of them generate LOTS of structured data • Interactive (human in the loop) and batch
Why a new abstraction?
• Relational data model is too general
  – Does not lend itself to efficient application-level locality management and hence performance
• Key-value stores are too limiting for multidimensional indexing into semi-structured data repositories
• A new abstraction allows incremental evolution of features and implementation choices (e.g., in memory, disk, etc.)
Data model of Bigtable • Sparse, distributed, persistent, multi-dimensional, sorted data structure • Indexed by {row key, column key, timestamp} • “value” in a cell of the table is an un-interpreted array of bytes
Bigtable: Rows
• Row creation is automatic on write to the table
  – Row name is an arbitrary string (up to 64 KB in length)
  – Lexicographically ordered to aid locality management
    • E.g., reverse URL (hierarchically descending from DNS name resolution perspective)
• Row access is atomic
• Tablet
  – A range of rows, unit of distribution and load-balancing
  – Locality of communication for access to a set of rows
Bigtable: Columns
• Column
  – A row may have any number of columns
  – Column family has to be created before populating the columns with data
    • Two-level name for column key: family:qualifier
    • Once a family is created, any column key within the family can be used
  – Examples
    • “Contents” stores web pages (so no additional qualifier)
    • “anchor” is a family with 2 members in this table; “value” stored is the link text
Bigtable: Timestamps
• Associated with multiple versions of same data in a cell
  – Writes default to current time unless explicitly set by clients
• Lookup options:
  – “Return most recent K values”
  – “Return all values in timestamp range (or all values)”
• Column families can be marked w/ attributes:
  – “Only retain most recent K values in a cell”
  – “Keep values until they are older than K seconds”
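The {row key, column key, timestamp} indexing and the "most recent K values" options can be made concrete with a toy in-memory table. The sketch below is an assumption-laden illustration, not Bigtable's API: `TinyTable` and its methods are hypothetical names, the retention attribute is applied to the whole table rather than per column family, and the row and column names are only examples of the reverse-URL and family:qualifier conventions.

```python
import time


class TinyTable:
    """Toy map from (row key, column key, timestamp) to an uninterpreted value."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # "only retain most recent K values"
        self.cells = {}  # (row, column) -> [(timestamp, value)], newest first

    def put(self, row, column, value, timestamp=None):
        if timestamp is None:
            timestamp = time.time()  # writes default to the current time
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, k=1):
        """Return the most recent k values of one cell, newest first."""
        return [value for _, value in self.cells.get((row, column), [])[:k]]

    def scan(self, start_row, end_row):
        """Cells whose row key lies in [start_row, end_row), in sorted order."""
        return sorted(key for key in self.cells if start_row <= key[0] < end_row)
```

Because rows are compared lexicographically, a scan over a reverse-URL prefix returns all pages of one domain together, which is the locality argument made on the Rows slide.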
Bigtable: API
• Read operation
• Write operation for rows
• Write operations for tables and column families
• Administrative operations
• Server side code execution
• Map-reduce operations
Cassandra • Decentralized structured storage system • Data model similar to Bigtable – rows and column families with timestamps
• Dynamo’s ideas for distribution and replication – Uses the row key and consistent hashing (a la Dynamo) to distribute and replicate for load balancing and availability
Spanner
– Focus on managing cross-datacenter replicated data
– Automatic failover among replicas
  • Fault tolerance and disaster recovery
– Evolved from Bigtable to support applications that need stronger semantics in the presence of wide-area multi-site replication
– Globally distributed multiversion database
  • Replication can be controlled at a fine grain
  • Externally consistent reads and writes (given a timestamp) despite global distribution
  • Globally consistent reads across the database
– Features
  • Schematized semi-relational data model
  • Query language and general-purpose transactions (ACID)
  • Interval-based global time
Resources for this Lecture
• NoSQL storage survey
  – http://www.christof-strauch.de/nosqldbs.pdf
• Amazon Dynamo
  – http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
• Facebook Haystack
  – https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
• Google Bigtable
  – http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
• Facebook Cassandra
  – http://dl.acm.org/citation.cfm?id=1773922
• Google Spanner
  – http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
• Google File System
  – http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
• Microsoft Blob storage
  – https://azure.microsoft.com/en-us/documentation/articles/storage-introduction/#blob-storage
• Hadoop Distributed File System
  – https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf
  – http://dl.acm.org/citation.cfm?id=1914427
System Issues in Cloud Computing Cloud System: Resource Management for the Cloud KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Outline
• Setting the context and terminologies
• Fair share schedulers
• Mesos: a fine-grained resource scheduler
• Hadoop YARN
• Google’s Borg resource manager
• Mercury: a framework for integrated centralized and distributed scheduling
Setting the context and terminologies
• Resources in the context of Cloud Apps
  – CPU
  – Memory footprint and effective memory access time
  – Storage (bandwidth, latency)
  – Network (bandwidth, latency)
• Resource utilization
  – % time a resource is actively used
• An application may use all or most of the resources
• Begs the questions
  – Which resource’s utilization is more important?
  – Should we optimize for one or multiple resources’ utilization?
Setting the context and terminologies (contd.) • Basics – CPU scheduling in traditional OS • Focus on CPU utilization • FCFS, SJF, Round-robin, SRTF, priority queues, multi-level priority queues, … • Metrics used are turnaround time and response time, and variance of these metrics
• Fairness – User’s perception of resource allocation policy – e.g., round-robin gives a feeling of fairness if all processes are created equal • Each process gets 1/n processor resource, where n = number of processes
– What if all processes are not created equal? • “Fair sharing” takes into account priorities – You get what you pay for!
Setting the context and terminologies (contd.)
• Variants of fair sharing
  – Equal share
  – Max-min fairness
  – Weighted max-min fairness
[Figure: bar charts on a 0 to 100 scale comparing equal, max-min, and weighted max-min allocations]
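Max-min fairness can be computed by repeatedly offering every unsatisfied user an equal split of what is left, fully satisfying those who ask for no more than that split, and redistributing the remainder. The sketch below follows that procedure for a single resource; the function name and the example numbers are illustrative, not taken from the slide.

```python
def max_min_fair(capacity, demands):
    """Max-min fair allocation of one resource among users with given demands."""
    allocation = {}
    pending = dict(demands)  # users whose demand is not yet settled
    remaining = capacity
    while pending:
        share = remaining / len(pending)  # equal split of what is left
        small = [user for user, demand in pending.items() if demand <= share]
        if not small:
            # Nobody can be fully satisfied: everyone left gets the equal split.
            for user in pending:
                allocation[user] = share
            break
        for user in small:
            # Users asking for less than the split get exactly their demand;
            # what they leave unused is redistributed in the next round.
            allocation[user] = pending.pop(user)
            remaining -= allocation[user]
    return allocation
```

For example, with a capacity of 12 and demands of 2, 4, and 10, the first two users are fully satisfied and the third receives the remaining 6, whereas an equal share would have capped everyone at 4 and left capacity idle. The weighted variant divides the remainder in proportion to per-user weights instead of equally.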
Setting the context and terminologies (contd.) • How do we extend “Fair sharing” to the cloud? – Need to consider all the resources not just CPU
• Computations running in data centers are not all uniform – Data intensive workloads – Compute intensive workloads – Network intensive workloads
System Issues in Cloud Computing Resource Management for the Cloud: Fair Share Schedulers KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Examples of fair share scheduling for Cloud Hadoop – Uses max-min scheduling • K-slots per machine (CPU + memory) – A slot is a fraction of the machine’s resources » E.g., 16-CPUs, 1 GB DRAM: with 16 slots, each slot = 1-CPU plus 64 MB DRAM
• A job may consist of a number of tasks • Assign one task per slot • Apply max-min fairness in mapping tasks to slots
– Could result in under-utilization of slot resources or over-utilization (e.g., memory thrashing) • E.g., if a task needs 128 MB, it will lead to thrashing in the above example; if it needs only 32 MB then half the allocation is wasted
Examples of fair share scheduling for Cloud (Contd.)
• Dominant Resource Fairness (DRF)
  – Takes a holistic view of a machine’s resources
    • E.g.: Total resource of a machine: 10 CPUs and 4 GB
    • Task 1 needs: 2 CPUs, 1 GB
      – Dominant resource of Task 1 is memory
      – If Task 1 is allocated a slot on this machine its “dominant resource share” is 25%
  – DRF uses max-min fairness to dominant shares
    • E.g., System: 9 CPUs + 18 GB
    • A: each task 1 CPU + 4 GB (dominant: memory)
    • B: each task 3 CPU + 1 GB (dominant: CPU)
    • Equalize allocation for A and B on dominant shares
      – Each gets 2/3 of the dominant share
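The 2/3 figure in the second example can be reproduced by a simple progressive-filling loop: repeatedly give one more task to the user with the lowest dominant share, as long as that task still fits. The sketch below is a simplified variant written for illustration (it skips users whose next task no longer fits rather than stopping immediately); the function name and dictionary layout are assumptions.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Progressive filling: give the next task to the lowest dominant share."""
    used = {resource: 0 for resource in capacity}
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(Fraction(tasks[user] * demands[user][r], capacity[r])
                   for r in capacity)

    def fits(user):
        return all(used[r] + demands[user][r] <= capacity[r] for r in capacity)

    while True:
        candidates = [user for user in demands if fits(user)]
        if not candidates:
            return tasks, {user: dominant_share(user) for user in demands}
        user = min(candidates, key=dominant_share)
        for r in capacity:
            used[r] += demands[user][r]
        tasks[user] += 1
```

Run on the slide's example (9 CPUs and 18 GB, A needing 1 CPU + 4 GB per task and B needing 3 CPUs + 1 GB), the loop ends with 3 tasks for A (12 of 18 GB) and 2 tasks for B (6 of 9 CPUs), i.e. a dominant share of 2/3 each.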
Challenge for resource sharing in the Cloud • Applications use a variety of frameworks – Map-reduce, Dryad, Spark, Pregel, …
• Apps written in different frameworks need to share the data center resources simultaneously – 1000s of nodes, hundreds of “jobs”, each with millions of (small duration) tasks
• Need to maximize utilization while meeting QoS needs of the apps
Traditional approach to resource sharing • Static partitioning (cluster with 9 machines)
Optimal approach • Global schedule – Inputs • requirements of frameworks, resource availability, policies (e.g., fair sharing)
– Output • Global schedule for all the tasks
– Not practical • Too complex – Not scalable to the size of cluster and number of tasks – Difficult to implement in the presence of failures
Mesos goal
• Dynamic partitioning leading to high utilization
[Figure: utilization (0% to 100%) of a shared cluster under dynamic partitioning]
Mesos approach: fine-grained resource sharing
• A thin layer (a la micro-kernel), allocation module between the scheduler of the frameworks and the physical resources
[Figure: the framework schedulers (MPI scheduler for the MPI job, Hadoop scheduler for the Hadoop job) receive resource offers from the Mesos master, whose allocation module picks the framework to offer resources to; Mesos slaves run framework executors (e.g., MPI executor) that execute tasks]
Mesos: Resource offers • Make offers to the frameworks • Framework assigns tasks based on the offer • Framework can reject the offer if it does not meet its needs • Solution may not be optimal but resource management decisions can be at a fine grain and quick
Why does it work?
• Popular frameworks can scale up/down elastically
  – “resource offers” match this elasticity
• Task durations in most frameworks are homogeneous and short to facilitate load balancing and failure recovery
  – If an offer does not match, reject it and wait for the right offer
Hadoop YARN • Similar goal to Mesos – Resource sharing of the cluster for multiple frameworks
• Mesos is “offer” based from allocation module to the application frameworks • YARN is “request” based from the frameworks to the resource manager
YARN background
• Traditional Hadoop (open source implementation of Map-reduce)
  – Job-tracker/Task-tracker organization
  – Poor cluster utilization
    • Distinct map/reduce slots
[Figure: clients submit jobs to the Job Tracker; Task Trackers run tasks and report MapReduce status to the Job Tracker]
YARN Details
• Client submits app to RM
• RM launches AM in a container
• AM registers with RM at boot-up, requests ACs from the RM
• RM allocates ACs in concert with the NM
• AM contacts the NM to launch its ACs
• Upon application completion, AM shuts down
• YARN can implement different scheduling policies
  – FCFS, Fair, Capacity
RM: Resource Manager; NM: Node Manager; AM: Application Master; AC: Application Container
[Figure: clients submit jobs to the Resource Manager; Node Managers host containers and App Masters; arrows show MapReduce status, job submission, node status, and resource requests]
System Issues in Cloud Computing Resource Management for the Cloud: Google’s Borg Resource Manager KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Google’s BORG • This is Google’s resource manager • Similar in spirit to Yahoo’s YARN and Mesos – Provide data center resources for a variety of applications, written in a variety of frameworks, running simultaneously
[Figure: Search, Gmail, MapReduce, Maps, and Dremel run on top of Borg, which manages the datacenter resources (CPU, memory, disk, GPU, etc.).]
Borg terminologies • Cluster – a set of machines in a building
• Site – Collections of buildings
• Cell – a subset of machines in a single cluster – Unit of macro resource management
Borg priority bands • Production jobs – Higher priority jobs (e.g., long-running server jobs)
• Non-production jobs – Most batch jobs (e.g., web crawler)
• Production jobs higher priority • Non-overlapping “priority bands” – Monitoring, production, batch, best-effort
Borg Architecture and scheduling
• BorgMaster – Per-cell resource manager
• Borglet – Agent process per machine in a cell
• Scheduler – High-to-low priority jobs – Round robin within a band – Consults BorgMaster for resources – Feasibility checking – Scoring
• Job (Name, Owner, Constraints, Priority) – Tasks 1…N (Resources, Index) – Tasks run in containers
[Figure: Borg architecture. Config files, borgcfg, command-line tools, and web browsers talk to the BorgMaster of a cell; the BorgMaster and the scheduler manage the Borglets in that cell.]
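The two scheduling steps named on the slide, feasibility checking followed by scoring, can be sketched as below. This is a minimal illustration, not Borg's actual algorithm: the integer priority encoding, the best-fit scoring rule, and all names are assumptions, and round robin within a band is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpus: float
    free_mem_gb: float


@dataclass
class Task:
    job: str
    index: int
    priority: int  # larger value = higher priority band (assumed encoding)
    cpus: float
    mem_gb: float


def feasible(machine, task):
    """Feasibility checking: does the machine have enough free resources?"""
    return machine.free_cpus >= task.cpus and machine.free_mem_gb >= task.mem_gb


def score(machine, task):
    """Scoring: prefer the machine with the least leftover capacity (assumed policy)."""
    leftover = (machine.free_cpus - task.cpus) + (machine.free_mem_gb - task.mem_gb)
    return -leftover


def schedule(tasks, machines):
    """Place tasks from high to low priority; None means the task stays pending."""
    placement = {}
    # sorted() is stable, so tasks keep submission order within a priority band.
    for task in sorted(tasks, key=lambda t: -t.priority):
        candidates = [m for m in machines if feasible(m, task)]
        if not candidates:
            placement[(task.job, task.index)] = None
            continue
        best = max(candidates, key=lambda m: score(m, task))
        best.free_cpus -= task.cpus
        best.free_mem_gb -= task.mem_gb
        placement[(task.job, task.index)] = best.name
    return placement
```

Because higher-priority tasks are placed first, a production task gets the best-fitting machine even if a batch task was submitted earlier.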
Kubernetes • An open-source resource manager derived from Borg – Runs on Google cloud – Uses Docker containers • Resource isolation • Execution isolation
Mercury background • Centralized schedulers – E.g., Mesos, YARN, Borg – Pro • Simplifies cluster management via centralization
– Con • Single choke point => scalability and latency concerns
• Distributed schedulers – Independent scheduling decision making by jobs • Allows jobs to directly place tasks at worker nodes => less latency
– Con • Not globally optimal
Mercury insight • Combine virtues of both centralized and distributed – Resource guarantees of “centralized” – Latency for scheduling decisions of “distributed”
• Application choice – Trade performance guarantees for allocation latency
Mercury Architecture • Central scheduler – Policies/guarantees – Slower decision making
• Distributed scheduler – Fast/low-latency decisions
• AM specifies resource type – Guaranteed containers – Queueable containers
Mercury Architecture
[Figure: Mercury resource management framework, showing the App Master (which specifies a resource type), the Mercury Coordinator, the Central Scheduler, the Distributed Schedulers, and the Mercury Runtime instances.]
Resources for this Lecture
• Dominant Resource Fairness https://people.eecs.berkeley.edu/~alig/papers/drf.pdf
• Mesos https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf
• Hadoop YARN http://www.socc2013.org/home/program/a5-vavilapalli.pdf
• Large-scale cluster management at Google with Borg http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
• Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters https://www.usenix.org/system/files/conference/atc15/atc15-paper-karanasos.pdf
Concluding remarks • Most technologies used in the cloud are NOT fundamentally new. They are well-known distributed computing technologies. • Resource management and scheduling is an age-old problem. However, the scale at which data center schedulers have to work, the different stakeholders they have to cater to, and the twin goals of performance and resource utilization make building such distributed system software a non-trivial engineering exercise.
College of Computing, Georgia Institute of Technology
Workshop 6/Systems Workshop 1: Master Node in Map Reduce
In this module of the class, you are going to implement the base code for a fault-tolerant Master in the MapReduce framework. Additionally, you are going to create the handlers, interfaces, and scoreboard required for the Master. You will be using docker containers as nodes, with Go or C++ as your implementation language, and Kubernetes to orchestrate the whole thing.
1 Expected Outcome
The student will learn about:
• The data structures associated with the Master of the MapReduce framework.
• Implementing remote procedure calls (RPC) to execute code on remote computers (virtual machines), using the gRPC library.
• Leader election using etcd/ZooKeeper.
• Running the framework on containers built with Docker and deploying the containers using Kubernetes.
Specifically, you will:
1. Develop gRPC client and server applications in the C++ or Go programming languages.
2. Implement leader election in the applications using the distributed data store etcd or ZooKeeper.
3. Develop the applications to run on containers built with Docker and deploy the containers using Kubernetes.
2 Assumptions
This workshop assumes that the student knows how to program in C++ or Go and is using a computer (or a virtual machine) with Ubuntu as the operating system.
3 Background Information
This section goes through some basic concepts in Kubernetes, Helm and Kind that would be helpful for this module. It also briefly walks through setting up Go in your environment (applicable to those who choose to implement in Go). If you are familiar with these technologies, feel free to skip to the next section.
3.1 Kubernetes
In the NFV workshop, you used Docker containers as nodes in a network, treating each one as a lightweight VM. While this is sufficient for running single containers and a simple system, a distributed system that needs multiple containers running at once, with replication, failures, and communication between them, needs some system to coordinate and orchestrate those containers. (A specific example of this would be the orchestrator you built for the NFV project.) This is where Kubernetes comes in. Kubernetes is a system that automates the deployment, scaling, and management of containerized applications across a distributed system of hosts. For those who are unfamiliar with Kubernetes, it is crucial to understand how Kubernetes models an application.
Figure 3.1: Kubernetes Abstraction
The figure above shows a rough diagram of how Kubernetes functions. The lowest level of granularity in Kubernetes is a pod. A pod can be thought of as a single "VM" that runs one or more Docker containers. The images for containers run in pods are pulled from a public, private, or local container registry. You can think of a container registry as a repository of Docker images. Each physical node in a Kubernetes cluster can run multiple pods, each of which can in turn run multiple Docker containers. For simplicity, we recommend running a single Docker container in a pod for this module. Developers can connect to a Kubernetes cluster using the kubectl command line tool. Once connected, developers can deploy their application on the cluster via the command line and a YAML configuration file.
Figure 3.2: Kubernetes Objects
While Figure 3.1 explained the basic abstraction of a Kubernetes cluster, Kubernetes defines different objects that wrap around the basic concept of pods and are used in the YAML configuration file to set up your application. Figure 3.2 illustrates the Service, Deployment, and Replica Set objects. A replica set defines a configuration in which a pod is replicated a given number of times. If a pod in a replica set dies, the Kubernetes cluster will automatically spawn a new pod. A deployment object is a more general object that wraps around replica sets and provides declarative updates to pods along with a lot of other useful features. In general, replica sets are not explicitly defined in a Kubernetes configuration file; a deployment object that specifies the number of replicas for its pods will automatically set up a replica set. Finally, a Kubernetes service object can connect to multiple deployments. Since pods can fail and new replica pods can be added in a deployment, it would be difficult to interact with your application through deployments alone. A Kubernetes service acts as a single point of access to your application, so all underlying pods are clustered under a single IP address. There may be times when there is a need to access each pod underneath a service via its unique IP address. A headless service can be used in such cases. For this module, you will define your own Kubernetes YAML file for MapReduce. More information on how to actually write the YAML file can be seen in this document, or this YouTube video. We recommend reading through the workload and services sections of the Kubernetes document.
3.1.1 Helm and Kind
Now that you have a basic understanding of Kubernetes, we’ll introduce two Kubernetes technologies that you will use in this module. Helm is a package manager for Kubernetes. You can think of Helm as the "apt-get of Kubernetes". Using Helm, you can add public repositories of Kubernetes applications, which contain ready-built Kubernetes application configs, known as "charts". You can then deploy one of these public charts directly onto your own Kubernetes cluster. We will use Helm to deploy an etcd or ZooKeeper Kubernetes service onto our cluster. Kind is a local implementation of a Kubernetes cluster. Since Kubernetes is designed to run on a cluster of multiple hosts, it is somewhat difficult to work with locally, where you only have one host. Some clever developers have figured out a way to simulate a distributed environment using Docker, called Kubernetes in Docker (KIND). KIND will be used as your local Kubernetes cluster for testing your code. Helm and Kind will be installed using the provided install script.
3.2 Golang
3.2.1 Go Installation
For this module, we will be using Go version 1.17. You will have to install Go on your local development machine using this link.
3.2.2 Go Modules and Dependency Management System
Go’s dependency management system might be a bit confusing for new developers. In Golang, both packages and modules exist. A package in Golang describes a directory of .go files, whereas a module describes a collection of Go packages that are related to each other. Go modules allow users to track and manage the module’s dependencies and versioning, and allow users to create local packages to use within the module. In any directory, you can create a Go module like so:
$ cd <directory>
$ go mod init <module-name>
The command above will initialize the module in the directory and add a go.mod file, which tracks the name, version, and dependencies of the module. If you want to add a dependency to your module, you can either install it directly like so:
$ cd <directory>
$ go get <package>
Or you can import the package in your .go files with the assumption that it has already been installed, then run:
$ go mod tidy
This will automatically add the dependency into your go.mod file. Developers can choose to publish their modules to the public, and install modules from the public that they can use within their own module. For this workshop and subsequent MapReduce workshops, a mapreduce module should be created in your local directory, with master, worker and other related packages within it. The mapreduce module will depend on public grpc and etcd modules. This will be done by the install script as noted in section 6. More information on Go modules can be found here.
3.3 Useful References
• A Tour of Go
• Install Go
• Go Modules
• Concurrency in Go by Katherine Cox-Buday
• C++ Tutorial
• Thinking in C++ 2nd Edition by Bruce Eckel
• Modern C++ by Scott Meyers
• Kubernetes Concepts
• Kubernetes YAML video
4 Specification
Using Kubernetes, Docker, and etcd or ZooKeeper, you are going to implement a Fault-Tolerant Master node on your local machine.
5 Download Repo and Installation
5.0.1 C++ Users
Download the C++ starter code using the following.
$ sudo apt-get update
$ sudo apt-get install git
$ mkdir -p ~/src
$ cd ~/src
$ git clone https://github.gatech.edu/cs8803-SIC/workshop6-c.git
The downloaded git repository contains a bash script to install all the required dependencies. Run as follows:
$ cd ~/src/workshop6-c
$ chmod +x install.sh
$ ./install.sh
5.0.2 Go Users
Download the Go starter code using the following.
$ sudo apt-get update
$ sudo apt-get install git
$ mkdir -p ~/src
$ cd ~/src
$ git clone https://github.gatech.edu/cs8803-SIC/workshop6-go.git
The downloaded git repository contains a bash script to install all the required dependencies. Run as follows:
$ cd ~/src/workshop6-go
$ chmod +x install.sh
$ ./install.sh
6 Implementation
This workshop has six phases:
1. Setting up C++ or Go applications
2. Creating the required data structures for the Master
3. Using gRPC to create the RPC calls
4. Implementing leader election with etcd or ZooKeeper
5. Building containers
6. Deploying containers with Kubernetes
6.1 Setting up a gRPC Application
The first thing you are going to do is create a simple gRPC application. Please follow the C++ quickstart gRPC or Go quickstart gRPC docs to guide your development. Specifically, we would like you to create the gRPC server in the worker node and put the gRPC client in the master node. The server in the worker node should receive a string and return the string plus gatech. For example, if the input is hello, the server should return hello gatech. When the worker receives the call, it should log the input received using glog. There are a couple of glog implementations out there; just choose one that you like. If you are interested in learning how to set up your Go application directory structure in a standard way, please use this link. Please test your binaries to make sure that they function correctly before moving on to the next section.
6.2 Implementing Leader Election
Next, once the gRPC server and client have been created and can successfully exchange information, you are going to implement leader election. If you have opted to use C++, you are free to choose either etcd or ZooKeeper for your leader election implementation. If you’re using Go, we advise using etcd as it has been tried and tested.
6.2.1 Using etcd - C++ and Go
If you are using etcd, we recommend that you read about the API in the C++ etcd docs or etcd docs, and follow the C++ blog posts or Go blog posts on how to implement leader election. Make sure you understand what is happening under the hood; it will be discussed during the demo. In addition to the master nodes, you should also think about how you are registering your worker nodes. You don’t need to run an election for them, but saving some information in etcd might be a good idea. We recommend looking into etcd leases to potentially help with saving information for workers. Why could this be useful? Unfortunately, at this point in time you will not be able to test your code unless you start a local etcd cluster. If you would like to make sure your leader election works before proceeding to the next section, be our guest! We are certain you will learn something by setting up etcd to run locally on your machine.
6.2.2 Using ZooKeeper - C++ only
Selecting a leader is a really complex problem in distributed systems. Luckily, there are now frameworks that allow us to implement it more easily for certain scenarios (like this one). We are going to use ZooKeeper to implement the leader election. First, read about how ZooKeeper works here. An explanation (recipe) for implementing the leader election can be found here. The directory that is going to contain all the master nodes is going to be called /master. ZooKeeper works as a standalone service. Your Master code should connect to it using the C binding; to facilitate this, we are going to use a C++ wrapper that was already installed by the previous script. The Git repo can be found here. There are examples of how to use it (and compile it) in the directory examples/. Once the leader is elected, it needs to replicate each local state change to all the followers using RPC (to be implemented in following workshops). A state change cannot be committed until all the followers have responded. We are also going to use ZooKeeper to keep a list of the available worker nodes, using ephemeral nodes in the directory /workers (what is an ephemeral node?). Additionally, to know which master replica to contact, we are going to use the directory /masters and use the value with the lowest sequence (as explained in the recipe). If the wrong master is contacted, it should reply with an error and the address of the correct Master leader.
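The "lowest sequence wins" rule from the recipe can be sketched without a running ZooKeeper. The function below only models the decision each master makes after listing the children of the election directory; the `master_` name prefix is hypothetical, and the sketch relies on ZooKeeper appending a 10-digit, zero-padded counter to sequential znodes. Creating the ephemeral sequential znodes and setting watches would go through whichever client wrapper you use.

```python
def sequence_number(znode_name: str) -> int:
    """Extract the 10-digit counter that ZooKeeper appends to sequential znodes."""
    return int(znode_name[-10:])


def elect(children, me):
    """Return (leader, node_to_watch) for `me`, given the election znode's children.

    The child with the lowest sequence number is the leader. Every other
    participant watches only the child just before it, so a failure wakes
    up a single node instead of all of them. node_to_watch is None when
    `me` is the leader.
    """
    ordered = sorted(children, key=sequence_number)
    leader = ordered[0]
    position = ordered.index(me)
    watch = None if position == 0 else ordered[position - 1]
    return leader, watch
```

When the watched znode disappears (its owner crashed or resigned, so its ephemeral node was deleted), the watcher lists the children again and calls `elect` once more; it becomes leader only if it now holds the lowest sequence number.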
6.3 Building Containers
You can create your containers however you like, but we will give you hints on how to do it with Docker.
1. Create two Dockerfiles, Dockerfile.master and Dockerfile.worker (see the Dockerfile Reference).
2. Create a build script, ./build.sh, in the root directory.
3. What does your build script need to do? Here are some suggestions:
a) Generate your gRPC binaries
b) Generate your application binaries
c) Build your Docker images
6.3.1 Building Containers with C++
In C++, the container setup will be a little more heavyweight than with Go. You have two choices for creating C++ containers with your applications:
1. Statically linking applications
2. Dynamically linking applications (recommended)
The install script should build both the static libraries and the dynamic libraries for all dependencies, including ZooKeeper and etcd. It is your choice whether you build a container with the static or dynamic libraries. Currently, it is recommended that you copy the necessary shared libraries into your container.
6.4 Deploying Containers with Kubernetes
Now that you have the Docker images and binaries set up, it’s time to build your Kubernetes application. As noted in section 3, you will be using KIND to set up a local Kubernetes cluster.
1. Get KIND set up.
2. Create a cluster.
3. Write a Kubernetes YAML deployment file that will deploy 2 master pods and 1 worker pod (we recommend using one service and two deployments).
4. Deploy your application to the local KIND cluster.
5. Get logs for your pods.
Now is where the fun kicks in: it’s time to start wrestling with Kubernetes. Useful commands can be found below:
1. Create a Kubernetes namespace to specify where things should be.
$ kubectl create ns <namespace>
2. Use Helm to install the etcd chart onto your cluster.
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install -n <namespace> etcd bitnami/etcd --set auth.rbac.enabled=false
3. Use Helm to install the ZooKeeper chart onto your cluster.
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install <release-name> bitnami/zookeeper
4. Load Docker images to Kind.
$ kind load docker-image <master-image>
$ kind load docker-image <worker-image>
5. Deploy your application.
$ kubectl -n <namespace> apply -f <file>.yaml
6. Helpful kubectl commands:
$ kubectl get all -n <namespace>
$ kubectl -n <namespace> logs pod/<pod-name>
$ kubectl -n <namespace> delete pod/<pod-name>
Hints: You may need to pass dynamic variables into your master and worker pod replicas (IP address, etcd/ZooKeeper endpoint information, etc.). You can do this by setting environment variables of pods in your Kubernetes files and status files.
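The hint about passing dynamic values into pods can be sketched as follows. The example builds a Deployment and a headless Service as plain Python dictionaries; serialized with json.dumps, such a manifest can be given to kubectl apply -f, which accepts JSON as well as YAML. The image names, the ETCD_ENDPOINTS variable, and the port are hypothetical; POD_IP is filled in through the Kubernetes downward API (`status.podIP`).

```python
import json


def deployment(name, image, replicas, env):
    """Build an apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    env_list = [{"name": key, "value": value} for key, value in env.items()]
    # Downward API: Kubernetes sets POD_IP to the pod's own IP at start-up.
    env_list.append({
        "name": "POD_IP",
        "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
    })
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Images loaded with `kind load` exist only locally.
                        "imagePullPolicy": "Never",
                        "env": env_list,
                    }]
                },
            },
        },
    }


def headless_service(name, port):
    """Build a headless Service (clusterIP: None) selecting the pods of `name`."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "clusterIP": "None",
            "selector": {"app": name},
            "ports": [{"port": port}],
        },
    }


def manifest_json(items):
    """Wrap several objects in a v1 List so that one file holds all of them."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)
```

For example, `manifest_json([deployment("mr-master", "mr-master:dev", 2, {"ETCD_ENDPOINTS": "etcd:2379"}), deployment("mr-worker", "mr-worker:dev", 1, {"ETCD_ENDPOINTS": "etcd:2379"}), headless_service("mr-worker", 50051)])` produces the two deployments and one service recommended above. Hand-written YAML with the same fields works equally well.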
7 Useful References
• RPC Paper
8 Deliverables
Git repository with all the related code and a clear README file that explains how to compile and run the required scenarios. You will submit the repository and the commit id for the workshop in the comment section of the upload. You will continue using the same repository for the future workshops.
8.1 Demo
The demo for this workshop is as follows (to discuss with other students). You should be able to demo leader election using kubectl. This means:
1. Initiate a master deployment with two pods. One of the masters should be elected as leader. The leader should do an RPC call to the only worker with its address as the input, i.e., it should receive back its address followed by gatech.
2. Resign the current master. Now, the second master should become the new leader and do an RPC call to the worker with its address as the input.
3. The initially resigned master should rejoin the election. Kubernetes should automatically restart the initially resigned master.
4. Log output should be relevant to the resign/rejoin of the master (i.e., print the hostname or pod name unique to each master pod). The worker should properly log the input string along with the leader from which the request was received.
College of Computing, Georgia Institute of Technology
Workshop 7/Systems Workshop 2: MapReduce File System
In this module of the class, you are going to implement the required algorithms to handle the data movement in the execution of a Map-Reduce application.
1 Expected Outcome
The student is going to:
• Design and implement the base interface and functionalities for the underlying file system of MapReduce. It should handle the movement of data between the map and reduce phases, as well as from the reduce phase to the final result.
• Implement the algorithm that shards the data and distributes it among the M available resources.
2 Assumptions
This workshop assumes that the student knows how to program in C++ or Go and has completed the previous workshops of this class.
3 Before You Start
If you’re using C++, download the git repository for this workshop and run the script that installs the libraries needed to use Azure Blob storage. If you’re using Go, skip this section and refer here to install the Go package for Azure Blob.
$ mkdir -p ~/src
$ cd ~/src
$ git clone https://github.gatech.edu/cs8803-SIC/workshop7.git
$ cd workshop7
$ chmod +x install_library.sh
$ ./install_library.sh
3.1 Useful Programming References
• C++ Tutorial
• Thinking in C++ 2nd Edition by Bruce Eckel
• Modern C++ by Scott Meyers
• Go Packages
• An Introduction to Programming in Go by Caleb Doxsey
4 Specification
Using Azure Blob storage, the student is going to implement the required interface and functionalities for handling the files and intermediate results for MapReduce. The system should be able to:
• Distribute and give access to the input files to the mappers, sharding the input files as needed.
• Distribute the intermediate key/value pairs to the corresponding reducer.
• Extend the functionality implemented in the previous workshop to be able to distribute the data from the master to the workers.
4.1 Useful Azure Blob References
• Azure Blob storage
• How to use Blob Storage from C++
• How to use Blob Storage from Go
5 Implementation
5.1 Learning the Basics
First, you need to learn the basics of using Azure Blob Storage. Based on your language choice, follow the tutorial How to use Blob Storage from C++ or How to use Blob Storage from Go and understand how to upload, retrieve and delete data from Azure Blob.
5.2 What about GFS?
Reread the MapReduce paper and extract the interfaces that the master and workers need from the underlying filesystem. Write a specification document that details the required interactions between master, workers, and the file system. Discuss your specification with the other students.
5.3 Implementing the Interface
Some of the assumptions in the MapReduce paper don’t apply to the infrastructure that you are using for this workshop:
"We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS [8]) is stored on the local disks of the machines that make up our cluster. GFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data." (Jeffrey Dean and Sanjay Ghemawat, MapReduce, 2004)
Our system has no control over the location of the files that are being used. Additionally, the system is not required to minimize the bandwidth being used, given the bandwidth that is now available in the Azure infrastructure. The framework should instead strive to reduce the number of queries and the complexity of retrieving the data for each of the phases. Based on this new set of assumptions, implement the interfaces that you identified in section 5.2 using the Azure Storage Client Library for C++ or the Azure Storage Client Library for Go.
5.3.1 Azure Blob Storage
If you haven’t, take an hour to read the original MapReduce paper. This paper is the foundation of the code you will be implementing over the next two weeks. Before you start coding, plan out the interface your project is going to need with Azure Blob Storage. We would like you to:
• Be able to upload your input files to blob storage from the MapReduce client.
• Be able to read the input shards in the mapper worker nodes.
• Be able to write the intermediate output files of the mapper worker nodes to Azure Blob Storage.
• Be able to read the intermediate output files into the reducer worker nodes.
• Be able to write the final result files to Azure from the reducer worker nodes.
5.3.2 Suggestions
• Log often, using any logging package of your choice, so that debugging is easier later. Use the different logging levels (INFO, WARN, DEBUG, ERROR) so that the final executable is not slowed down by excessive logging. Make sure only relevant logs are displayed during the demo.
• Learn how to access ranges of the blobs saved in the storage, and think about how you can use this to implement the master functionalities.
5.4 Sharding
The next thing we would like you to do is implement a sharding algorithm. This sharding algorithm should:
1. Be able to accept multiple input files.
2. Have a configurable minimum and maximum shard size (optional).
3. Try to spread the input equally across all the mappers.
Keep in mind that you cannot assume that the master can hold the entire dataset in memory or in storage. Does it make sense to run the whole sharding algorithm in the master? Think about how sharding responsibilities should be distributed between the workers and the master. You can assume for this module that the shard size will be based on the number of map tasks, so M=20 would represent 20 shards.
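One way to meet these requirements without the master ever reading the data is to shard on metadata alone. The sketch below takes a list of (blob name, size) pairs and M, and returns M shards made of (blob name, offset, length) byte ranges; the function name and tuple layout are hypothetical, and the optional minimum/maximum shard size is not handled.

```python
def shard(files, num_shards):
    """Split files into num_shards byte-range shards of near-equal total size.

    files: list of (blob_name, size_in_bytes), taken from blob metadata only.
    Returns a list of shards; each shard is a list of (blob_name, offset, length).
    """
    total = sum(size for _, size in files)
    base, extra = divmod(total, num_shards)
    # The first `extra` shards take one more byte so that sizes differ by at most 1.
    targets = [base + (1 if i < extra else 0) for i in range(num_shards)]
    shards = [[] for _ in range(num_shards)]
    current = 0          # index of the shard currently being filled
    room = targets[0]    # bytes still missing from the current shard
    for name, size in files:
        offset = 0
        while offset < size:
            while room == 0:
                current += 1
                room = targets[current]
            take = min(room, size - offset)
            shards[current].append((name, offset, take))
            offset += take
            room -= take
    return shards
```

With two files of 100 and 50 bytes and M=4, the shards cover 38, 38, 37, and 37 bytes, and the third shard spans the end of the first file and the start of the second. A byte range can cut a line or word in half, so each mapper still has to align its range to record boundaries when it reads the range from blob storage; that part of the work naturally belongs to the workers.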
6 Deliverables
• The git repo that contains all the required code, and the commit id.
• A specification document that explains the interfaces that MapReduce expects from the underlying file system.
• Here is a list of the expected steps in the demo for next week:
– Explain the main interfaces that the master and worker require from the filesystem (GFS). Keep this explanation under 2 minutes.
– Explain the sharding algorithm for splitting the input file.
* The team should go over the code and explain the key components.
* This should be limited to a 5-minute explanation; please use your minutes wisely.
– Demo:
* Save the two largest files in the Gutenberg Dataset to Azure Blob Storage.
* The master node should shard these files and distribute the shard information to the worker nodes via gRPC.
* The (mapper) worker nodes should read these shards into memory, append hello to them, and then save each shard back to Azure.
* The (reducer) worker nodes should read the map output shards into memory, append gatech to them, and then save each shard back to Azure.
* The student should demonstrate that this works using Azure Storage Explorer.
College of Computing, Georgia Institute of Technology
Workshop 8/Systems Workshop 3: Worker Task Execution
In this module of the class, you are going to implement the required code to execute the map and reduce tasks on the worker. Use the code created in the previous workshop as a base for the implementation.
1 Expected Outcome
The student is going to:
• Deploy MapReduce applications running in Kubernetes to AKS.
• Develop a user interface to upload input files, mapper and reducer functions to Azure Blob storage.
• Create working mapper and reducer functions that execute user-submitted Python code inside the worker nodes.
• Design and implement the interfaces and functionalities for the execution of the map and reduce phases in the workers.
2 Assumptions
This workshop assumes that the student has successfully completed all the previous workshops in this module, and that the assumptions of those workshops still hold.
3 Download Relevant Environment Tools
Install the Azure CLI.
4 Specification
Your MapReduce implementation should be able to:
• Deploy your MapReduce cluster to Azure Kubernetes Service (AKS).
• Execute the map/reduce phase in any worker.
• After executing the map phase, sort the mapper result in place and store it in the corresponding location.
• Store the required information in the master to be able to fetch the required key/value pairs to execute the reduce phase in the corresponding worker.
• Store the final results into Azure Blob. You should be able to use this data as an input for a pipelined map/reduce computation.
5 Implementation
5.1 Deploying Resources
You need to deploy your Kubernetes cluster (which has been running locally up till now) to Azure Kubernetes Service. You need to create an AKS instance and use kubectl to deploy. Please consult the AKS Docs. Right now your container images are only available locally, so you need to push your images to Azure using Azure Container Registry and configure your Kubernetes deployment to use the correct images. Please consult the ACR Docs. You can configure access to both your local cluster running in KIND and your Azure deployment via the kubeconfig. Please use these docs to learn more.
5.1.1 Useful Links
• Kubernetes Walkthrough
• Configure Kubernetes KIND Cluster in Azure
5.2 User Interface for MapReduce
Creating a good user interface for software is an important aspect of developing a successful product. In this section, you are going to develop an interface to our MapReduce service. This interface can be developed using a standard protocol such as HTTP or gRPC. This interface should be the single endpoint, on the master or on a separate interface pod, that is exposed to the public network and allows users to submit (POST) a MapReduce job. It is up to you to specify how this interface works. Think about what kind of information you are going to need to POST and how you might use HTTP to transfer that information. The users themselves will upload the information to Azure Blob and trigger the job through the user interface. The framework itself should not be involved in uploading the information to Azure Blob Storage for consumption by the MapReduce job.
5.2.1 C++
We recommend checking out the http netlib library to accomplish this goal. gRPC is also a suitable (and already developed) option, if necessary.
5.2.2 Go
We recommend checking out the http package to accomplish this goal.
5.2.3 Readiness Probes
If you did not set up Kubernetes readiness probes in the first week, we recommend that you do so now. It will be a simple addition to this interface.
5.2.4 Useful Links
C++
• c++ netlib library
Go
• http package
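As a rough illustration of such an interface, the sketch below uses Python's standard http.server module. Everything about the contract is an assumption made for the example: the /jobs and /healthz paths, the required JSON fields, and the in-memory JOBS table. The validation is kept in a separate function so it can be exercised without opening a socket.

```python
import json
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed job schema: blob locations plus the number of map and reduce tasks.
REQUIRED_FIELDS = ("input", "mapper", "reducer", "num_map", "num_reduce")
JOBS = {}


def submit_job(raw_body: bytes):
    """Validate a job submission and return (http_status, response_dict)."""
    try:
        job = json.loads(raw_body)
    except ValueError:
        return 400, {"error": "body must be JSON"}
    if not isinstance(job, dict):
        return 400, {"error": "body must be a JSON object"}
    missing = [f for f in REQUIRED_FIELDS if f not in job]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}
    job_id = uuid.uuid4().hex
    JOBS[job_id] = job  # a real master would hand the job to its scheduler here
    return 202, {"job_id": job_id}


class JobHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # Target for a Kubernetes readiness probe (path is an assumption).
        if self.path == "/healthz":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/jobs":
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = submit_job(self.rfile.read(length))
        self._reply(status, body)


def serve(port=8080):
    """Block forever serving the interface; call this from the pod's entry point."""
    HTTPServer(("0.0.0.0", port), JobHandler).serve_forever()
```

The request body carries only references to data that the user has already uploaded to Azure Blob, which keeps the framework out of the upload path as required above. Returning 202 rather than 200 signals that the job was accepted but has not finished.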
5.3 PYTHON CODE WRAPPER
The map and reduce functions are going to be implemented in Python. The Python script receives each input value through standard input and writes the key-value pairs to standard output. Your worker functions need to feed the inputs to the Python scripts through stdin, start the execution of the code, and capture the output of the Python script.
5.3.1 C++
OPTION A
The map and reduce components are going to be implemented in Python, similar to the first workshop of this course. Your code needs to start the execution of the Python script, feed the inputs to it, and save the results from its output. To accomplish this task you are going to use four functions: pipe, execl, fork, and dup2. Using these functions you are going to implement a bidirectional pipe to communicate with the Python code.
SUGGESTIONS
• Discuss with the other students the corner and error cases that can arise when using the four suggested functions. How do we avoid deadlock scenarios, and how do we handle these situations?
OPTION B
Another possibility for calling the Python function is Extending Python with C, in which we use the functions declared in Python.h to call the Python function directly from C++.
SUGGESTIONS
• Discuss with the other students the benefits and drawbacks of using either option A or option B, and potentially suggest other options. You are free to choose a different way to run the mapper and reducer, but be sure to analyze the pros and cons of your solution.
USEFUL LINKS
• Calling Python Functions from C
• execl(3) - Linux man page
• pipe(2) - Linux man page
• fork(2) - Linux man page
• Piping for input/output
• Creating pipes in C
• Popen
5.3.2 GO
We recommend checking out the os/exec package to accomplish this goal. Consider the design decisions for these calls, such as timeouts, proper output parsing, etc.
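The data flow that the wrapper needs (write the inputs to the script's stdin, read its stdout, enforce a timeout) can be sketched in Python itself. This is only an illustration of the behaviour, not the C++ or Go implementation the workshop asks for. The helper name run_script, the tab-separated output format, and the inline WORD_MAPPER script are assumptions. subprocess.run writes the input and reads the output concurrently, which avoids the classic deadlock where parent and child each block on a full pipe.

```python
import subprocess
import sys


def run_script(argv, input_lines, timeout_s=30.0):
    """Feed input_lines to a child process on stdin and parse its stdout.

    Assumes the script prints one "key<TAB>value" pair per line.
    """
    payload = "".join(line + "\n" for line in input_lines)
    result = subprocess.run(
        argv,
        input=payload,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired and kills the child
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"script failed ({result.returncode}): {result.stderr.strip()}"
        )
    pairs = []
    for line in result.stdout.splitlines():
        if line:
            key, _, value = line.partition("\t")
            pairs.append((key, value))
    return pairs


# Hypothetical inline word-count mapper, used only to exercise run_script.
WORD_MAPPER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    for word in line.split():\n"
    "        print(word + '\\t1')\n"
)
```

Calling run_script([sys.executable, "-c", WORD_MAPPER], ["a b", "a"]) runs the mapper in a child interpreter and returns the emitted pairs. In a hand-written pipe/fork/dup2 version, the equivalent precaution is to close unused pipe ends and not to write all input before starting to read the output.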
5.4 SAVE INTERMEDIATE RESULT
Using the API created in the previous workshop, save the output created by the map phase into the intermediate storage. There should be R outputs created. The structure of this intermediate storage depends on the specification file presented as a deliverable for the previous workshop.
5.5 SUGGESTIONS
• Discuss with other students ways to store the intermediate results. Should they be in blob storage or in the local storage of the workers? Should there be M*R output files in total, or only R output files written with atomic append operations? If you are using local files, are you using Linux commands like scp to copy the files, or are you using RPC connections to the workers?
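Whichever storage layout you choose, each intermediate key has to be routed to one of the R outputs. The MapReduce paper's default is hash(key) mod R; the sketch below shows that rule in Python. The function names are assumptions, and zlib.crc32 is used only because it is stable across processes, whereas Python's built-in hash() for strings is randomized per process.

```python
import zlib


def partition_for(key, num_reducers):
    """Return the reduce partition (0 .. R-1) for an intermediate key."""
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    # crc32 gives the same value on every worker, so all occurrences of
    # a key end up in the same partition regardless of which mapper ran.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_pairs(pairs, num_reducers):
    """Split one map task's output into R buckets, one per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return buckets
```

With this rule a single map task produces R buckets, which is where the M*R versus R files question in the suggestion above comes from.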
5.6 SAVE FINAL RESULT
Using the API created in the previous workshop, save the output generated by the reduce phase into the final location.
5.6.1 SUGGESTIONS
• The final output should be usable as the input to another MapReduce job, so that executions can be chained into a pipeline.
6 DELIVERABLES
• The git repo that contains all the required code, and the commit id.
• A demo that shows you:
– Deploying your system to Azure Kubernetes.
– Configuring your kubectl CLI to point to the Azure cluster.
– Scaling your worker and master nodes via the kubectl CLI.
– Submitting a job via the user interface to your cluster.
– Showing the output of your MapReduce job.
7 USEFUL REFERENCES
• MapReduce paper
COLLEGE OF COMPUTING, GEORGIA INSTITUTE OF TECHNOLOGY
Systems Workshop 4/Project: Final Implementation Refinements
In this module of the class, you are going to implement the final details for a fully functional map reduce framework. Use the code created in the previous workshops as a base for the implementation.
1 EXPECTED OUTCOME
The student is going to:
• Implement a heartbeat between the master and workers.
• Handle worker failure.
• Replicate worker data.
2 ASSUMPTIONS
This workshop assumes that the student has successfully completed all the previous workshops in this class, and that the corresponding assumptions for those workshops hold.
3 SPECIFICATION
Your MapReduce implementation should be able to:
• Periodically ping the workers to obtain their status. This has to be implemented using gRPC, with the ping being initiated by the master leader.
• Have the master verify completion after a timeout and, if required, restart the task on a different worker.
• Have the master replicate its local data structures.
4 IMPLEMENTATION
4.1 HEARTBEAT
“The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed ... any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.” Jeffrey Dean and Sanjay Ghemawat, MapReduce (2004)
For this section of your code, you are going to implement an RPC between the master and each of the workers. Any reply is fine for the RPC, because the information is implied by the worker being able to respond at all. The master wants to verify that all the workers are healthy and working. Create a periodic function that pings each of the workers, and use a gRPC deadline to bound each call. All the ping calls need to be done in parallel, so remember to use locks for the data structures that contain the status of the workers.
The heartbeat may not be required, depending on your communication patterns. The intention behind the heartbeat (and this needs to be accomplished in your project) is to ensure that a specific task (be it a mapper or a reducer) is successfully completed. Etcd tells you that the worker is alive, but it doesn’t help you with a specific task’s progress. In MapReduce, there is a well-known problem of stragglers, where a specific task runs slowly on a worker for any of many reasons. A heartbeat with a timeout is one way to detect them. You could also implement this as a timeout on your gRPC call. If your call waits for the task to be completed, be really careful with the timeout you set on the gRPC call. If it is too short, long tasks (mainly reducer ones) may never finish, and your system may get stuck. E.g., if your timeout is 1 minute and your task takes 2 minutes, your system would continuously reassign the task without ever being able to finish it.
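The structure of one heartbeat round (parallel pings, a per-call deadline, and a lock around the shared status table) can be sketched as follows. This is a Python illustration under stated assumptions, not the gRPC code itself: the ping argument is a hypothetical callable standing in for your RPC stub, expected to raise an exception when the deadline expires or the worker is unreachable.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WorkerStatusTable:
    """Worker liveness map, guarded by a lock because pings run in parallel."""

    def __init__(self, worker_ids):
        self._lock = threading.Lock()
        self._alive = {worker_id: True for worker_id in worker_ids}

    def mark(self, worker_id, alive):
        with self._lock:
            self._alive[worker_id] = alive

    def snapshot(self):
        with self._lock:
            return dict(self._alive)


def heartbeat_round(table, ping, deadline_s=1.0):
    """Ping every worker in parallel and record who answered in time.

    `ping(worker_id, deadline_s)` is a hypothetical stand-in for the RPC;
    any exception (timeout, connection error) counts as a failed ping.
    """
    workers = list(table.snapshot())

    def check(worker_id):
        try:
            ping(worker_id, deadline_s)
            table.mark(worker_id, True)
        except Exception:
            table.mark(worker_id, False)

    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        list(pool.map(check, workers))  # wait for all pings to finish
    return table.snapshot()
```

A periodic timer on the master leader would call heartbeat_round and reset the tasks of any worker that is reported as not alive.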
Each task being performed has a timestamp recording when it started executing on a worker, together with the expected finish time for the execution. You are going to implement a periodic function that checks for any expired execution and changes the status of the task to idle so that it can be rescheduled. The timing for this operation does not need to be tight, so periodically checking all the tasks being processed is acceptable as long as the checking period is much shorter than the execution time. Use a method similar to the one used in 4.1.
4.1.1 USEFUL RESOURCES
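The periodic expiry check described above can be sketched as a single scan over the task table. The Task fields and the state names are assumptions for illustration; the current time is passed in as a parameter so the function is easy to test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    state: str = "idle"              # "idle", "in_progress" or "completed"
    worker: Optional[str] = None     # worker currently assigned, if any
    deadline: Optional[float] = None  # expected finish time (epoch seconds)


def expire_overdue_tasks(tasks, now):
    """Reset every in-progress task whose deadline has passed to idle.

    Returns the ids of the tasks that became eligible for rescheduling.
    """
    expired = []
    for task in tasks:
        overdue = task.deadline is not None and now >= task.deadline
        if task.state == "in_progress" and overdue:
            task.state = "idle"
            task.worker = None
            task.deadline = None
            expired.append(task.task_id)
    return expired
```

In the master this would run on a timer (for example with time.time() as now), under the same lock that protects the task table.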
4.2 MASTER DATA STRUCTURE REPLICATION
The last step to having a fully functional MapReduce framework is to replicate the state of the master. After the previous workshops, your master set is able to elect a new leader in case of failure; now we need to replicate the data structures from the leader to the followers. You are going to use one of the two methods below, or some combination of them, to perform the replication:
Etcd/ZooKeeper: Etcd/ZooKeeper can be used to store a limited amount of information in a replicated manner. Only after a failure would the next leader read the data structures from etcd/ZooKeeper.
RPC: The master leader sends an RPC message to each follower with the contents of the data structure to be modified. Each time a data structure needs to be modified, any request to the master that depends on this data structure needs to wait until the data has been persisted; only then does the master complete the corresponding call (an RPC in this scenario).
You need to decide which of these two strategies you are going to use to replicate the data structures. Remember that etcd/ZooKeeper may look easier, but it can only hold a limited amount of information, so you need to define theoretical bounds on the amount of information to be stored. If it is on the order of O(M) or O(R), that may be too much for etcd/ZooKeeper (M and R being the numbers of map and reduce tasks), but if it is on the order of O(W), W being the number of workers in the system, that is still acceptable. Additionally, think about which queries are in the critical path; for our purposes, doing an RPC to the followers is faster than sending a request to etcd/ZooKeeper.
You need to create a document of at most 1 page that explains how you handle each data structure that you are using and why you selected one way or the other, and then implement the specification.
4.2.1 SUGGESTIONS
• Discuss with your fellow students which approaches you are going to follow to complete the replication of the master state, and the reasons behind your decision.
4.3 WORKER SCALE OUT
Depending on the size of the job, it’s possible that you would want to scale out the number of workers.
5 DELIVERABLES
• The git repo that contains all the required code.
• The explanation of the replication decisions.
• A demo that shows:
– The log on the master with the heartbeat being performed and the status of each of the workers.
– Starting a MapReduce execution and killing a worker in the middle of either the map or the reduce phase, with the logs of the master selecting a new worker to execute the corresponding task. Finally, show that the result of the execution is correct.
– The logs in which the master leader replicates the data structures to the followers.
– The decisions made for replicating the master data structures.
5.1 EXTRA DETAILS
• Add an extra parameter "fail" to the worker executable. The "fail" parameter is an integer n that indicates that the worker will exit after receiving its n-th task, but before executing it. A value of 0 indicates that the worker continues forever. A value of 2 indicates that the worker completes its first task and fails after receiving the second task.
• Similarly, the master should also have an extra "fail" parameter. It indicates after how many assignments of tasks to workers the master will exit (or spin, depending on your implementation).
• It is OK for the master to delete its node before exiting, but no further clean-up should be performed.
• Modify your client to receive as a parameter the container in blob storage to be used in the MapReduce task. Depending on your current client implementation, you may need to change the client to download the list of files from blob storage and pass them to the master.
• The "fail" parameter is used to induce a failure in a component in order to evaluate how your framework handles failures; it works like a kill switch for the corresponding component.
5.1.1 BEFORE THE DEMO
• Download the Gutenberg dataset: Gutenberg Dataset
• Split the dataset into two datasets:
– A small one with 25 MB worth of files (approximately the 6 biggest files).
– A big one with 1 GB worth of files.
• Upload each dataset to a different container.
• Have a running framework with:
– At least 6 workers with the "fail" parameter set to 0.
– At least 1 master with the "fail" parameter set to 0.
– You are free to use as many masters or workers as you’d like, as long as you include the required master and worker nodes.
– A special worker and master with a non-zero "fail" parameter, which could be created as a separate Kubernetes deployment with its own configuration. Please have the yaml spec ready to start the deployment after the parameter is assigned. The deployment yaml should have a settable "fail" parameter for 1 master and 1 worker, to be specified by the TA.
– After setting up all of them, you should have 7 workers and 2 masters.
5.1.2 DURING THE DEMO
• Start the prepared deployment (one worker and one master) with the "fail" parameter given by the TA.
• Explain the decisions made for replicating the master data structures. Keep this explanation under 2 minutes.
• Explain your design for the timeout/heartbeat for detecting task failure/stragglers. Keep this explanation under 2 minutes.
• Run the word counter application with the small 25 MB dataset with M = 20 and R = 5.
– Show how it handled the failure of the master.
– Show how it handled the failure of the worker.
• Run the word counter application with the big 1 GB dataset with M = 100 and R = 20. Show some of the output files.
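The semantics of the worker's "fail" parameter from the extra details (exit after receiving the n-th task but before executing it, with 0 meaning never) can be sketched as a small counter. The --fail flag spelling and the class name FailSwitch are assumptions; the handout only fixes the behaviour.

```python
import argparse


def parse_fail(argv):
    """Read the hypothetical --fail flag; 0 (the default) means never fail."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fail", type=int, default=0)
    return parser.parse_args(argv).fail


class FailSwitch:
    """Kill switch that trips when the n-th task is received."""

    def __init__(self, fail_after):
        if fail_after < 0:
            raise ValueError("fail must be 0 or a positive integer")
        self.fail_after = fail_after
        self.received = 0

    def on_task_received(self):
        """Count one received task; return True if the worker must exit now,
        that is, before executing the task it has just received."""
        self.received += 1
        return self.fail_after != 0 and self.received >= self.fail_after
```

A worker would call on_task_received() at the top of its task handler and exit immediately (for example with os._exit(1)) when it returns True, so that no clean-up code runs.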
6 USEFUL REFERENCES
• MapReduce paper
Workshop #1: MapReduce in Azure
In this module of the class, you are going to implement well-known applications on top of Azure Map Reduce.
1 EXPECTED OUTCOME
The student will learn about:
• The code structure of MapReduce applications.
• The main components of the MapReduce framework.
• How to upload files into Azure Blob storage.
• How to manage and control HDInsight clusters when running MapReduce tasks.
2 ASSUMPTIONS
This workshop assumes that the student knows the basics of Python programming.
2.1 USEFUL REFERENCES
• Python Tutorial
3 CREATING AN HDINSIGHT CLUSTER
One way to run MapReduce applications on Azure is to use an HDInsight cluster. In this section of the workshop you are going to create your own cluster, following the tutorial: Create Linux-based clusters in HDInsight using the Azure portal.
For the cluster type select:
  Cluster Type          Hadoop
  Operating System      Linux
  Version               2.7.0
  Cluster Tier          Standard
For the data source:
  Storage account       storage
  Default container     container
  Location              East US
  Cluster AAD Identity  Leave not configured
For the pricing:
  Worker Nodes          4
  Worker node size      A3 (remember to press View all)
  Head node size        A3
Finally create the resource group as resource. It may take up to 20 minutes to instantiate a new HDInsight cluster.
If you want to learn more about Hadoop in HDInsight, look at these articles:
• What is Hadoop in the cloud? An introduction to the Hadoop ecosystem in HDInsight
• Create Linux-based Hadoop clusters in HDInsight
• HDInsight documentation
4 RUN THE EXAMPLE CODE
Now we are going to run one sample application that is already located in the cluster. Run one of the applications shown in the tutorial Run the Hadoop samples in HDInsight. We suggest you run wordcount, but you can run any sample that seems interesting to you.
5 UPLOAD YOUR FILES
Most of the time, the data that you want to process is not initially located in Azure Storage. We are going to use the following dataset, which contains information about movie ratings: MovieLens Dataset. You need to upload the required files using the following tutorial: Upload data for Hadoop jobs in HDInsight. Once you have uploaded your file using the Azure CLI, you will be able to access it as follows:
wasbs://<container name>@<storage account name>.blob.core.windows.net/<path to file>
You can learn more about Azure Blob Storage in: How to use Blob Storage from C++ and Azure Blob storage with Hadoop in HDInsight
5.1 SUGGESTION
Upload the small MovieLens Latest Datasets first so you can keep working on the rest of the workshop, and try to upload the big datasets just once for the final tests.
6 IMPLEMENTATION
Now we are going to implement three different MapReduce applications that are going to be used as a pipeline to process the MovieLens dataset. First, understand how to program MapReduce applications using Python by reading the following tutorial: Develop Python streaming programs for HDInsight. More information can be found in the tutorial: Writing an Hadoop MapReduce Program in Python
6.1 CALCULATE THE AVERAGE OF THE MOVIES
The first MapReduce program that you are going to implement calculates the average rating of each movie in the dataset. The output of this stage should be [<movie id>, <average rating>]. The average should be calculated to two decimal places.
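A minimal sketch of this stage in the streaming style follows. It assumes the ratings file is tab-separated with the columns user id, movie id, rating, timestamp (the u.data layout of the MovieLens 100K set); adjust the parsing if your files differ. The function names are illustrative, and in a real streaming job each function would live in its own script reading sys.stdin and printing to sys.stdout.

```python
from itertools import groupby


def map_ratings(lines):
    """Mapper: emit "movie_id<TAB>rating" for every rating record.

    Assumes tab-separated input: user id, movie id, rating, timestamp.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            yield f"{fields[1]}\t{fields[2]}"


def reduce_averages(sorted_lines):
    """Reducer: input is sorted by movie id, so each movie's ratings are
    adjacent and can be averaged one group at a time."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for movie_id, group in groupby(parsed, key=lambda pair: pair[0]):
        ratings = [float(rating) for _, rating in group]
        yield f"{movie_id}\t{sum(ratings) / len(ratings):.2f}"
```

Because the framework sorts the mapper output by key before the reduce phase, the reducer only ever holds the ratings of one movie in memory.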
6.2 JOIN
The second task is to join the movie id with the actual movie name. Movie ids are of no use to people reading the statistics that you are generating, so we need to join the two datasets to produce a more useful output. The inputs to this task are two files: u.item and the output of section 6.1. The output of this task should be [<movie name>, <average rating>]. You need to implement this efficiently, e.g. do not cache all items in each worker.
6.2.1 USEFUL REFERENCES
• Map Reduce Joins
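One common way to join without caching all items is a reduce-side join: the mapper tags each record with its source, and the sort order places the title record for a movie id right before its average. The sketch below assumes u.item is pipe-separated with the movie id and title in the first two fields, and that the framework sorts on both the id and the tag (in Hadoop streaming this typically requires configuring the sort key fields). The tag letters and function names are illustrative.

```python
def tag_item(line):
    """Mapper for u.item: assumes "movie id|title|..." (pipe-separated)."""
    fields = line.rstrip("\n").split("|")
    return f"{fields[0]}\tA\t{fields[1]}"


def tag_average(line):
    """Mapper for the 6.1 output: "movie id<TAB>average"."""
    movie_id, average = line.rstrip("\n").split("\t")
    return f"{movie_id}\tB\t{average}"


def reduce_join(sorted_lines):
    """Reducer: tag A (title) sorts before tag B (average) for the same id,
    so only the current movie's title is held in memory."""
    current_id, current_title = None, None
    for line in sorted_lines:
        movie_id, tag, value = line.rstrip("\n").split("\t", 2)
        if tag == "A":
            current_id, current_title = movie_id, value
        elif movie_id == current_id:
            yield f"{current_title}\t{value}"
```

Movies that have a title but no ratings, or ratings but no title, are simply dropped by this sketch (an inner join).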
6.3 SORT THE AVERAGE OF THE MOVIES
The last task is to sort the movies by their average. The input to this stage is the output of section 6.2. The output of this stage should be [<movie name>, <average rating>] in decreasing order of average value.
7 DELIVERABLES
After you have finished this workshop, you should have the following documents completed:
• The map and reduce source files for each of the three stages.
• The output of each of the stages.
Share your results with other students and compare the values obtained. For the next workshop, you need to read and thoroughly understand the MapReduce paper included in the references. Additionally, try to answer the following questions:
• What are the main functionalities that the master has to support in order to distribute the work among the workers?
• What happens when there are slow machines in the system?
Discuss your answers with other students.
8 REFERENCES
• MapReduce paper
System Issues in Cloud Computing Cloud Applications: Developing Large Apps for the Cloud KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
App developer’s toolkit on the Cloud • Frameworks – json, j2ee, .NET, REST, …
• Webserver – Apache, MS IIS, IBM IHS, …
• Appserver – Websphere, Weblogic, …
• DB server – MySQL, SQL, DB2, Oracle, …
• NoSQL DBs – MongoDB, Hadoop, ….
• Development Environment – Eclipse, Rational, Visual Studio, …
• Configuration/Deployment – Chef, Puppet, …
• Lifecycle management – CVS, Subversion, …
• Testing – LoadRunner, Rational, …
System Issues in Cloud Computing Cloud Applications: Best Practices
Best Practices for Scalable Cloud App Development
• Stateless design
• Loose coupling of app components
• Async communication
• Choice of DB
• Design patterns
Best Practices for Ensuring Reliability and Availability
• Ensuring no single point of failure
• Automate failure actions
• Graceful degradation
• Liberal use of logging
• Replication
Best Practices for Ensuring Security • Authentication • Authorization – Identity Management and Access Management
• Data Security – Stored and in-flight data
• Encryption and Key management • Auditing
Best Practices for Ensuring Performance • Identify metrics of importance – E.g., response time, throughput, false positives, false negatives
• Workload modeling • Stress testing – Identify bottlenecks
Best Practices for Endurance • Design for testing and verification – Built in mechanisms for independently measuring the health of components – Automated mechanisms for gathering and reporting health and performance statistics
• Design for future evolution – Incremental addition of new features – Incremental testing of upgrades – Automated maintenance
System Issues in Cloud Computing Cloud Applications: New normal
Why Cloud has become the new normal?
• Confluence of several factors
– Distributed computing
– Grid computing to democratize HPC use by scientists
– WWW
– Gigabit networks
– Appearance of clusters as a parallel workhorse
– Virtualization
– Enterprise transformation
– Supply chain
– B2B commerce
– Business opportunity for computing as a utility service
Why Cloud has become the new normal? • Concrete evidence of success – Driving down costs and improving accuracy • E.g.: FDA – manual reports to machine readable reports » 99.7% accuracy » Cost down from $29 per page to $0.25 per page
– Data security • Organizations can benefit from the “sledge hammer” of Cloud service providers rather than internal IT teams
– Spurring innovation • Cost of experimentation is small
System Issues in Cloud Computing Cloud Applications: Designing for scale
Designing for Big Scale • Gap between expectation and reality – Expectation: Cloud resources are infinite – Reality: Cloud resources have finite capacity • Number of CPUs, network bandwidth, etc.
• Implications – Design for “scale out” – Capacity planning is crucial – Carefully understand “scale unit” for your app • Multiple smaller deployments • E.g., one instance for every 100 cores
– Automate provisioning resources commensurate with the scale unit for your app
A real world example • Azure’s solution for financial risk modeling • Pathological example – Cost of insuring the entire world population • Model: stochastic analysis of the insurance cost • Simulation time: 19 years on a single core!
– Azure’s solution • Runs on 100,000 geo-distributed cores without changing the programming model
System Issues in Cloud Computing Cloud Applications: Migrating to the cloud KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Challenges for Cloud Migration • Service parameters – Latency, uptime/availability – Security – Legal issues
• Enterprise services are complicated – N-tier (front-end, business logic, back-end) – Replication and load balancing – Intra-org firewalls
• Solution – Hybrid implementation • Private plus public
– Construct a model of cost/benefit of migrating each component
Resources for this Lecture
• Building blocks for Cloud Apps https://apievangelist.com/2014/03/11/common-building-blocks-of-cloud-apis/
• Cloud is the new normal http://www.washingtonpost.com/sf/brand-connect/amazonwebservices/cloud-is-the-new-normal/
• Designing for Big Scale https://blogs.msdn.microsoft.com/kdot/2014/10/09/designing-for-big-scale-in-azure/
• What would you do with 100,000 cores https://azure.microsoft.com/en-us/blog/what-would-you-do-with-100000-cores-big-compute-at-global-scale/
• Cloud Design Patterns from Microsoft https://www.microsoft.com/en-us/download/details.aspx?id=42026
• AWS reference architectures https://aws.amazon.com/architecture/
• Migrating to the cloud https://www.microsoft.com/en-us/research/publication/cloudward-bound-planning-for-beneficial-migration-ofenterprise-applications-to-the-cloud/
• Cloud Computing: A Hands-On Approach http://www.cloudcomputingbook.info
Additional Resources Case studies in cloud computing • https://www.gartner.com/doc/1761616/case-studies-cloud-computing • https://www.aig.com/content/dam/aig/americacanada/us/documents/brochure/iot-case-studies-companies-leading-theconnected-economy-digital-report.pdf • https://www.troyhunt.com/scaling-standard-azure-website-to-380k/ Hybrid Clouds • http://www.intel.com/content/www/us/en/cloud-computing/cloudcomputing-what-is-hybrid-cloud-101-paper.html • https://aws.amazon.com/enterprise/hybrid/ • https://www.microsoft.com/en-us/download/details.aspx?id=30325
CLOUD DESIGN PATTERNS EXAMPLES (MICROSOFT)
https://www.microsoft.com/en-us/download/details.aspx?id=42026
Competing Consumers Pattern (Page 28)
Federated Identity Pattern (Page 67)
Health Endpoint Monitoring Pattern (Page 75)
Pipes and Filters Pattern
Scheduler Agent Supervisor Pattern (Page 132)
AWS reference app structures
3-tier Architecture
Advertisement serving (http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_adserving_06.pdf)
Time Series Processing (http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_timeseriesprocessing_16.pdf)
Content Delivery (http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_media_02.pdf)
Data Intensive (http://media.amazonwebservices.com/architecturecenter/AWS_ac_ra_largescale_05.pdf)
System Issues in Cloud Computing Cloud Applications: Resiliency, Upgrades, elasticity KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
System Issues in Cloud Computing Cloud Applications: Resilience
Resilience and replication • Key to resilience is having redundant replicas • Given failures, how do we know which replicas are up to date? • Replica management is fundamental to cloud computing – Replicated resource managers • E.g., Azure Service Fabric, Google’s Borg, Apache’s Mesos
– Replicated storage servers • E.g. Google BigTable, Oracle NoSQL storage server
Types of Failures • Byzantine – Upon failure start sending spurious information to healthy nodes
• Fail stop – Upon failure just shut up!
How much redundancy is needed? • Byzantine – To tolerate “t” failures, we need 2t+1 replicas
• Fail stop – To tolerate “t” failures, we need t+1 replicas
State maintenance with replicas • We want update progress in the presence of failures – Allow updates to happen without waiting for ALL copies to ack
• Quorum consensus protocols
– Read quorum (Qr) • Read “r” copies
– Write quorum (Qw) • Write “w” copies
– If N is the total number of servers, for correctness
• Qr+Qw > N – Ensures that the read quorum and the write quorum overlap
• Qw+Qw > N – Ensures that there is no concurrent update to the same data item
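The two quorum conditions above can be checked mechanically. The sketch below is a small Python illustration; the function name and the range checks on the quorum sizes are assumptions added for the example.

```python
def quorums_are_valid(n, q_read, q_write):
    """Check the two quorum conditions for N replicas.

    q_read + q_write > n : every read quorum overlaps every write quorum,
                           so a read always sees the latest write.
    2 * q_write > n      : any two write quorums overlap, so two updates
                           to the same item cannot proceed concurrently.
    """
    if not (1 <= q_read <= n and 1 <= q_write <= n):
        return False
    return q_read + q_write > n and 2 * q_write > n


# All valid (Qr, Qw) choices for N = 5 replicas.
VALID_FOR_FIVE = [
    (r, w) for r in range(1, 6) for w in range(1, 6) if quorums_are_valid(5, r, w)
]
```

For N = 5, choosing Qr = 1 forces Qw = 5 (cheap reads, expensive writes), while Qr = Qw = 3 balances the two and still makes progress with two replicas down.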
Paxos algorithm (due to Lamport) • An elegant formulation of quorum consensus system – Roles played by the members of the system • Proposer, coordinator, acceptor, learner
• Proposer – Think of this as an application command to perform an action that is “durable”: e.g., write (x, “v”)
• Coordinator (assume there is only one, works even if there are many) – Phase 1: send “prepare to accept” • Receive “ready to accept” ACKS, or “not accept” NACKs from Acceptors
– Phase 2: send “accept” with the actual command • Command is performed by the Learners who execute the command
– If insufficient ACKS, then go back to Phase 1
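The two phases above can be sketched as a single-decree, in-memory simulation. This is a simplified illustration and not a production Paxos: there is no networking, no failure handling, and no separate learner role, and all names (Acceptor, run_round, the proposal number n) are assumptions for the example. It does follow the standard rule that a coordinator must adopt the value of the highest-numbered proposal already accepted by any acceptor in its Phase 1 quorum.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Acceptor:
    promised: int = -1        # highest proposal number promised so far
    accepted_n: int = -1      # number of the proposal accepted, if any
    accepted_value: Any = None

    def prepare(self, n):
        """Phase 1: ACK ("ready to accept") if n is the highest number seen,
        reporting any value already accepted; otherwise NACK."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def run_round(acceptors, n, value):
    """One coordinator attempt with proposal number n.

    Returns the chosen value, or None if a majority was not reached
    (the coordinator would then retry Phase 1 with a higher n).
    """
    majority = len(acceptors) // 2 + 1
    replies = [acceptor.prepare(n) for acceptor in acceptors]
    acks = [(acc_n, acc_value) for ok, acc_n, acc_value in replies if ok]
    if len(acks) < majority:
        return None
    # Adopt the value of the highest-numbered accepted proposal, if any.
    prev_n, prev_value = max(acks, key=lambda reply: reply[0])
    if prev_n >= 0:
        value = prev_value
    accepted = sum(acceptor.accept(n, value) for acceptor in acceptors)
    return value if accepted >= majority else None
```

Running a second round with a higher number and a different command returns the value chosen in the first round, which is the property that makes the decision durable.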
Paxos in use
• First used in DEC SRC Petal storage system
• With the advent of Cloud computing wide use
– Google Borg
– Azure Service Fabric
– Facebook Cassandra
– Amazon web services
– Apache Zookeeper
– ……
System Issues in Cloud Computing Cloud Applications: Upgrades
Availability basics • Availability – Fraction of time system is up – Availability may be subject to contractual obligations • E.g.: an ISP may contractually promise 99.9% uptime in the SLA for apps
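To make the SLA figure concrete, the availability fraction can be turned into an allowed downtime budget. The sketch below assumes a 365-day year; the function name is illustrative.

```python
def allowed_downtime_hours(availability, period_hours=365 * 24):
    """Downtime budget implied by an availability fraction over a period.

    availability is the fraction of time the system is up (e.g. 0.999).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours
```

With these assumptions, a 99.9% uptime promise leaves about 8.76 hours of downtime per year, and each additional nine divides that budget by ten.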
Availability Vs. Upgrade Conundrum • Software upgrades – Rolling upgrade usually recommended to avoid service downtimes • Applicable only if changes maintain backward compatibility
• Upgrades in general – Affects availability guarantees – Needs to be carefully orchestrated • Upgrade agility for competitive edge • Respecting SLAs for business apps
System Issues in Cloud Computing Cloud Applications: Elasticity
Elasticity • Provisioning and de-provisioning system resources in an autonomic manner • Dimensions – Speed • Time to switch from under-provisioned to optimal, or • Over-provisioned to optimal
– Precision • Deviation of new allocation from actual demand
Implementing Elasticity • Proactive cyclic scaling – Scaling at fixed intervals (daily, weekly, monthly, quarterly)
• Proactive event-based scaling – Scaling due to expected surges (e.g., product launch, marketing campaigns)
• Auto-scaling on demand – Monitor key metrics (server utilization, I/O bandwidth, network bandwidth) and trigger scaling
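Auto-scaling on demand can be sketched as a rule that maps a monitored metric to a desired instance count. The proportional rule below is one simple possibility and not a description of any particular provider's autoscaler; the function name, the use of integer percentages, and the default bounds are assumptions.

```python
def desired_instances(current, utilization_pct, target_pct,
                      min_instances=1, max_instances=20):
    """Scale the instance count so that utilization moves toward the target.

    Uses integer percentages and integer ceiling division to avoid
    floating-point rounding surprises at exact multiples.
    """
    if current < 1 or target_pct < 1:
        raise ValueError("current and target_pct must be positive")
    # ceil(current * utilization / target) using integer arithmetic
    wanted = -(-(current * utilization_pct) // target_pct)
    return max(min_instances, min(max_instances, wanted))
```

For example, 4 instances at 90% utilization with a 60% target call for 6 instances. The speed and precision dimensions of elasticity then correspond to how quickly the new instances come up and how close the resulting allocation is to actual demand.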
Application Tuning for Elasticity • Identify app components or layers that can benefit from elastic scaling • Design chosen app components for elastic scaling • Evaluate the impact of elasticity on the overall system architecture
Resources for this Lecture
• Paxos Made Simple (Leslie Lamport)
– http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
• Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
– https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf
• Cloud Software Upgrades: Challenges and Opportunities
– https://www.umiacs.umd.edu/~tdumitra/papers/MESOCA-2011.pdf
• Elasticity in Cloud Computing: What It Is, and What It Is Not
– https://www.usenix.org/system/files/conference/icac13/icac13_herbst.pdf
• Best practices for elasticity
– https://aws.amazon.com/articles/1636185810492479
– https://media.amazonwebservices.com/AWS_Cloud_Best_Practices.pdf
System Issues in Cloud Computing Cloud Applications: Emerging Applications KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Opening Headshot Welcome to the third lecture in this mini course, which explores emerging applications in the Cloud. We start the discussion with streaming apps on structured data, and then move on to discussion of apps that work on amorphous streaming data. We conclude the lecture with Internet of Things and situation awareness apps that can well turn out to be a new disruption to the utility model of cloud computing.
System Issues in Cloud Computing Cloud Applications: Streaming Applications with Structured Data KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Stream processing fundamentals • Raw stream data (e.g., stock prices, tweets, facebook posts, news feeds, etc.) pushed to the cloud • App in the cloud organizes the raw data into meaningful events • App entertains continuous queries from users (e.g., sending stock ticker, twitter feeds, facebook feeds, etc.)
Stream processing of structured data • Aurora and Medusa (MIT, Brown, Brandeis) – Structured data streams as input data – Queries built from well-defined operators • E.g., select, aggregate • Operator accepts input streams and produces output streams
– Stream processor • Schedules operators respecting QoS constraints
• Medusa is a federated version of Aurora • TelegraphCQ (Berkeley) – Similar in goals and principles to Aurora
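The idea that each operator accepts input streams and produces output streams maps naturally onto Python generators. The sketch below is only an illustration of composing a select and an aggregate operator; it is not Aurora's actual operator set or API, and the tuple fields (symbol, price) and the count-based tumbling window are assumptions.

```python
from collections import defaultdict


def select(stream, predicate):
    """Select operator: pass through only the tuples that match."""
    for tup in stream:
        if predicate(tup):
            yield tup


def aggregate(stream, key_field, value_field, window):
    """Aggregate operator: average value_field per key over tumbling
    windows of `window` tuples, emitting one output tuple per full window."""
    buffers = defaultdict(list)
    for tup in stream:
        buffer = buffers[tup[key_field]]
        buffer.append(tup[value_field])
        if len(buffer) == window:
            yield {key_field: tup[key_field], "avg": sum(buffer) / window}
            buffer.clear()
```

Because both operators are lazy, chaining them as aggregate(select(ticks, ...), "symbol", "price", 2) processes tuples one at a time as they arrive, which is the continuous-query behaviour described above.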
Concrete Example: tweets
• Real-time streaming
• Input
– Raw tweets
• Output
– Tweet feeds to users
• Scale with number of tweets and number of followers
• Compute real time active user counts (RTAC)
– Measure real-time engagement of users to tweets and ads
• “Computation on the feeds”
– Data transformation, filtering, joining, aggregation across twitter streams
– Machine learning algorithms over streaming data
• Regression, association, clustering
– Goals
• Offer user services, revenue, growth, search, content discovery
Heron architecture for twitter • Topology: Application graph (DAG) – Vertices: computation – Edges: data tuples streams – Typically 3-8 stages of processing steps
• Architecture – Central scheduler accepts topologies – Launches topologies on to Zookeeper cluster resources using Mesos • Each topology runs as a Mesos job with multiple containers – Master, stream manager, metrics manager, Heron instances (app logic)
System Issues in Cloud Computing Cloud Applications: Fault tolerance for Streaming Applications KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
General requirements for streaming apps
• Real time processing
• Persistent state
• Out of order data
• Constant latency as you scale up
• Exactly-once semantics
Rather than reinventing these for each app, can we have a generic framework for MANY if not ALL apps?
• Spam filtering
• Intrusion detection
• Ad customization
• ….
MillWheel
• Applications considered as directed graph
– User defined topology for connecting the nodes
– Nodes are computations
– Edges are “intermediate results” called records delivered from one computational component of the app to the next
• Focus of MillWheel is fault tolerance at the framework level
– Any node or any edge can fail without affecting the correctness of the app
– Guaranteed delivery of records
– API guarantees idempotent handling of record processing
• App can assume that a record is delivered “exactly once”
– Automatic checkpointing at fine granularity
• MillWheel (just like map/reduce for batch jobs) handles all the heavy lifting for streaming jobs
– Scalable execution of the app with fault tolerance
– Dynamically grow and shrink the graph
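The "exactly once" appearance on top of guaranteed (and therefore possibly repeated) delivery comes from idempotent handling: a redelivered record must not change state a second time. The sketch below illustrates that idea with an in-memory set of record ids; it is not MillWheel's API, the class and method names are assumptions, and a real framework would persist the ids and the state together at a checkpoint.

```python
class DedupingCounter:
    """Counts records per key while ignoring redelivered records."""

    def __init__(self):
        self.seen_ids = set()  # ids of records already applied
        self.counts = {}       # per-key state updated by the computation

    def process(self, record_id, key):
        """Apply a record once; return False if it was a duplicate."""
        if record_id in self.seen_ids:
            return False  # retry of a record that was already applied
        self.seen_ids.add(record_id)
        self.counts[key] = self.counts.get(key, 0) + 1
        return True
```

With this in place the sender can retry a record until it is acknowledged, and the computation's state still reflects each record exactly once.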
System Issues in Cloud Computing Cloud Applications: Streaming Applications with Amorphous Data KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Stream processing of amorphous data • Stream-based programming models – IBM System S, Slipstream, Stampede, TargetContainer
System Issues in Cloud Computing Cloud Applications: Internet of Things and Situation Awareness KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Rise of Internet of Things
A Broad Set of IoT Applications • Energy Saving (I2E) • Defense • Predictive maintenance • Enable New Knowledge • Industrial Automation • Intelligent Buildings • Enhance Safety & Security • Transportation and Connected Vehicles • Healthcare • Agriculture • Smart Home • Smart Grid
Thanks to CISCO for this slide
Future Internet Applications on IoT • Common Characteristics • Dealing with real-world data streams • Real-time interaction among mobile devices • Wide-area analytics
• Requirements • Dynamic scalability • Low-latency communication • Efficient in-network processing
[Figure labels: Infrastructure Overhead, Cognitive Overhead, False Positives, False Negatives]
Concluding Thoughts • Emerging Applications pose new challenges • Streaming apps on the Cloud thus far – Human in the loop – Latency at “human perception” speeds
• Future – Machine to machine communication • E.g., Connected cars
– Latency at “machine perception” speeds
Resources for this Lecture
• Scalable Distributed Stream Systems http://cs.brown.edu/research/aurora/cidr03.pdf
• IBM System S http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534
• SLIPstream: Scalable Low-latency Interactive Perception on Streaming Data http://www.cs.cmu.edu/~rahuls/pub/nossdav2009-rahuls.pdf
• Stampede: a cluster programming middleware for interactive stream-oriented applications https://pdfs.semanticscholar.org/0778/689991f69e06164a58e2917f5b84d935282b.pdf
• TargetContainer: A target-centric parallel programming abstraction for video-based surveillance http://ieeexplore.ieee.org/document/6042914/
• Large-scale Situation Awareness with Camera Networks and Multimodal Sensing https://pdfs.semanticscholar.org/e9e2/15b9910c8009e80d8edd77f3bbcce9666207.pdf
• MillWheel: Fault-Tolerant Stream Processing at Internet Scale http://static.googleusercontent.com/media/research.google.com/en/pubs/archive/41378.pdf
• Trill: A High-Performance Incremental Query Processor for Diverse Analytics - Focus on the capacity to handle both batch and real-time https://www.microsoft.com/en-us/research/publication/trill-a-high-performance-incremental-query-processor-for-diverse-analytics/
• Twitter Heron https://dl.acm.org/citation.cfm?id=2742788
• Introduction to the Array of Things http://www.niu.edu/azad/_pdf/3-Michael_May18_2016.pdf
Closing Headshot Cloud computing started out as offering computational services at scale for throughput-oriented apps (such as search and e-commerce). To this day, that is the bread and butter of cloud computing. Streaming apps on structured data are the next wave we are currently seeing (Facebook, Twitter, Snapchat, etc.). Streaming content (e.g., YouTube, Netflix) adds another dimension to Cloud computing. However, the next wave, namely, IoT, can be disruptive since situation awareness apps impose stringent latency constraints.
System Issues in Cloud Computing Cloud Applications: Trending Topics
KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Opening Increasingly large-scale deployments of IoT platforms for emerging applications are causing a potential disruption in the Cloud ecosystem. In this lecture, several trending topics will be discussed, such as Fog Computing, and programming systems and analytics that take geo-distributed computing infrastructures into consideration.
Outline • Fog computing • Foglets: Exemplar for Geo-distributed Programming System • Jetstream: Exemplar for Backhaul Streaming Support • Iridium: Exemplar for Geo-distributed Wide Area Analytics
System Issues in Cloud Computing Cloud Applications: Fog Computing
KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Rise of Internet of Things
A Broad Set of IoT Applications • Energy Saving (I2E) • Defense • Predictive maintenance • Enable New Knowledge • Industrial Automation • Intelligent Buildings • Enhance Safety & Security • Transportation and Connected Vehicles • Healthcare • Agriculture • Smart Home • Smart Grid
Thanks to CISCO for this slide
Realities • IoT devices heterogeneous (sensing modality, data rate, fidelity, …) • IoT devices not ultra reliable • IoT testbed multi-tenancy • IoT apps latency sensitive (sensing -> processing -> actuation)
Need of the hour… • System software support for situation awareness in large-scale distributed IoT testbed Requirements for system software infrastructure • Support multiple sensing modalities • Dynamic resource provisioning • QoS for the apps in the presence of sensor mobility
Limitations of Existing Cloud (PaaS) • Based on large data centers High latency / poor bandwidth for data-intensive apps • API designed for traditional web applications Not suitable for the future Internet apps
Why? • IoT and many new apps need interactive response at computational perception speeds • Sense -> process -> actuate
• Sensors are geo-distributed • Latency to the cloud is a limitation • Besides, uninteresting sensor streams should be quenched at the source
Future Internet Applications on IoT • Common Characteristics • Dealing with real-world data streams • Real-time interaction among mobile devices • Wide-area analytics
• Requirements • Dynamic scalability • Low-latency communication • Efficient in-network processing
Fog Computing • New computing paradigm proposed by Cisco • Extending the cloud utility computing to the edge • Provide utility computing using resources that are • Hierarchical • Geo-distributed Mobile / Sensor Edge Core
Pros of Fog Computing • Low latency and location awareness • Filter at the edge and Send only relevant data to the cloud • Geo distribution of compute, network, and storage to aid sensor processing • Alleviates bandwidth bottleneck
Challenges with Fog computing • Manage resources. • Deployment. • Program the system. • Bursty resource requirements.
System Issues in Cloud Computing Cloud Applications: Foglets – Programming System Exemplar KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
What is Foglets? • Automatically discovers nodes. • Elastically deploys resources. • Collocates applications. • Provides a communication and storage API. • Adapts resources.
Foglets • PaaS programming model on the Internet of Things • Design Goals • Simplicity: minimal interface with a single code base • Scalability: allows dynamic scaling • Context-awareness: network-, location-, resource-, capability-awareness
• Assumes Fog computing infrastructure • Infrastructure nodes are placed in the fog • IaaS interface for utility computing
Network Hierarchy
Foglets – Application Model • A Foglets application consists of distributed processes connected in a hierarchy • Each process covers a specific geographical region
Location
Foglets – API (App Deployment) [Figure: a Foglets process consists of app code and the runtime. Calls shown: Start_App() with AppKey, Region, Level, and Capacity; Connect_Fog(appkey). Handlers shown: On_create(), On_new_child(), On_child_leave(), On_new_parent().]
Discovery and Deploy Protocol [Figure: message sequence among Foglet entry points, a Foglet worker, and the Registry/Discovery service. Labels shown: Join, Ready, Accept, {Busy, Ready}, Ping, Get Entry Points, List of available Entry Points, Get Docker Container Image, Start Application, Deploy, Send Up/Down.]
Foglets – API (Communication)
on_message()
send_to()
Foglets Communication Handlers void OnSendUp(message msg) { .. } void OnSendDown(message msg) { .. } void OnReceiveFrom(message msg) { .. }
How To Respond to the Events? Handlers • On Send Up • On Send Down • On Receive From • On Migration Start • On Migration Finish
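The handler-based communication model can be illustrated with a small in-process simulation. This is a Python sketch only: the `Foglet` and `Aggregator` classes, the message fields, and the speed threshold are hypothetical stand-ins rather than the Foglets runtime, and the sketch assumes messages are delivered synchronously.

```python
class Foglet:
    """Toy stand-in for one process in the hierarchy (not the Foglets runtime)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.inbox = []
        if parent is not None:
            parent.children.append(self)

    # Calls made by application code.
    def send_up(self, msg):
        if self.parent is not None:
            self.parent.on_send_up(self.name, msg)

    def send_down(self, msg):
        for child in self.children:
            child.on_send_down(self.name, msg)

    # Handlers invoked when a message arrives; the default is to record it.
    def on_send_up(self, sender, msg):
        self.inbox.append(("up", sender, msg))

    def on_send_down(self, sender, msg):
        self.inbox.append(("down", sender, msg))


class Aggregator(Foglet):
    """Intermediate node that forwards only speeding vehicles to its parent."""

    def on_send_up(self, sender, msg):
        if msg["speed_kmh"] > 100:
            self.send_up(msg)


core = Foglet("core")
edge = Aggregator("edge", parent=core)
camera = Foglet("camera", parent=edge)
camera.send_up({"plate": "ABC", "speed_kmh": 80})   # filtered at the edge
camera.send_up({"plate": "XYZ", "speed_kmh": 130})  # forwarded to the core
print(core.inbox)
```

Overriding a handler at an intermediate level is also how the "filter at the edge, send only relevant data to the cloud" idea would show up in application code.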
Foglets – API (Context-awareness) • query_location() • query_level() • query_capacity() • query_resource() [Figure: hierarchy from Level 0 to Level 3, annotated with capabilities (camera, speed, etc.) and resources (CPU, RAM, storage).]
Foglets – Spatio-temporal Object Store • App context object is tagged by key, location, and time • get_object(key, location, time) put_object(key, location,time) • Context objects are migrated when scaling
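A spatio-temporal object store can be sketched as a map from key to objects tagged with a location and a time. The Python below is a minimal single-node illustration, not the Foglets implementation: the `value` argument, the bounding-box region, and the inclusive time range are assumptions added for the example.

```python
from collections import defaultdict


class SpatioTemporalStore:
    """Toy store: objects are tagged by key, location (x, y), and time."""

    def __init__(self):
        self._objects = defaultdict(list)  # key -> [(location, time, value)]

    def put_object(self, key, location, time, value):
        self._objects[key].append((location, time, value))

    def get_object(self, key, region, time_range):
        """Return values for `key` inside `region` during `time_range`.

        region is (min_x, min_y, max_x, max_y); time_range is (start, end),
        both inclusive. These query shapes are assumptions of this sketch.
        """
        min_x, min_y, max_x, max_y = region
        start, end = time_range
        return [
            value
            for (x, y), t, value in self._objects.get(key, [])
            if min_x <= x <= max_x and min_y <= y <= max_y and start <= t <= end
        ]


store = SpatioTemporalStore()
store.put_object("vehicle:42", (1.0, 2.0), 100, "seen at camera A")
store.put_object("vehicle:42", (9.0, 9.0), 160, "seen at camera B")
nearby = store.get_object("vehicle:42", (0, 0, 5, 5), (0, 200))
print(nearby)
```

In a distributed setting, the region and time window of a query would also determine which level of the hierarchy holds the matching objects.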
[Figure: Spatial distribution of context objects in space and time. Foglets F2, F4, and F5 cover regions A, B, and C respectively; F3 = B + C and F1 = A + B + C. Temporal distribution, from new to old: time windows of now – 3 mins, 3 mins – 5 mins, and 5 mins – 8 mins.]
Foglets – Scalability • Application scales at runtime based on the workload • Developer specifies scaling policy for each level • Load balancing based on geo-locations
[Figure: hierarchy of Level 0, Level 1, and Level 2 nodes with app context; nodes communicate upward with send_up().]
Join Protocol [Figure: message exchange among Foglets; one join request is shown as Rejected.]
Migration Protocol [Figure: message sequence among Foglet entry points and the Registry/Discovery service, with Send Up/Down traffic; steps shown include getting the list of neighbors, pinging neighbors, choosing the best neighbor, sending and accepting a migration message, transferring state and persistent data, and changing the parent.]
Foglets framework • Alleviates pain points for domain experts in Mobile sensing applications (e.g., vehicle tracking, self-driving cars, etc.) • QoS sensitive placement of application components at different levels of the computational continuum from the edge to the cloud • Multi-tenancy on the edge nodes using Docker containers • Dynamic resource provisioning commensurate with sensor mobility • Discover and deploy, join, migration protocols
System Issues in Cloud Computing Cloud Applications: Jetstream – Exemplar of Backhaul for Streaming Analytics KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Jetstream System • Streaming support – Aggregation and degradation as first class primitives
• Storage processing at the edge – Maximize “goodput” via aggregation and degradation primitives
• Allows tuning quality of analytics commensurate with available backhaul bandwidth – Aggregate data at the edge when you can – Degrade data quality if aggregation not possible
Jetstream storage abstraction • Updatable – Stored data += new data
• Reducible with predictable loss of fidelity – Data => data
• Mergeable – Data + Data = Merged Data
• Degradable – Local data => dataflow operators => approx. data
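The updatable, mergeable, and degradable properties can be made concrete with a small aggregate of counts. The Python below is an illustrative sketch, not JetStream's storage API: the `CountCube` class, integer-second timestamps, and coarsening of time buckets as the degradation step are all assumptions.

```python
from collections import Counter


class CountCube:
    """Counts keyed by (time_bucket, url); a toy stand-in for an edge aggregate."""

    def __init__(self, bucket_seconds=1):
        self.bucket_seconds = bucket_seconds
        self.cells = Counter()

    def update(self, timestamp, url, n=1):
        """Updatable: stored data += new data."""
        bucket = timestamp - timestamp % self.bucket_seconds
        self.cells[(bucket, url)] += n

    def merge(self, other):
        """Mergeable: data + data = merged data (same bucket size assumed)."""
        assert self.bucket_seconds == other.bucket_seconds
        merged = CountCube(self.bucket_seconds)
        merged.cells = self.cells + other.cells
        return merged

    def degrade(self, bucket_seconds):
        """Degradable: coarsen the time resolution so fewer cells cross the backhaul.

        bucket_seconds should be a multiple of the current bucket size.
        """
        coarse = CountCube(bucket_seconds)
        for (bucket, url), n in self.cells.items():
            coarse.update(bucket, url, n)
        return coarse


edge_a = CountCube()
edge_a.update(0, "/home")
edge_a.update(1, "/home")
edge_a.update(61, "/home")
edge_b = CountCube()
edge_b.update(1, "/home")

merged = edge_a.merge(edge_b)    # per-second counts from both sources
per_minute = merged.degrade(60)  # coarser, smaller, less precise
print(len(merged.cells), len(per_minute.cells))
```

The coarsened aggregate still answers per-minute questions exactly, which is one sense in which the loss of fidelity is predictable.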
System Issues in Cloud Computing Cloud Applications: Iridium – Exemplar of Geo-distributed Analytics KISHORE RAMACHANDRAN, PROFESSOR School of Computer Science College of Computing
Iridium Features • Analytics framework spans distributed sites and the core • Assumptions – Core is congestion free – Bottlenecks between the edge sites and the core – Heterogeneity of uplink/downlink edge from/to core
Problem being solved by Iridium • Given a dataflow graph of tasks and data sources – Where to place the tasks? • i.e., destination of the network transfers – Where to place the data sources? • i.e., sources of network transfers
• Approach – Jointly optimize data and task placement via greedy heuristic – Goal: minimize longest transfer of all links
• Constraints – Link bandwidths, query arrival rate, etc.
[Figure: inputs are sites (M), tasks (N), the data matrix (M×N), and upload and download bandwidths; the output is a placement such as Task 1 -> London, Task 2 -> Delhi, Task 3 -> Tokyo, …]
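To see why task placement matters when uplinks and downlinks are heterogeneous, consider a simplified cost model for one all-to-all stage. The Python below is a sketch under stated assumptions: each site uploads the share of its data needed elsewhere and downloads the share of other sites' data needed by its own tasks, and the two-site brute-force search stands in for a real optimizer. It is not Iridium's algorithm, and the function names are hypothetical.

```python
def bottleneck_time(data, up_bw, down_bw, task_frac):
    """Longest per-site transfer time for an all-to-all (shuffle-like) stage.

    data[i]      : intermediate data at site i (e.g., MB)
    up_bw[i]     : uplink bandwidth of site i (e.g., MB/s)
    down_bw[i]   : downlink bandwidth of site i (e.g., MB/s)
    task_frac[i] : fraction of the stage's tasks placed at site i (sums to 1)
    """
    total = sum(data)
    worst = 0.0
    for i, r in enumerate(task_frac):
        upload = (1 - r) * data[i] / up_bw[i]          # data leaving site i
        download = r * (total - data[i]) / down_bw[i]  # data arriving at site i
        worst = max(worst, upload, download)
    return worst


def best_two_site_split(data, up_bw, down_bw, steps=100):
    """Brute-force the task fraction for site 0 when there are exactly two sites."""
    candidates = [k / steps for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda r: bottleneck_time(data, up_bw, down_bw, [r, 1 - r]),
    )


# Site 0 holds most of the data but has a slow uplink.
data, up_bw, down_bw = [100, 10], [1, 10], [10, 10]
even = bottleneck_time(data, up_bw, down_bw, [0.5, 0.5])
skewed = bottleneck_time(data, up_bw, down_bw, [1.0, 0.0])
print(even, skewed, best_two_site_split(data, up_bw, down_bw))
```

In this example, placing every task at the data-heavy site cuts the modeled bottleneck from 50 s to 1 s, because nothing has to cross that site's slow uplink.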
Resources for this Lecture
• Fog Computing: A Platform for Internet of Things and Analytics http://link.springer.com/chapter/10.1007%2F978-3-319-05029-4_7
• Foglets: Incremental Deployment and Migration of Geo-distributed Situation Awareness Applications in the Fog http://www.cc.gatech.edu/projects/up/publications/DEBS-2016-foglets.pdf
• JetStream: Streaming Analytics in the Wide Area https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-rabkin.pdf
• Iridium: Low-latency Geo-distributed Data Analytics http://research.microsoft.com/en-us/UM/people/srikanth/data/gda_sigcomm15.pdf
Closing The next wave, namely, IoT, is a likely disruptor of the cloud ecosystem. It needs a rethink of the software infrastructure of the geo-distributed computational continuum from the edge to the core. This includes questions pertaining to the programming models, storage architecture, and analytics spanning the continuum. We have covered some of the new technologies in this whirlwind lecture and provided the resources for further exploration on your own. This concludes the last lecture in this mini course on Cloud Applications and future trends in cloud computing.
College of Computing, Georgia Institute of Technology

Module 4: Project Proposal

In this module of the class, you will read about other Azure services and define your project.

1 EXPECTED OUTCOME
The student is going to:
• Learn about different services available in Azure.
• Define the topic and scope of the project.

2 ASSUMPTIONS
This workshop assumes that the student has some familiarity with the Azure portal.

3 AZURE SERVICES
To define a good project, you first need to understand the tools that are available for you to use. Go to the Azure services webpage and read about the different services that you could use for your project. Pick the three services that you find most interesting and write a one-paragraph summary of each.

4 AZURE SERVICE TUTORIAL
For one of the three selected services, look at the available tutorials on the Azure webpage and discuss them with other students.

5 PROJECT TOPIC
You need to write a description of your project, including deliverables, technologies to be used, and the roadmap for completing your project successfully before the end of this class. The project should use at least two Azure services that are not containers or virtual machines.

6 DELIVERABLES
• A summary of the three services that you find most interesting.
• A report describing the deliverables and the technologies to be used in the project.
Share these documents with your fellow students, discuss the technologies that you researched, and try to answer the following questions:
• Is it the right tool to develop your project? Is there a better option?
• How much would it cost to develop the project? Use the Azure Calculator to help you in your estimates.
• How can you improve your project, either to make it more compelling to the final user or to improve its architecture?
College of Computing, Georgia Institute of Technology

Apps Workshop 2: Azure Web Apps

In this module of the class, you are going to deploy an auto-scalable web application and test its performance.

1 EXPECTED OUTCOME
The student is going to:
• Learn about web application deployment in Azure.
• Use PaaS services and configure elastic resource allocation.
• Test the scalability of services.

2 ASSUMPTIONS
This workshop assumes that the student has some familiarity with the Azure portal and knowledge of one of the following languages: Java, Node.js, PHP, or ASP.NET. Additionally, you should have the Azure CLI and git already installed on your local computer.

3 AZURE APP SERVICE
Azure App Service is a Platform as a Service (PaaS) framework that allows you to host web applications. These applications are run by the Azure system in managed virtual machines. You can define the type and number of resources that you use. The developer does not need to worry about provisioning and bootstrapping the web application; Azure does it automatically.

3.1 DEPLOYING THE BASIC SERVICE
Follow any of the tutorials in the Quickstarts section of the App Service documentation in the language of your choice. We suggest the Node.js example.

3.2 SET UP THE AUTOSCALE OPTION
Follow the tutorial Get started with Autoscale in Azure and select autoscale based on CPU utilization, with the number of VMs ranging from 1 to 3. It may take some time for the web app autoscale to be configured.

3.3 TEST THE PERFORMANCE OF YOUR APPLICATION
Follow the tutorial How To Use Apache JMeter To Perform Load Testing on a Web Server and start testing your system. Try to notice when the additional virtual machines are spawned and how this affects the overall performance of your web page. Additionally, you should monitor the performance of your application following the instructions shown here.

4 CREATE A SECOND SERVICE
Following the same steps as in section 3, modify your application to provide a service that takes more processing resources than the default webpage. In other words, add a computationally heavy function that runs each time the GET operation is performed. Just using a timer or wait is not enough, because the CPU consumption needs to increase. Simple options include calculating π to a variable precision, or calculating the prime numbers up to N, where N is a variable taken from the GET call.
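As an illustration of a CPU-bound function of this kind, here is a short Python sketch of the prime-number option. It is language-neutral in spirit and would need to be ported to whichever App Service language you chose; the function name is hypothetical, and wiring it to a GET parameter is left to your web framework.

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: return all primes <= n.

    The work grows with n, so a larger n in the request burns more CPU.
    """
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= n:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p squared.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]


print(primes_up_to(30))
```

When exposing this through a web endpoint, consider capping N on the server side so that a single request cannot exhaust memory.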
5 MONITOR
Using the tools described in section 3.3, compare the results from the original web service with this one, primarily the number of requests per minute that each can handle and when new virtual machines are spawned.

6 DELIVERABLE
• The two running web services.
• A fifteen-minute load test for each of the services that shows the difference in the request rate the system can handle, the latencies, and the number of virtual machines being used (autoscale).
College of Computing, Georgia Institute of Technology

Apps Workshop 3A: Using Spark

In this module of the class, you are going to deploy an HDInsight Spark Cluster and run a word counter on it.

1 EXPECTED OUTCOME
The student is going to:
• Learn the basics of deploying Spark in Azure
• Run applications on top of Spark
• Compare the differences between Spark and MapReduce

2 ASSUMPTIONS
This workshop assumes that the student has some familiarity with the Azure portal and MapReduce.

3 APACHE SPARK
Apache Spark is an open-source processing framework that runs large-scale data applications. It can run tasks similar to the well-known MapReduce, with the difference that it tries to keep the dataset in memory as much as possible. There are multiple applications that can run on top of the Spark Core Engine, such as Spark SQL, Spark Streaming, MLlib (Machine Learning), and GraphX (Graph Computation).

3.1 DEPLOYING THE BASIC SERVICE
Follow the tutorial Create Apache Spark cluster on HDInsight Linux and then the tutorial https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-load-data-run-query. Finally, plot the values for MAX and MIN for the temperature difference.

3.2 WORD COUNT
Using the PySpark notebook and Spark SQL, implement a word counter, similar to the one implemented in the MapReduce assignment in the previous course. Discuss with your fellow students which dataset to use for counting words. An example of an interesting dataset is the complete works of Shakespeare, or of any other author that you find interesting.
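Before running the job on the cluster, it can help to have a small reference implementation for checking results on a sample of the dataset. The following is a plain-Python sketch, not Spark code: it shows the same tokenize, group, and count logic that the Spark SQL version would express, and the tokenization rule (lowercase letters and apostrophes) is an assumption you may want to change.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def count_words(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


sample = ["To be, or not to be,", "that is the question."]
counts = count_words(sample)
print(counts.most_common(3))
```

Comparing these counts with the cluster output on the same sample is a quick way to catch differences in how punctuation and case are handled.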
3.3 SPARK VS MAPREDUCE
Write a summary of the main differences between MapReduce and Spark.

4 ADDITIONAL NOTES (READ)
After running your assignment, save the Python notebooks used to execute the commands and remove the HDInsight cluster, following this tutorial. This kind of system is charged by the amount of time that it is allocated, so after finishing you need to make sure that the cluster is deleted to avoid excessive charges.

5 DELIVERABLE
• The example cluster running the max and min computation.
• The Word Counter application.
• An explanation of the main differences between MapReduce and Spark.
• Discuss with your fellow students the differences between MapReduce and Spark. There are many web pages that discuss the details of both; share those with other students.
• Discuss with your fellow students interesting applications of Spark, or ways you have used it in the past.
College of Computing, Georgia Institute of Technology

Apps Workshop 3B: Stream Processing of IoT Data

In this module of the class, you are going to use Storm to process simulated IoT data.

1 ACKNOWLEDGEMENT
This workshop is based on the Azure Storm Tutorial that can be found here

2 EXPECTED OUTCOME
The student is going to:
• Learn about Apache Storm
• Learn about Azure EventHub
• Complete one of the Azure tutorials using EventHub with IoT information.

3 ASSUMPTIONS
This workshop assumes that the student has some familiarity with the Azure portal and Java programming.

4 APACHE STORM
Apache Storm is an open-source distributed real-time computation framework. It is made to process unbounded streams of data, where a stream of data is considered to be any input that is continuously (and maybe forever) generating data. Storm is scalable and fault-tolerant, and it guarantees that all the data generated will eventually be processed.

4.1 DEPLOYING THE SERVICE
1. Create an Apache Storm Cluster using the tutorial Create Linux-based clusters in HDInsight and select Storm as the Cluster type.
• Please note the URL that appears after creating the cluster (see this link); you can use it to connect to the dashboard of your cluster.
2. Follow the tutorial Process events from Azure Event Hubs with Storm on HDInsight
3. Write the output to HDFS on Azure, which is detailed in the tutorial above.

5 DELIVERABLE
• Run the system and show the output generated in HDFS.
• Discuss with other students the meaning of stream, spout, and bolt in the context of Apache Storm.
• Discuss with your fellow students the applications of Apache Storm, where you have used it before, and how it differs from Spark.

6 OPTIONAL
The vanilla workshop just asks you to write the output to HDFS, but with Storm it is possible to stream the data to a visualization dashboard using a custom Bolt. While our Azure Subscription does not support PowerBI, a stretch goal for this workshop would be to build a real-time visualization dashboard (e.g., Tableau, Grafana, PowerBI) that Storm can communicate with through a custom bolt. This section is open-ended and not mandatory, but serves as a fun investigation for those interested.
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, Leonidas Rigas
Microsoft

Abstract
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.

Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage Management—Secondary storage; D.4.3 [Operating Systems]: File Systems Management—Distributed file systems; D.4.5 [Operating Systems]: Reliability—Fault tolerance; D.4.7 [Operating Systems]: Organization and Design—Distributed systems; D.4.8 [Operating Systems]: Performance—Measurements

General Terms
Algorithms, Design, Management, Measurement, Performance, Reliability.

Keywords
Cloud storage, distributed storage systems, Windows Azure.

1. Introduction
Windows Azure Storage (WAS) is a scalable cloud storage system that has been in production since November 2008. It is used inside Microsoft for applications such as social networking search, serving video, music and game content, managing medical records, and more. In addition, there are thousands of customers outside Microsoft using WAS, and anyone can sign up over the Internet to use the system.

WAS provides cloud storage in the form of Blobs (user files), Tables (structured storage), and Queues (message delivery). These three data abstractions provide the overall storage and workflow for many applications. A common usage pattern we see is incoming and outgoing data being shipped via Blobs, Queues providing the overall workflow for processing the Blobs, and intermediate service state and final results being kept in Tables or Blobs.

An example of this pattern is an ingestion engine service built on Windows Azure to provide near real-time Facebook and Twitter search. This service is one part of a larger data processing pipeline that provides publically searchable content (via our search engine, Bing) within 15 seconds of a Facebook or Twitter user’s posting or status update. Facebook and Twitter send the raw public content to WAS (e.g., user postings, user status updates, etc.) to be made publically searchable. This content is stored in WAS Blobs. The ingestion engine annotates this data with user auth, spam, and adult scores; content classification; and classification for language and named entities. In addition, the engine crawls and expands the links in the data. While processing, the ingestion engine accesses WAS Tables at high rates and stores the results back into Blobs. These Blobs are then folded into the Bing search engine to make the content publically searchable. The ingestion engine uses Queues to manage the flow of work, the indexing jobs, and the timing of folding the results into the search engine. As of this writing, the ingestion engine for Facebook and Twitter keeps around 350TB of data in WAS (before replication). In terms of transactions, the ingestion engine has a peak traffic load of around 40,000 transactions per second and does between two to three billion transactions per day (see Section 7 for discussion of additional workload profiles).

In the process of building WAS, feedback from potential internal and external customers drove many design decisions. Some key design features resulting from this feedback include:

Strong Consistency – Many customers want strong consistency: especially enterprise customers moving their line of business applications to the cloud. They also want the ability to perform conditional reads, writes, and deletes for optimistic concurrency control [12] on the strongly consistent data. For this, WAS provides three properties that the CAP theorem [2] claims are difficult to achieve at the same time: strong consistency, high availability, and partition tolerance (see Section 8).
Global and Scalable Namespace/Storage – For ease of use, WAS implements a global namespace that allows data to be stored and accessed in a consistent manner from any location in the world. Since a major goal of WAS is to enable storage of massive amounts of data, this global namespace must be able to address exabytes of data and beyond. We discuss our global namespace design in detail in Section 2.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SOSP '11, October 23-26, 2011, Cascais, Portugal. Copyright © 2011 ACM 978-1-4503-0977-6/11/10 ... $10.00.
Disaster Recovery – WAS stores customer data across multiple data centers hundreds of miles apart from each other. This redundancy provides essential data recovery protection against disasters such as earthquakes, wild fires, tornados, nuclear reactor meltdown, etc.
primary key that consists of two properties: the PartitionName and the ObjectName. This distinction allows applications using Tables to group rows into the same partition to perform atomic transactions across them. For Queues, the queue name is the PartitionName and each message has an ObjectName to uniquely identify it within the queue.
Multi-tenancy and Cost of Storage – To reduce storage cost, many customers are served from the same shared storage infrastructure. WAS combines the workloads of many different customers with varying resource needs together so that significantly less storage needs to be provisioned at any one point in time than if those services were run on their own dedicated hardware.
3. High Level Architecture Here we present a high level discussion of the WAS architecture and how it fits into the Windows Azure Cloud Platform.
3.1 Windows Azure Cloud Platform The Windows Azure Cloud platform runs many cloud services across different data centers and different geographic regions. The Windows Azure Fabric Controller is a resource provisioning and management layer that provides resource allocation, deployment/upgrade, and management for cloud services on the Windows Azure platform. WAS is one such service running on top of the Fabric Controller.
We describe these design features in more detail in the following sections. The remainder of this paper is organized as follows. Section 2 describes the global namespace used to access the WAS Blob, Table, and Queue data abstractions. Section 3 provides a high level overview of the WAS architecture and its three layers: Stream, Partition, and Front-End layers. Section 4 describes the stream layer, and Section 5 describes the partition layer. Section 6 shows the throughput experienced by Windows Azure applications accessing Blobs and Tables. Section 7 describes some internal Microsoft workloads using WAS. Section 8 discusses design choices and lessons learned. Section 9 presents related work, and Section 10 summarizes the paper.
The Fabric Controller provides node management, network configuration, health monitoring, starting/stopping of service instances, and service deployment for the WAS system. In addition, WAS retrieves network topology information, physical layout of the clusters, and hardware configuration of the storage nodes from the Fabric Controller. WAS is responsible for managing the replication and data placement across the disks and load balancing the data and application traffic within the storage cluster.
2. Global Partitioned Namespace A key goal of our storage system is to provide a single global namespace that allows clients to address all of their storage in the cloud and scale to arbitrary amounts of storage needed over time. To provide this capability we leverage DNS as part of the storage namespace and break the storage namespace into three parts: an account name, a partition name, and an object name. As a result, all data is accessible via a URI of the form:
3.2 WAS Architectural Components An important feature of WAS is the ability to store and provide access to an immense amount of storage (exabytes and beyond). We currently have 70 petabytes of raw storage in production and are in the process of provisioning a few hundred more petabytes of raw storage based on customer demand for 2012.
http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
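The three-part namespace can be illustrated by splitting a URI of this form into its components. The Python below is a simplified sketch: the function name and the example URI are hypothetical, and it treats the first path segment as the PartitionName and the remainder as the ObjectName, which follows the generic form above rather than the exact layout of the real Blob, Table, or Queue REST paths.

```python
from urllib.parse import urlsplit


def parse_was_uri(uri):
    """Split a WAS-style URI into account, service, partition, and object."""
    parts = urlsplit(uri)
    labels = parts.hostname.split(".")
    account, service = labels[0], labels[1]  # AccountName.<service>.core...
    partition, _, obj = parts.path.lstrip("/").partition("/")
    return {
        "account": account,
        "service": service,        # blob, table, or queue
        "partition": partition,
        "object": obj or None,     # ObjectName is optional
    }


parsed = parse_was_uri("https://myaccount.table.core.windows.net/customers/row42")
print(parsed)
```

The account and service come from the DNS host name, so they are resolved before a request reaches a storage cluster; the partition and object names are only interpreted once the request arrives there.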
The WAS production system consists of Storage Stamps and the Location Service (shown in Figure 1).
The AccountName is the customer selected account name for accessing storage and is part of the DNS host name. The AccountName DNS translation is used to locate the primary storage cluster and data center where the data is stored. This primary location is where all requests go to reach the data for that account. An application may use multiple AccountNames to store its data across different locations.
Storage Stamps – A storage stamp is a cluster of N racks of storage nodes, where each rack is built out as a separate fault domain with redundant networking and power. Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack. Our first generation storage stamps hold approximately 2PB of raw storage each. Our next generation stamps hold up to 30PB of raw storage each.
<service> specifies the service type, which can be blob, table, or queue.
APIs for Windows Azure Blobs, Tables, and Queues can be found here: http://msdn.microsoft.com/en-us/library/dd179355.aspx
Figure 1: High-level architecture
This naming approach enables WAS to flexibly support its three data abstractions2. For Blobs, the full blob name is the PartitionName. For Tables, each entity (row) in the table has a
When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition. The system supports atomic transactions across objects with the same PartitionName value. The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account.
In conjunction with the AccountName, the PartitionName locates the data once a request reaches the storage cluster. The PartitionName is used to scale out access to the data across storage nodes based on traffic needs.
To provide low cost cloud storage, we need to keep the storage provisioned in production as highly utilized as possible. Our goal is to keep a storage stamp around 70% utilized in terms of capacity, transactions, and bandwidth. We try to avoid going above 80% because we want to keep 20% in reserve for (a) disk short stroking, which gains better seek time and higher throughput by utilizing the outer tracks of the disks, and (b) continuing to provide storage capacity and availability in the presence of a rack failure within a stamp. When a storage stamp reaches 70% utilization, the location service migrates accounts to different stamps using inter-stamp replication (see Section 3.4).
Partition Layer – The partition layer is built for (a) managing and understanding higher level data abstractions (Blob, Table, Queue), (b) providing a scalable object namespace, (c) providing transaction ordering and strong consistency for objects, (d) storing object data on top of the stream layer, and (e) caching object data to reduce disk I/O. Another responsibility of this layer is to achieve scalability by partitioning all of the data objects within a stamp. As described earlier, all objects have a PartitionName; they are broken down into disjoint ranges based on the PartitionName values and served by different partition servers. This layer manages which partition server is serving which PartitionName ranges for Blobs, Tables, and Queues. In addition, it provides automatic load balancing of PartitionNames across the partition servers to meet the traffic needs of the objects.
Location Service (LS) – The location service manages all the storage stamps. It is also responsible for managing the account namespace across all stamps. The LS allocates accounts to storage stamps and manages them across the storage stamps for disaster recovery and load balancing. The location service itself is distributed across two geographic locations for its own disaster recovery.
Front-End (FE) layer – The Front-End (FE) layer consists of a set of stateless servers that take incoming requests. Upon receiving a request, an FE looks up the AccountName, authenticates and authorizes the request, then routes the request to a partition server in the partition layer (based on the PartitionName). The system maintains a Partition Map that keeps track of the PartitionName ranges and which partition server is serving which PartitionNames. The FE servers cache the Partition Map and use it to determine which partition server to forward each request to. The FE servers also stream large objects directly from the stream layer and cache frequently accessed data for efficiency.
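The routing step performed by an FE can be sketched as a sorted-range lookup. This is a minimal illustration, assuming the cached Partition Map is a list of (low key, server) pairs in which each range runs up to the next low key; the class name `PartitionMap` and the server names are hypothetical, not the actual WAS data structure.

```python
import bisect
from typing import List, Tuple


class PartitionMap:
    """Maps non-overlapping PartitionName ranges to partition servers."""

    def __init__(self, ranges: List[Tuple[str, str]]):
        # Each entry is (low_key, server); a range covers keys from its
        # low_key up to (but not including) the next entry's low_key.
        ranges = sorted(ranges)
        if not ranges or ranges[0][0] != "":
            # The first range must start at the empty key so that every
            # PartitionName falls into some range.
            raise ValueError("first range must start at the empty key")
        self._low_keys = [low for low, _ in ranges]
        self._servers = [server for _, server in ranges]

    def lookup(self, partition_name: str) -> str:
        # Find the last range whose low_key <= partition_name.
        idx = bisect.bisect_right(self._low_keys, partition_name) - 1
        return self._servers[idx]
```

With `PartitionMap([("", "ps1"), ("h", "ps2"), ("q", "ps3")])`, a request for PartitionName `"apple"` is routed to `ps1` and one for `"zebra"` to `ps3`.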
WAS provides storage from multiple locations in each of the three geographic regions: North America, Europe, and Asia. Each location is a data center with one or more buildings in that location, and each location holds multiple storage stamps. To provision additional capacity, the LS has the ability to easily add new regions, new locations to a region, or new stamps to a location. Therefore, to increase the amount of storage, we deploy one or more storage stamps in the desired location’s data center and add them to the LS. The LS can then allocate new storage accounts to those new stamps for customers as well as load balance (migrate) existing storage accounts from older stamps to the new stamps.
3.4 Two Replication Engines
Before describing the stream and partition layers in detail, we first give a brief overview of the two replication engines in our system and their separate responsibilities.
Figure 1 shows the location service with two storage stamps and the layers within the storage stamps. The LS tracks the resources used by each storage stamp in production across all locations. When an application requests a new account for storing data, it specifies the location affinity for the storage (e.g., US North). The LS then chooses a storage stamp within that location as the primary stamp for the account using heuristics based on the load information across all stamps (which considers the fullness of the stamps and other metrics such as network and transaction utilization). The LS then stores the account metadata information in the chosen storage stamp, which tells the stamp to start taking traffic for the assigned account. The LS then updates DNS to allow requests to now route from the name https://AccountName.service.core.windows.net/ to that storage stamp’s virtual IP (VIP, an IP address the storage stamp exposes for external traffic).
Intra-Stamp Replication (stream layer) – This system provides synchronous replication and is focused on making sure all the data written into a stamp is kept durable within that stamp. It keeps enough replicas of the data across different nodes in different fault domains to keep data durable within the stamp in the face of disk, node, and rack failures. Intra-stamp replication is done completely by the stream layer and is on the critical path of the customer’s write requests. Once a transaction has been replicated successfully with intra-stamp replication, success can be returned back to the customer.
Inter-Stamp Replication (partition layer) – This system provides asynchronous replication and is focused on replicating data across stamps. Inter-stamp replication is done in the background and is off the critical path of the customer’s request. This replication is at the object level, where either the whole object is replicated or recent delta changes are replicated for a given account. Inter-stamp replication is used for (a) keeping a copy of an account’s data in two locations for disaster recovery and (b) migrating an account’s data between stamps. Inter-stamp replication is configured for an account by the location service and performed by the partition layer.
3.3 Three Layers within a Storage Stamp
Also shown in Figure 1 are the three layers within a storage stamp. From bottom up these are:
Stream Layer – This layer stores the bits on disk and is in charge of distributing and replicating the data across many servers to keep data durable within a storage stamp. The stream layer can be thought of as a distributed file system layer within a stamp. It understands files, called “streams” (which are ordered lists of large storage chunks called “extents”), how to store them, how to replicate them, etc., but it does not understand higher level object constructs or their semantics. The data is stored in the stream layer, but it is accessible from the partition layer. In fact, partition servers (daemon processes in the partition layer) and stream servers are co-located on each storage node in a stamp.
Inter-stamp replication is focused on replicating objects and the transactions applied to those objects, whereas intra-stamp replication is focused on replicating blocks of disk storage that are used to make up the objects. We separated replication into intra-stamp and inter-stamp at these two different layers for the following reasons. Intra-stamp replication provides durability against hardware failures, which occur frequently in large scale systems, whereas inter-stamp replication provides geo-redundancy against geo-disasters, which are rare. It is crucial to provide intra-stamp replication with low latency, since that is on the critical path of user requests; whereas the focus of inter-stamp replication is optimal use of network bandwidth between stamps while achieving an acceptable level of replication delay. They are different problems addressed by the two replication schemes.
and consists of a sequence of blocks. The target extent size used by the partition layer is 1GB. To store small objects, the partition layer appends many of them to the same extent and even in the same block; to store large TB-sized objects (Blobs), the object is broken up over many extents by the partition layer. As part of its index, the partition layer keeps track of the streams, extents, and byte offsets within the extents at which objects are stored.
Another reason for creating these two separate replication layers is the namespace each of these two layers has to maintain. Performing intra-stamp replication at the stream layer allows the amount of information that needs to be maintained to be scoped by the size of a single storage stamp. This focus allows all of the meta-state for intra-stamp replication to be cached in memory for performance (see Section 4), enabling WAS to provide fast replication with strong consistency by quickly committing transactions within a single stamp for customer requests. In contrast, the partition layer combined with the location service controls and understands the global object namespace across stamps, allowing it to efficiently replicate and maintain object state across data centers.
Streams – Every stream has a name in the hierarchical namespace maintained at the stream layer, and a stream looks like a big file to the partition layer. Streams are appended to and can be randomly read from. A stream is an ordered list of pointers to extents which is maintained by the Stream Manager. When the extents are concatenated together they represent the full contiguous address space in which the stream can be read in the order they were added to the stream. A new stream can be constructed by concatenating extents from existing streams, which is a fast operation since it just updates a list of pointers. Only the last extent in the stream can be appended to. All of the prior extents in the stream are immutable.
4. Stream Layer
The stream layer provides an internal interface used only by the partition layer. It provides a file system like namespace and API, except that all writes are append-only. It allows clients (the partition layer) to open, close, delete, rename, read, append to, and concatenate these large files, which are called streams. A stream is an ordered list of extent pointers, and an extent is a sequence of append blocks.
4.1 Stream Manager and Extent Nodes
The two main architecture components of the stream layer are the Stream Manager (SM) and Extent Node (EN) (shown in Figure 3).
Figure 2 shows stream “//foo”, which contains (pointers to) four extents (E1, E2, E3, and E4). Each extent contains a set of blocks that were appended to it. E1, E2 and E3 are sealed extents. It means that they can no longer be appended to; only the last extent in a stream (E4) can be appended to. If an application reads the data of the stream from beginning to end, it would get the block contents of the extents in the order of E1, E2, E3 and E4.
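The stream and extent relationship described above can be sketched with two small classes. This is an in-memory illustration only; the class and method names are hypothetical, and a real stream holds pointers to replicated extents rather than the block data itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extent:
    name: str
    blocks: List[bytes] = field(default_factory=list)
    sealed: bool = False

    def append(self, block: bytes) -> None:
        if self.sealed:
            raise ValueError(f"extent {self.name} is sealed")
        self.blocks.append(block)


@dataclass
class Stream:
    name: str
    extents: List[Extent] = field(default_factory=list)  # ordered pointers

    def append(self, block: bytes) -> None:
        # Only the last extent in the stream can be appended to.
        if not self.extents:
            raise ValueError("stream has no extent to append to")
        self.extents[-1].append(block)

    def seal_last_and_add(self, new_extent: Extent) -> None:
        # Seal the current last extent (if any), then add a new last extent.
        if self.extents:
            self.extents[-1].sealed = True
        self.extents.append(new_extent)

    def read_all(self) -> List[bytes]:
        # Blocks are returned in the order the extents were added.
        return [block for extent in self.extents for block in extent.blocks]

    @staticmethod
    def concatenate(name: str, streams: List["Stream"]) -> "Stream":
        # Concatenation only copies the list of extent pointers.
        return Stream(name, [e for s in streams for e in s.extents])
```

Building "//foo" by adding E1 through E4 in turn and reading it back returns the blocks of E1, then E2, E3, and E4, with only E4 left unsealed.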
Figure 3: Stream Layer Architecture
Stream Manager (SM) – The SM keeps track of the stream namespace, what extents are in each stream, and the extent allocation across the Extent Nodes (EN). The SM is a standard Paxos cluster [13] as used in prior storage systems [3], and is off the critical path of client requests. The SM is responsible for (a) maintaining the stream namespace and state of all active streams and extents, (b) monitoring the health of the ENs, (c) creating and assigning extents to ENs, (d) performing the lazy re-replication of extent replicas that are lost due to hardware failures or unavailability, (e) garbage collecting extents that are no longer pointed to by any stream, and (f) scheduling the erasure coding of extent data according to stream policy (see Section 4.4).
Figure 2: Example stream with four extents
In more detail these data concepts are:
Block – This is the minimum unit of data for writing and reading. A block can be up to N bytes (e.g. 4MB). Data is written (appended) as one or more concatenated blocks to an extent, where blocks do not have to be the same size. The client does an append in terms of blocks and controls the size of each block. A client read gives an offset to a stream or extent, and the stream layer reads as many blocks as needed at the offset to fulfill the length of the read. When performing a read, the entire contents of a block are read. This is because the stream layer stores its checksum validation at the block level, one checksum per block. The whole block is read to perform the checksum validation, and it is checked on every block read. In addition, all blocks in the system are validated against their checksums once every few days to check for data integrity issues.
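The block-granularity read can be sketched as follows. The use of CRC-32 (via `zlib.crc32`) is an assumption for illustration, since the checksum algorithm is not specified here; `make_block` and `read_range` are hypothetical helper names.

```python
import zlib
from typing import List, Tuple

StoredBlock = Tuple[bytes, int]  # (payload, checksum stored with the block)


def make_block(payload: bytes) -> StoredBlock:
    # Assumption: CRC-32 stands in for whatever checksum the system uses.
    return payload, zlib.crc32(payload)


def read_range(blocks: List[StoredBlock], offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset`, validating every block touched."""
    out = bytearray()
    end = offset + length
    pos = 0
    for payload, checksum in blocks:
        block_start, block_end = pos, pos + len(payload)
        pos = block_end
        if block_end <= offset:
            continue            # block lies entirely before the read
        if block_start >= end:
            break               # block lies entirely after the read
        # The whole block is read and checked, even for a partial read.
        if zlib.crc32(payload) != checksum:
            raise IOError(f"checksum mismatch in block at offset {block_start}")
        lo = max(offset, block_start) - block_start
        hi = min(end, block_end) - block_start
        out += payload[lo:hi]
    return bytes(out)
```

A read that spans two blocks validates both of them in full before returning the requested byte range.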
The SM periodically polls (syncs) the state of the ENs and what extents they store. If the SM discovers that an extent is replicated on fewer than the expected number of ENs, a re-replication of the extent will lazily be created by the SM to regain the desired level of replication. For extent replica placement, the SM randomly chooses ENs across different fault domains, so that they are stored on nodes that will not have correlated failures due to power, network, or being on the same rack.
Extent – Extents are the unit of replication in the stream layer, and the default replication policy is to keep three replicas within a storage stamp for an extent. Each extent is stored in an NTFS file
The SM does not know anything about blocks, just streams and extents. The SM is off the critical path of client requests and does not track each block append, since the total number of blocks can be huge and the SM cannot scale to track those. Since the stream and extent state is only tracked within a single stamp, the amount of state can be kept small enough to fit in the SM’s memory. The only client of the stream layer is the partition layer, and the partition layer and stream layer are co-designed so that they will not use more than 50 million extents and no more than 100,000 streams for a single storage stamp given our current stamp sizes. This parameterization can comfortably fit into 32GB of memory for the SM.
of the partition layer providing strong consistency is built upon the following guarantees from the stream layer:
1. Once a record is appended and acknowledged back to the client, any later reads of that record from any replica will see the same data (the data is immutable).
2. Once an extent is sealed, any reads from any sealed replica will always see the same contents of the extent.
The data center, Fabric Controller, and WAS have security mechanisms in place to guard against malicious adversaries, so the stream replication does not deal with such threats. We consider faults ranging from disk and node errors to power failures, network issues, bit-flip and random hardware failures, as well as software bugs. These faults can cause data corruption; checksums are used to detect such corruption. The rest of the section discusses the intra-stamp replication scheme within this context.
Extent Nodes (EN) – Each extent node maintains the storage for a set of extent replicas assigned to it by the SM. An EN has N disks attached, which it completely controls for storing extent replicas and their blocks. An EN knows nothing about streams, and only deals with extents and blocks. Internally on an EN server, every extent on disk is a file, which holds data blocks and their checksums, and an index which maps extent offsets to blocks and their file location. Each extent node contains a view about the extents it owns and where the peer replicas are for a given extent. This view is a cache kept by the EN of the global state the SM keeps. ENs only talk to other ENs to replicate block writes (appends) sent by a client, or to create additional copies of an existing replica when told to by the SM. When an extent is no longer referenced by any stream, the SM garbage collects the extent and notifies the ENs to reclaim the space.
4.3.1 Replication Flow
As shown in Figure 3, when a stream is first created (step A), the SM assigns three replicas for the first extent (one primary and two secondary) to three extent nodes (step B), which are chosen by the SM to randomly spread the replicas across different fault and upgrade domains while considering extent node usage (for load balancing). In addition, the SM decides which replica will be the primary for the extent. Writes to an extent are always performed from the client to the primary EN, and the primary EN is in charge of coordinating the write to two secondary ENs. The primary EN and the location of the three replicas never change for an extent while it is being appended to (while the extent is unsealed). Therefore, no leases are used to represent the primary EN for an extent, since the primary is always fixed while an extent is unsealed.
4.2 Append Operation and Sealed Extent
Streams can only be appended to; existing data cannot be modified. The append operations are atomic: either the entire data block is appended, or nothing is. Multiple blocks can be appended at once, as a single atomic “multi-block append” operation. The minimum read size from a stream is a single block. The “multi-block append” operation allows us to write a large amount of sequential data in a single append and to later perform small reads. The contract used between the client (partition layer) and the stream layer is that the multi-block append will occur atomically, and if the client never hears back for a request (due to failure) the client should retry the request (or seal the extent). This contract implies that the client needs to expect the same block to be appended more than once in face of timeouts and correctly deal with processing duplicate records. The partition layer deals with duplicate records in two ways (see Section 5 for details on the partition layer streams). For the metadata and commit log streams, all of the transactions written have a sequence number and duplicate records will have the same sequence number. For the row data and blob data streams, for duplicate writes, only the last write will be pointed to by the RangePartition data structures, so the prior duplicate writes will have no references and will be garbage collected later.
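The sequence-number approach for the metadata and commit log streams can be sketched as a filter applied while replaying a log. The record layout (a sequence number paired with a payload) and the function name `drop_duplicates` are assumptions for illustration.

```python
from typing import Iterable, Iterator, Set, Tuple

Record = Tuple[int, bytes]  # (sequence number, payload); hypothetical layout


def drop_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Yield each record once, skipping retried appends.

    A retried append carries the same sequence number as the original,
    so any sequence number already seen marks a duplicate.
    """
    seen: Set[int] = set()
    for seq, payload in records:
        if seq in seen:
            continue
        seen.add(seq)
        yield seq, payload
```

Replaying a log in which record 2 was appended twice after a timeout then processes records 1, 2, and 3 exactly once each.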
When the SM allocates the extent, the extent information is sent back to the client, which then knows which ENs hold the three replicas and which one is the primary. This state is now part of the stream’s metadata information held in the SM and cached on the client. When the last extent in the stream that is being appended to becomes sealed, the same process repeats. The SM then allocates another extent, which now becomes the last extent in the stream, and all new appends now go to the new last extent for the stream. For an extent, every append is replicated three times across the extent’s replicas. A client sends all write requests to the primary EN, but it can read from any replica, even for unsealed extents. The append is sent to the primary EN for the extent by the client, and the primary is then in charge of (a) determining the offset of the append in the extent, (b) ordering (choosing the offset of) all of the appends if there are concurrent append requests to the same extent outstanding, (c) sending the append with its chosen offset to the two secondary extent nodes, and (d) only returning success for the append to the client after a successful append has occurred to disk for all three extent nodes. The sequence of steps during an append is shown in Figure 3 (labeled with numbers). Only when all of the writes have succeeded for all three replicas will the primary EN then respond to the client that the append was a success. If there are multiple outstanding appends to the same extent, the primary EN will respond success in the order of their offset (commit them in order) to the clients. As appends commit in order for a replica, the last append position is considered to be the current commit length of the replica. We ensure that the bits are the same between all replicas by the fact that the primary EN for an extent never changes, it always picks the offset for appends,
An extent has a target size, specified by the client (partition layer), and when it fills up to that size the extent is sealed at a block boundary, and then a new extent is added to the stream and appends continue into that new extent. Once an extent is sealed it can no longer be appended to. A sealed extent is immutable, and the stream layer performs certain optimizations on sealed extents like erasure coding cold extents. Extents in a stream do not have to be the same size, and they can be sealed anytime and can even grow arbitrarily large.
4.3 Stream Layer Intra-Stamp Replication
The stream layer and partition layer are co-designed to provide strong consistency at the object transaction level. The correctness
appends for an extent are committed in order, and how extents are sealed upon failures (discussed in Section 4.3.2).
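The append path through the primary EN described in Section 4.3.1 can be sketched as below. This is a single-process illustration with hypothetical class names; the writes are issued one after another here, whereas the real system sends them to the secondaries in parallel and writes to disk.

```python
from typing import List


class ExtentReplica:
    """One replica of an unsealed extent, held in memory for illustration."""

    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self.data = bytearray()

    def write_at(self, offset: int, block: bytes) -> bool:
        # A replica only accepts a write at its current end of data.
        if not self.available or offset != len(self.data):
            return False
        self.data += block
        return True

    @property
    def length(self) -> int:
        return len(self.data)


class PrimaryEN:
    def __init__(self, local: ExtentReplica, secondaries: List[ExtentReplica]):
        self.local = local
        self.secondaries = secondaries
        self.next_offset = 0

    def append(self, block: bytes) -> int:
        # The primary alone picks the offset, which orders all appends.
        offset = self.next_offset
        results = [r.write_at(offset, block)
                   for r in [self.local] + self.secondaries]
        # Success is returned only if every replica wrote the block.
        if not all(results):
            raise IOError("append failed; the extent should be sealed")
        self.next_offset += len(block)
        return offset
```

If one secondary is unreachable, the append fails even though the other replicas may already hold the block, which is why replicas can differ in length when the extent is later sealed.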
1. Read records at known locations. The partition layer uses two types of data streams (row and blob). For these streams, it always reads at specific locations (extent+offset, length). More importantly, the partition layer will only read these two streams using the location information returned from a previous successful append at the stream layer. That will only occur if the append was successfully committed to all three replicas. The replication scheme guarantees such reads always see the same data.
When a stream is opened, the metadata for its extents is cached at the client, so the client can go directly to the ENs for reading and writing without talking to the SM until the next extent needs to be allocated for the stream. If during writing, one of the replica’s ENs is not reachable or there is a disk failure for one of the replicas, a write failure is returned to the client. The client then contacts the SM, and the extent that was being appended to is sealed by the SM at its current commit length (see Section 4.3.2). At this point the sealed extent can no longer be appended to. The SM will then allocate a new extent with replicas on different (available) ENs, which makes it now the last extent of the stream. The information for this new extent is returned to the client. The client then continues appending to the stream with its new extent. This process of sealing by the SM and allocating the new extent is done on average within 20ms. A key point here is that the client can continue appending to a stream as soon as the new extent has been allocated, and it does not rely on a specific node to become available again.
2. Iterate all records sequentially in a stream on partition load. Each partition has two additional streams (metadata and commit log). These are the only streams that the partition layer will read sequentially from a starting point to the very last record of a stream. This operation only occurs when the partition is loaded (explained in Section 5). The partition layer ensures that no useful appends from the partition layer will happen to these two streams during partition load. Then the partition and stream layer together ensure that the same sequence of records is returned on partition load. At the start of a partition load, the partition server sends a “check for commit length” to the primary EN of the last extent of these two streams. This checks whether all the replicas are available and that they all have the same length. If not, the extent is sealed and reads are only performed, during partition load, against a replica sealed by the SM. This ensures that the partition load will see all of its data and the exact same view, even if we were to repeatedly load the same partition reading from different sealed replicas for the last extent of the stream.
For the newly sealed extent, the SM will create new replicas to bring it back to the expected level of redundancy in the background if needed.
4.3.2 Sealing
From a high level, the SM coordinates the sealing operation among the ENs; it determines the commit length of the extent used for sealing based on the commit length of the extent replicas. Once the sealing is done, the commit length will never change again.
4.4 Erasure Coding Sealed Extents
To reduce the cost of storage, WAS erasure codes sealed extents for Blob storage. WAS breaks an extent into N roughly equal sized fragments at block boundaries. Then, it adds M error correcting code fragments using Reed-Solomon for the erasure coding algorithm [19]. As long as it does not lose more than M fragments (across the data fragments + code fragments), WAS can recreate the full extent.
To seal an extent, the SM asks all three ENs their current length. During sealing, either all replicas have the same length, which is the simple case, or a given replica is longer or shorter than another replica for the extent. This latter case can only occur during an append failure where some but not all of the ENs for the replica are available (i.e., some of the replicas get the append block, but not all of them). We guarantee that the SM will seal the extent even if the SM may not be able to reach all the ENs involved. When sealing the extent, the SM will choose the smallest commit length based on the available ENs it can talk to. This will not cause data loss since the primary EN will not return success unless all replicas have been written to disk for all three ENs. This means the smallest commit length is sure to contain all the writes that have been acknowledged to the client. In addition, it is also fine if the final length contains blocks that were never acknowledged back to the client, since the client (partition layer) correctly deals with these as described in Section 4.2. During the sealing, all of the extent replicas that were reachable by the SM are sealed to the commit length chosen by the SM.
Erasure coding sealed extents is an important optimization, given the amount of data we are storing. It reduces the cost of storing data from three full replicas within a stamp, which is three times the original data, to only 1.3x – 1.5x the original data, depending on the number of fragments used. In addition, erasure coding actually increases the durability of the data when compared to keeping three replicas within a stamp.
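The storage overhead follows directly from the fragment counts: with N data fragments and M code fragments, the stored size is (N + M) / N times the original. The sketch below computes this ratio; the specific (N, M) pairs used in the example are illustrative assumptions, not values stated in this paper.

```python
def storage_overhead(n_data: int, m_code: int) -> float:
    """Stored bytes per original byte for N data + M code fragments."""
    if n_data <= 0 or m_code < 0:
        raise ValueError("need n_data > 0 and m_code >= 0")
    return (n_data + m_code) / n_data


REPLICATION_OVERHEAD = 3.0  # three full replicas within a stamp
```

For instance, `storage_overhead(6, 3)` is 1.5, and `storage_overhead(12, 4)` is about 1.33, both well below the 3x cost of full replication.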
4.5 Read Load-Balancing
When reads are issued for an extent that has three replicas, they are submitted with a “deadline” value which specifies that the read should not be attempted if it cannot be fulfilled within the deadline. If the EN determines the read cannot be fulfilled within the time constraint, it will immediately reply to the client that the deadline cannot be met. This mechanism allows the client to select a different EN to read that data from, likely allowing the read to complete faster.
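The deadline mechanism can be sketched as a client that tries replicas in turn and moves on as soon as one declines. The class names and the `estimated_ms` field (standing in for the EN's own estimate of how long the read would take) are hypothetical.

```python
from typing import List


class DeadlineExceeded(Exception):
    """Raised by an EN that cannot serve a read within the deadline."""


class ExtentNode:
    def __init__(self, name: str, estimated_ms: float, data: bytes):
        self.name = name
        self.estimated_ms = estimated_ms  # assumed estimate of read latency
        self.data = data

    def read(self, deadline_ms: float) -> bytes:
        # Reply immediately instead of attempting a read that would be late.
        if self.estimated_ms > deadline_ms:
            raise DeadlineExceeded(self.name)
        return self.data


def read_with_deadline(replicas: List[ExtentNode], deadline_ms: float) -> bytes:
    for en in replicas:
        try:
            return en.read(deadline_ms)
        except DeadlineExceeded:
            continue  # try the next replica
    raise DeadlineExceeded("no replica can meet the deadline")
```

A client holding one heavily loaded replica and one lightly loaded replica gets its data from the second without waiting on the first.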
Once the sealing is done, the commit length of the extent will never be changed. If an EN was not reachable by the SM during the sealing process but later becomes reachable, the SM will force the EN to synchronize the given extent to the chosen commit length. This ensures that once an extent is sealed, all its available replicas (the ones the SM can eventually reach) are bitwise identical.
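The SM's choice of commit length during sealing can be sketched as taking the minimum over the ENs it can reach. The function name and the use of `None` to mark an unreachable EN are assumptions; the behavior when no EN is reachable is not described here, so the sketch simply raises an error in that case.

```python
from typing import Dict, Optional


def choose_seal_length(commit_lengths: Dict[str, Optional[int]]) -> int:
    """Pick the commit length used to seal an extent.

    `commit_lengths` maps EN name to its reported length, or None if the
    SM could not reach that EN during sealing.
    """
    reachable = [n for n in commit_lengths.values() if n is not None]
    if not reachable:
        # Assumption: this case is outside what the text specifies.
        raise RuntimeError("no extent node reachable")
    # The smallest length still contains every acknowledged append,
    # because an append is acknowledged only after all replicas wrote it.
    return min(reachable)
```

With reported lengths of 120 and 100 and one unreachable EN, the extent is sealed at 100, and the unreachable EN is later synchronized to that length.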
This method is also used with erasure coded data. When reads cannot be serviced in a timely manner due to a heavily loaded spindle to the data fragment, the read may be serviced faster by doing a reconstruction rather than reading that data fragment. In this case, reads (for the range of the fragment needed to satisfy the client request) are issued to all fragments of an erasure coded extent, and the first N responses are used to reconstruct the desired fragment.
4.3.3 Interaction with Partition Layer
An interesting case is when, due to network partitioning, a client (partition server) is still able to talk to an EN that the SM could not talk to during the sealing process. This section explains how the partition layer handles this case. The partition layer has two different read patterns:
scalable namespace for the objects, (d) load balancing to access objects across the available partition servers, and (e) transaction ordering and strong consistency for access to objects.
4.6 Spindle Anti-Starvation
Many hard disk drives are optimized to achieve the highest possible throughput, and sacrifice fairness to achieve that goal. They tend to prefer reads or writes that are sequential. Since our system contains many streams that can be very large, we observed in developing our service that some disks would lock into servicing large pipelined reads or writes while starving other operations. On some disks we observed this could lock out nonsequential IO for as long as 2300 milliseconds. To avoid this problem we avoid scheduling new IO to a spindle when there is over 100ms of expected pending IO already scheduled or when there is any pending IO request that has been scheduled but not serviced for over 200ms. Using our own custom IO scheduling allows us to achieve fairness across reads/writes at the cost of slightly increasing overall latency on some sequential requests.
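The admission rule for a spindle can be written as a small predicate. The two thresholds come from the text; the function name and the way pending IO is represented (a total expected duration plus a list of ages of unserviced requests) are assumptions for illustration.

```python
from typing import List

MAX_EXPECTED_PENDING_MS = 100  # expected pending IO already scheduled
MAX_UNSERVICED_AGE_MS = 200    # age of any scheduled but unserviced request


def can_schedule(expected_pending_ms: float,
                 pending_ages_ms: List[float]) -> bool:
    """Return True if new IO may be scheduled to this spindle."""
    if expected_pending_ms > MAX_EXPECTED_PENDING_MS:
        return False
    if any(age > MAX_UNSERVICED_AGE_MS for age in pending_ages_ms):
        return False
    return True
```

A spindle with 40ms of expected pending IO but one request that has waited 250ms is skipped until that request has been serviced.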
5.1 Partition Layer Data Model
The partition layer provides an important internal data structure called an Object Table (OT). An OT is a massive table which can grow to several petabytes. Object Tables are dynamically broken up into RangePartitions (based on traffic load to the table) and spread across Partition Servers (Section 5.2) in a stamp. A RangePartition is a contiguous range of rows in an OT from a given low-key to a high-key. All RangePartitions for a given OT are non-overlapping, and every row is represented in some RangePartition.
The following are the Object Tables used by the partition layer.
The Account Table stores metadata and configuration for each storage account assigned to the stamp.
The Blob Table stores all blob objects for all accounts in the stamp.
The Entity Table stores all entity rows for all accounts in the stamp; it is used for the public Windows Azure Table data abstraction.
The Message Table stores all messages for all accounts’ queues in the stamp.
The Schema Table keeps track of the schema for all OTs.
The Partition Map Table keeps track of the current RangePartitions for all Object Tables and what partition server is serving each RangePartition. This table is used by the Front-End servers to route requests to the corresponding partition servers.
4.7 Durability and Journaling The durability contract for the stream layer is that when data is acknowledged as written by the stream layer, there must be at least three durable copies of the data stored in the system. This contract allows the system to maintain data durability even in the face of a cluster-wide power failure. We operate our storage system in such a way that all writes are made durable to power safe storage before they are acknowledged back to the client. As part of maintaining the durability contract while still achieving good performance, an important optimization for the stream layer is that on each extent node we reserve a whole disk drive or SSD as a journal drive for all writes into the extent node. The journal drive [11] is dedicated solely for writing a single sequential journal of data, which allows us to reach the full write throughput potential of the device. When the partition layer does a stream append, the data is written by the primary EN while in parallel sent to the two secondaries to be written. When each EN performs its append, it (a) writes all of the data for the append to the journal drive and (b) queues up the append to go to the data disk where the extent file lives on that EN. Once either succeeds, success can be returned. If the journal succeeds first, the data is also buffered in memory while it goes to the data disk, and any reads for that data are served from memory until the data is on the data disk. From that point on, the data is served from the data disk. This also enables the combining of contiguous writes into larger writes to the data disk, and better scheduling of concurrent writes and reads to get the best throughput. It is a tradeoff for good latency at the cost of an extra write off the critical path.
Each of the above OTs has a fixed schema stored in the Schema Table. The primary key for the Blob Table, Entity Table, and Message Table consists of three properties: AccountName, PartitionName, and ObjectName. These properties provide the indexing and sort order for those Object Tables.
5.1.1 Supported Data Types and Operations The property types supported for an OT’s schema are the standard simple types (bool, binary, string, DateTime, double, GUID, int32, int64). In addition, the system supports two special types – DictionaryType and BlobType. The DictionaryType allows for flexible properties (i.e., without a fixed schema) to be added to a row at any time. These flexible properties are stored inside of the dictionary type as (name, type, value) tuples. From a data access standpoint, these flexible properties behave like first-order properties of the row and are queryable just like any other property in the row. The BlobType is a special property used to store large amounts of data and is currently used only by the Blob Table. BlobType avoids storing the blob data bits with the row properties in the “row data stream”. Instead, the blob data bits are stored in a separate “blob data stream” and a pointer to the blob’s data bits (list of “extent + offset, length” pointers) is stored in the BlobType’s property in the row. This keeps the large data bits separated from the OT’s queryable row property values stored in the row data stream.
Even though the stream layer is an append-only system, we found that adding a journal drive provided important benefits, since the appends do not have to contend with reads going to the data disk in order to commit the result back to the client. The journal allows the append times from the partition layer to have more consistent and lower latencies. Take for example the partition layer’s commit log stream, where an append is only as fast as the slowest EN for the replicas being appended to. For small appends to the commit log stream without journaling we saw an average end-to-end stream append latency of 30ms. With journaling we see an average append latency of 6ms. In addition, the variance of latencies decreased significantly.
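The latency argument can be made concrete with a small model. It assumes, as the text states, that an EN can acknowledge as soon as either its journal write or its data disk write completes, and that the append completes only when the slowest of the replicas has acknowledged. The function names and all timings in the example are illustrative and are not measurements.

```python
def en_ack_ms(journal_ms: float, data_disk_ms: float) -> float:
    """An EN acknowledges when either the journal or the data disk write finishes."""
    return min(journal_ms, data_disk_ms)


def append_latency_ms(replica_times) -> float:
    """The append is only as fast as the slowest EN among the replicas."""
    return max(en_ack_ms(j, d) for j, d in replica_times)


# (journal_ms, data_disk_ms) per replica; made-up numbers for illustration.
with_journal = [(5.0, 30.0), (6.0, 25.0), (4.0, 40.0)]
# Without a journal drive, only the data disk path exists.
without_journal = [(float("inf"), d) for _, d in with_journal]

print(append_latency_ms(with_journal))
print(append_latency_ms(without_journal))
```

In this model a single data disk that is busy with reads no longer sets the append latency, since the journal path can acknowledge first.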
OTs support standard operations including insert, update, and delete operations on rows as well as query/get operations. In addition, OTs allow batch transactions across rows with the same PartitionName value. The operations in a single batch are committed as a single transaction. Finally, OTs provide snapshot isolation to allow read operations to happen concurrently with writes.
5. Partition Layer
5.2 Partition Layer Architecture
The partition layer stores the different types of objects and understands what a transaction means for a given object type (Blob, Table, or Queue). The partition layer provides the (a) data model for the different types of objects stored, (b) logic and semantics to process the different types of objects, (c) massively
The partition layer has three main architectural components as shown in Figure 4: a Partition Manager (PM), Partition Servers (PS), and a Lock Service.
5.3.1 Persistent Data Structure A RangePartition uses a Log-Structured Merge-Tree [17,4] to maintain its persistent data. Each Object Table’s RangePartition consists of its own set of streams in the stream layer, and the streams belong solely to a given RangePartition, though the underlying extents can be pointed to by multiple streams in different RangePartitions due to RangePartition splitting. The following are the set of streams that comprise each RangePartition (shown in Figure 5):
Figure 4: Partition Layer Architecture
Partition Manager (PM) – Responsible for keeping track of and splitting the massive Object Tables into RangePartitions and assigning each RangePartition to a Partition Server to serve access to the objects. The PM splits the Object Tables into N RangePartitions in each stamp, keeping track of the current RangePartition breakdown for each OT and to which partition servers they are assigned. The PM stores this assignment in the Partition Map Table. The PM ensures that each RangePartition is assigned to exactly one active partition server at any time, and that two RangePartitions do not overlap. It is also responsible for load balancing RangePartitions among partition servers. Each stamp has multiple instances of the PM running, and they all contend for a leader lock that is stored in the Lock Service (see below). The PM with the lease is the active PM controlling the partition layer.
Figure 5: RangePartition Data Structures
Metadata Stream – The metadata stream is the root stream for a RangePartition. The PM assigns a partition to a PS by providing the name of the RangePartition’s metadata stream. The metadata stream contains enough information for the PS to load a RangePartition, including the name of the commit log stream and data streams for that RangePartition, as well as pointers (extent+offset) into those streams for where to start operating in those streams (e.g., where to start processing in the commit log stream and the root of the index for the row data stream). The PS serving the RangePartition also writes in the metadata stream the status of outstanding split and merge operations that the RangePartition may be involved in.
Partition Server (PS) – A partition server is responsible for serving requests to a set of RangePartitions assigned to it by the PM. The PS stores all the persistent state of the partitions into streams and maintains a memory cache of the partition state for efficiency. The system guarantees that no two partition servers can serve the same RangePartition at the same time by using leases with the Lock Service. This allows the PS to provide strong consistency and ordering of concurrent transactions to objects for a RangePartition it is serving. A PS can concurrently serve multiple RangePartitions from different OTs. In our deployments, a PS serves on average ten RangePartitions at any time.
Commit Log Stream – Is a commit log used to store the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated for the RangePartition.
Lock Service – A Paxos Lock Service [3,13] is used for leader election for the PM. In addition, each PS also maintains a lease with the lock service in order to serve partitions. We do not go into the details of the PM leader election, or the PS lease management, since the concepts used are similar to those described in the Chubby Lock [3] paper.
Row Data Stream – Stores the checkpoint row data and index for the RangePartition.
Blob Data Stream – Is only used by the Blob Table to store the blob data bits.
Each of the above is a separate stream in the stream layer owned by an Object Table’s RangePartition.
On partition server failure, all N RangePartitions served by the failed PS are assigned to available PSs by the PM. The PM will choose N (or fewer) partition servers, based on the load on those servers. The PM will assign a RangePartition to a PS, and then update the Partition Map Table specifying what partition server is serving each RangePartition. This allows the Front-End layer to find the location of RangePartitions by looking in the Partition Map Table (see Figure 4). When the PS gets a new assignment it will start serving the new RangePartitions for as long as the PS holds its partition server lease.
Each RangePartition in an Object Table has only one data stream, except the Blob Table. A RangePartition in the Blob Table has a “row data stream” for storing its row checkpoint data (the blob index), and a separate “blob data stream” for storing the blob data bits for the special BlobType described earlier.
5.3.2 In-Memory Data Structures A partition server maintains the following in-memory components as shown in Figure 5:
5.3 RangePartition Data Structures
A PS serves a RangePartition by maintaining a set of in-memory data structures and a set of persistent data structures in streams.
Memory Table – This is the in-memory version of the commit log for a RangePartition, containing all of the recent updates that have not yet been checkpointed to the row data stream. When a lookup occurs the memory table is checked to find recent updates to the RangePartition.
Split – This operation identifies when a single RangePartition has too much load and splits the RangePartition into two or more smaller and disjoint RangePartitions, then load balances (reassigns) them across two or more partition servers.
Index Cache – This cache stores the checkpoint indexes of the row data stream. We separate this cache out from the row data cache to make sure we keep as much of the main index cached in memory as possible for a given RangePartition.
Merge – This operation merges together cold or lightly loaded RangePartitions that together form a contiguous key range within their OT. Merge is used to keep the number of RangePartitions within a bound proportional to the number of partition servers in a stamp.
Row Data Cache – This is a memory cache of the checkpoint row data pages. The row data cache is read-only. When a lookup occurs, both the row data cache and the memory table are checked, giving preference to the memory table.
WAS keeps the total number of partitions between a low watermark and a high watermark (typically around ten times the partition server count within a stamp). At equilibrium, the partition count will stay around the low watermark. If there are unanticipated traffic bursts that concentrate on a single RangePartition, it will be split to spread the load. When the total RangePartition count is approaching the high watermark, the system will increase the merge rate to eventually bring the RangePartition count down towards the low watermark. Therefore, the number of RangePartitions for each OT changes dynamically based upon the load on the objects in those tables.
Bloom Filters – If the data is not found in the memory table or the row data cache, then the index/checkpoints in the data stream need to be searched. It can be expensive to blindly examine them all. Therefore a bloom filter is kept for each checkpoint, which indicates if the row being accessed may be in the checkpoint. We do not go into further details about these components, since these are similar to those in [17,4].
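The lookup order described for these in-memory components might be sketched as follows. This is a simplified model under stated assumptions: the `BloomFilter` and `Checkpoint` classes, the filter sizing, and the newest-first ordering of checkpoints are choices made for this illustration and are not taken from WAS.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(key))


class Checkpoint:
    def __init__(self, rows: dict):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def lookup(key, memory_table, row_data_cache, checkpoints):
    """Check the memory table first, then the row data cache, then checkpoints."""
    if key in memory_table:
        return memory_table[key]
    if key in row_data_cache:
        return row_data_cache[key]
    for cp in checkpoints:  # assumed to be ordered newest first
        # Skip checkpoints whose filter says the row cannot be present.
        if cp.bloom.might_contain(key) and key in cp.rows:
            return cp.rows[key]
    return None
```

The filter only saves work on misses: a checkpoint whose filter returns False is never examined, while a True answer still requires the real index lookup.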
5.4 Data Flow When the PS receives a write request to the RangePartition (e.g., insert, update, delete), it appends the operation into the commit log, and then puts the newly changed row into the memory table. Therefore, all the modifications to the partition are recorded persistently in the commit log, and also reflected in the memory table. At this point success can be returned back to the client (the FE servers) for the transaction. When the size of the memory table reaches its threshold size or the size of the commit log stream reaches its threshold, the partition server will write the contents of the memory table into a checkpoint stored persistently in the row data stream for the RangePartition. The corresponding portion of the commit log can then be removed. To control the total number of checkpoints for a RangePartition, the partition server will periodically combine the checkpoints into larger checkpoints, and then remove the old checkpoints via garbage collection.
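The write path above can be summarized in a short sketch. It is a toy model: Python lists and dicts stand in for the commit log stream, memory table, and row data stream, and the threshold counts rows rather than bytes. The class and attribute names are hypothetical.

```python
class RangePartitionState:
    """Toy model of the write path: commit log, memory table, checkpoints."""

    def __init__(self, memtable_threshold: int = 3):
        self.commit_log = []    # stands in for the commit log stream
        self.memory_table = {}  # recent updates not yet checkpointed
        self.checkpoints = []   # stands in for the row data stream
        self.memtable_threshold = memtable_threshold

    def write(self, key, value) -> str:
        # 1. Append the operation to the commit log (the persistent record).
        self.commit_log.append((key, value))
        # 2. Reflect the change in the memory table.
        self.memory_table[key] = value
        # 3. Checkpoint when the memory table reaches its threshold.
        if len(self.memory_table) >= self.memtable_threshold:
            self.checkpoint()
        return "ok"  # success can be returned once steps 1 and 2 are done

    def checkpoint(self) -> None:
        self.checkpoints.append(dict(self.memory_table))
        self.memory_table.clear()
        # The corresponding portion of the commit log can now be removed.
        self.commit_log.clear()
```

In a real system the checkpoint would run in the background rather than inline with the write; the inline call here only keeps the sketch short.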
Having a high watermark of RangePartitions ten times the number of partition servers (a storage stamp has a few hundred partition servers) was chosen based on how big we can allow the stream and extent metadata to grow for the SM, and still completely fit the metadata in memory for the SM. Keeping many more RangePartitions than partition servers enables us to quickly distribute a failed PS or rack’s load across many other PSs. A given partition server can end up serving a single extremely hot RangePartition, tens of lightly loaded RangePartitions, or a mixture in-between, depending upon the current load to the RangePartitions in the stamp. The number of RangePartitions for the Blob Table vs. Entity Table vs. Message Table depends upon the load on the objects in those tables and is continuously changing within a storage stamp based upon traffic.
For the Blob Table’s RangePartitions, we also store the Blob data bits directly into the commit log stream (to minimize the number of stream writes for Blob operations), but those data bits are not part of the row data so they are not put into the memory table. Instead, the BlobType property for the row tracks the location of the Blob data bits (extent+offset, length). During checkpoint, the extents that would be removed from the commit log are instead concatenated to the RangePartition’s Blob data stream. Extent concatenation is a fast operation provided by the stream layer since it consists of just adding pointers to extents at the end of the Blob data stream without copying any data.
For each stamp, we typically see 75 splits and merges and 200 RangePartition load balances per day.
5.5.1 Load Balance Operation Details We track the load for each RangePartition as well as the overall load for each PS. For both of these we track (a) transactions/second, (b) average pending transaction count, (c) throttling rate, (d) CPU usage, (e) network usage, (f) request latency, and (g) data size of the RangePartition. The PM maintains heartbeats with each PS. This information is passed back to the PM in responses to the heartbeats. If the PM sees a RangePartition that has too much load based upon the metrics, then it will decide to split the partition and send a command to the PS to perform the split. If instead a PS has too much load, but no individual RangePartition seems to be too highly loaded, the PM will take one or more RangePartitions from the PS and reassign them to a more lightly loaded PS.
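The PM's choice between splitting and reassigning might be sketched as below. This is a deliberate simplification: it collapses the seven tracked metrics into a single load number per RangePartition and per PS, and the thresholds, the function name, and the rule of moving the busiest partition are assumptions made for the illustration.

```python
def plan_action(ps_load, partition_loads, ps_limit, partition_limit):
    """Decide what the PM might do for one partition server.

    partition_loads maps RangePartition name -> load (a single number here).
    Returns (action, partitions), where action is "split", "reassign" or "none".
    """
    hot = [name for name, load in partition_loads.items() if load > partition_limit]
    if hot:
        # An individual RangePartition is too loaded: split it.
        return ("split", hot)
    if ps_load > ps_limit and partition_loads:
        # The PS is overloaded but no single partition is: move one away.
        busiest = max(partition_loads, key=partition_loads.get)
        return ("reassign", [busiest])
    return ("none", [])


print(plan_action(90, {"P1": 30, "P2": 35, "P3": 25}, ps_limit=80, partition_limit=50))
```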
A PS can start serving a RangePartition by “loading” the partition. Loading a partition involves reading the metadata stream of the RangePartition to locate the active set of checkpoints and replaying the transactions in the commit log to rebuild the in-memory state. Once these are done, the PS has the up-to-date view of the RangePartition and can start serving requests.
To load balance a RangePartition, the PM sends an offload command to the PS, which will have the RangePartition write a current checkpoint before offloading it. Once complete, the PS acks back to the PM that the offload is done. The PM then assigns the RangePartition to its new PS and updates the Partition Map Table to point to the new PS. The new PS loads and starts serving traffic for the RangePartition. The loading of the RangePartition on the new PS is very quick since the commit log is small due to the checkpoint prior to the offload.
5.5 RangePartition Load Balancing A critical part of the partition layer is breaking these massive Object Tables into RangePartitions and automatically load balancing them across the partition servers to meet their varying traffic demands. The PM performs three operations to spread load across partition servers and control the total number of partitions in a stamp:
Load Balance – This operation identifies when a given PS has too much traffic and reassigns one or more RangePartitions to less loaded partition servers.
6. The PM then updates the Partition Map Table and its metadata information to reflect the merge.
5.5.2 Split Operation WAS splits a RangePartition due to too much load as well as the size of its row or blob data streams. If the PM identifies either situation, it tells the PS serving the RangePartition to split based upon load or size. The PM makes the decision to split, but the PS chooses the key (AccountName, PartitionName) where the partition will be split. To split based upon size, the RangePartition maintains the total size of the objects in the partition and the split key values where the partition can be approximately halved in size, and the PS uses that to pick the key for where to split. If the split is based on load, the PS chooses the key based upon Adaptive Range Profiling [16]. The PS adaptively tracks which key ranges in a RangePartition have the most load and uses this to determine on what key to split the RangePartition.
5.6 Partition Layer Inter-Stamp Replication Thus far we have talked about an AccountName being associated (via DNS) to a single location and storage stamp, where all data access goes to that stamp. We call this the primary stamp for an account. An account actually has one or more secondary stamps assigned to it by the Location Service, and this primary/secondary stamp information tells WAS to perform inter-stamp replication for this account from the primary stamp to the secondary stamp(s). One of the main scenarios for inter-stamp replication is to geo-replicate an account’s data between two data centers for disaster recovery. In this scenario, a primary and a secondary location are chosen for the account. Take, for example, an account for which we want the primary stamp (P) to be located in US South and the secondary stamp (S) to be located in US North. When provisioning the account, the LS will choose a stamp in each location and register the AccountName with both stamps such that the US South stamp (P) takes live traffic and the US North stamp (S) will take only inter-stamp replication (also called geo-replication) traffic from stamp P for the account. The LS updates DNS to have hostname AccountName.service.core.windows.net point to the storage stamp P’s VIP in US South. When a write comes into stamp P for the account, the change is fully replicated within that stamp using intra-stamp replication at the stream layer, and then success is returned to the client. After the update has been committed in stamp P, the partition layer in stamp P will asynchronously geo-replicate the change to the secondary stamp S using inter-stamp replication. When the change arrives at stamp S, the transaction is applied in the partition layer and this update fully replicates using intra-stamp replication within stamp S.
To split a RangePartition (B) into two new RangePartitions (C,D), the following steps are taken.
1. The PM instructs the PS to split B into C and D.
2. The PS in charge of B checkpoints B, then stops serving traffic briefly during step 3 below.
3. The PS uses a special stream operation “MultiModify” to take each of B’s streams (metadata, commit log and data) and creates new sets of streams for C and D respectively with the same extents in the same order as in B. This step is very fast, since a stream is just a list of pointers to extents. The PS then appends the new partition key ranges for C and D to their metadata streams.
4. The PS starts serving requests to the two new partitions C and D for their respective disjoint PartitionName ranges.
5. The PS notifies the PM of the split completion, and the PM updates the Partition Map Table and its metadata information accordingly. The PM then moves one of the split partitions to a different PS.
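Step 3 is fast because only pointers are copied. The sketch below models a stream as a Python list of extent identifiers and a RangePartition as a dict. This representation and the `split_partition` function are hypothetical and do not reflect the MultiModify API itself.

```python
def split_partition(b: dict, split_key: str):
    """Split B into C and D by copying extent pointers, not extent data.

    b has the form {"key_range": (low, high),
                    "streams": {stream_name: [extent_id, ...]}}.
    """
    low, high = b["key_range"]
    if not (low < split_key < high):
        raise ValueError("split key must fall strictly inside B's key range")

    def clone(key_range):
        # Each new stream lists the same extents in the same order as in B.
        streams = {name: list(extents) for name, extents in b["streams"].items()}
        return {"key_range": key_range, "streams": streams}

    return clone((low, split_key)), clone((split_key, high))


b = {
    "key_range": ("a", "z"),
    "streams": {"commit_log": ["e1", "e2"], "row_data": ["e7", "e8", "e9"]},
}
c, d = split_partition(b, "m")
print(c["key_range"], d["key_range"])
```

After the split both C and D point at the same extents as B, which matches the earlier remark that an extent can be pointed to by multiple streams in different RangePartitions.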
Since the inter-stamp replication is done asynchronously, recent updates that have not been inter-stamp replicated can be lost in the event of disaster. In production, changes are geo-replicated and committed on the secondary stamp within 30 seconds on average after the update was committed on the primary stamp.
5.5.3 Merge Operation To merge two RangePartitions, the PM will choose two RangePartitions C and D with adjacent PartitionName ranges that have low traffic. The following steps are taken to merge C and D into a new RangePartition E.
Inter-stamp replication is used for both account geo-replication and migration across stamps. For disaster recovery, we may need to perform an abrupt failover where recent changes may be lost, but for migration we perform a clean failover so there is no data loss. In both failover scenarios, the Location Service makes an active secondary stamp for the account the new primary and switches DNS to point to the secondary stamp’s VIP. Note that the URI used to access the object does not change after failover. This allows the existing URIs used to access Blobs, Tables and Queues to continue to work after failover.
1. The PM moves C and D so that they are served by the same PS. The PM then tells the PS to merge (C,D) into E.
2. The PS performs a checkpoint for both C and D, and then briefly pauses traffic to C and D during step 3.
3. The PS uses the MultiModify stream command to create a new commit log and data streams for E. Each of these streams is the concatenation of all of the extents from their respective streams in C and D. This merge means that the extents in the new commit log stream for E will be all of C’s extents in the order they were in C’s commit log stream followed by all of D’s extents in their original order. This layout is the same for the new row and Blob data stream(s) for E.
6. Application Throughput For our cloud offering, customers run their applications as a tenant (service) on VMs. For our platform, we separate computation and storage into their own stamps (clusters) within a data center since this separation allows each to scale independently and control their own load balancing. Here we examine the performance of a customer application running from their hosted service on VMs in the same data center as where their account data is stored. Each VM used is an extra-large VM with full control of the entire compute node and a 1Gbps NIC. The results were gathered on live shared production stamps with internal and external customers.
4. The PS constructs the metadata stream for E, which contains the names of the new commit log and data stream, the combined key range for E, and pointers (extent+offset) for the start and end of the commit log regions in E’s commit log derived from C and D, as well as the root of the data index in E’s data streams.
5. At this point, the new metadata stream for E can be correctly loaded, and the PS starts serving the newly merged RangePartition E.
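The stream concatenation in the merge steps can be illustrated in the same spirit. Streams are modeled as lists of extent identifiers, and the dict layout and the `merge_partitions` name are assumptions made for this sketch. The metadata stream for E is constructed separately in the real system and is represented here only by the combined key range.

```python
def merge_partitions(c: dict, d: dict) -> dict:
    """Merge adjacent RangePartitions C and D into E by concatenating extent lists.

    Each partition has the form {"key_range": (low, high),
                                 "streams": {stream_name: [extent_id, ...]}}.
    """
    if c["key_range"][1] != d["key_range"][0]:
        raise ValueError("C and D must cover adjacent key ranges")
    if c["streams"].keys() != d["streams"].keys():
        raise ValueError("C and D must have the same set of streams")
    # E's streams hold all of C's extents in order, followed by all of D's.
    streams = {name: c["streams"][name] + d["streams"][name] for name in c["streams"]}
    return {"key_range": (c["key_range"][0], d["key_range"][1]), "streams": streams}


c = {"key_range": ("a", "m"), "streams": {"commit_log": ["c1", "c2"], "row_data": ["c7"]}}
d = {"key_range": ("m", "z"), "streams": {"commit_log": ["d1"], "row_data": ["d7", "d8"]}}
e = merge_partitions(c, d)
print(e["streams"]["commit_log"])
```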
Figure 6 shows the WAS Table operation throughput in terms of the entities per second (y-axis) for 1-16 VMs (x-axis) performing random 1KB single entity get and put requests against a single 100GB Table. It also shows batch inserts of 100 entities at a time – a common way applications insert groups of entities into a WAS Table. Figure 7 shows the throughput in megabytes per second (y-axis) for randomly getting and putting 4MB blobs vs. the number of VMs used (x-axis). All of the results are for a single storage account.
WAS cloud storage service, which they can then access from any XBox console they sign into. The backing storage for this feature leverages Blob and Table storage. The XBox Telemetry service stores console-generated diagnostics and telemetry information for later secure retrieval and offline processing. For example, various Kinect related features running on Xbox 360 generate detailed usage files which are uploaded to the cloud to analyze and improve the Kinect experience based on customer opt-in. The data is stored directly into Blobs, and Tables are used to maintain metadata information about the files. Queues are used to coordinate the processing and the cleaning up of the Blobs. Microsoft’s Zune backend uses Windows Azure for media file storage and delivery, where files are stored as Blobs. Table 1 shows the relative breakdown among Blob, Table, and Queue usage across all (All) services (internal and external) using WAS as well as for the services described above. The table shows the breakdown of requests, capacity usage, and ingress and egress traffic for Blobs, Tables and Queues. Notice that about 17.9% of all requests are Blob requests, 46.88% are Table operations, and 35.22% are Queue requests across all services using WAS. But in terms of capacity, 70.31% of capacity is in Blobs, 29.68% of capacity is used by Tables, and 0.01% is used by Queues. “%Ingress” is the percentage breakdown of incoming traffic (bytes) among Blob, Table, and Queue; “%Egress” is the same for outbound traffic (bytes). The results show that different customers have very different usage patterns. In terms of capacity usage, some customers (e.g., Zune and Xbox GameSaves) have mostly unstructured data (such as media files) and put those into Blobs, whereas other customers like Bing and XBox Telemetry that have to index a lot of data have a significant amount of structured data in Tables.
Queues use very little space compared to Blobs and Tables, since they are primarily used as a communication mechanism instead of storing data over a long period of time.
Figure 6: Table Entity Throughput for 1-16 VMs
Figure 7: Blob Throughput for 1-16 VMs
These results show a linear increase in scale is achieved for entities/second as the application scales out the amount of computing resources it uses for accessing WAS Tables. For Blobs, the throughput scales linearly up to eight VMs, but tapers off as the aggregate throughput reaches the network capacity on the client side where the test traffic was generated. The results show that, for Table operations, batch puts offer about three times more throughput compared to single entity puts. That is because the batch operation significantly reduces the number of network roundtrips and requires fewer stream writes. In addition, the Table read operations have slightly lower throughput than write operations. This difference is due to the particular access pattern of our experiment, which randomly accesses a large key space on a large data set, minimizing the effect of caching. Writes on the other hand always result in sequential writes to the journal.
7. Workload Profiles Usage patterns for cloud-based applications can vary significantly. Section 1 already described a near-real time ingestion engine to provide Facebook and Twitter search for Bing. In this section we describe a few additional internal services using WAS, and give some high-level metrics of their usage.
Table 1: Usage Comparison for (Blob/Table/Queue)
Service          Type    %Requests  %Capacity  %Ingress  %Egress
All              Blob        17.9       70.31     48.28    66.17
All              Table       46.88      29.68     49.61    33.07
All              Queue       35.22       0.01      2.11     0.76
Bing             Blob         0.46      60.45     16.73    29.11
Bing             Table       98.48      39.55     83.14    70.79
Bing             Queue        1.06       0         0.13     0.1
XBox GameSaves   Blob        99.68      99.99     99.84    99.88
XBox GameSaves   Table        0.32       0.01      0.16     0.12
XBox GameSaves   Queue        0          0         0        0
XBox Telemetry   Blob        26.78      19.57     50.25    11.26
XBox Telemetry   Table       44.98      80.43     49.25    88.29
XBox Telemetry   Queue       28.24       0         0.5      0.45
Zune             Blob        94.64      99.9      98.22    96.21
Zune             Table        5.36       0.1       1.78     3.79
Zune             Queue        0          0         0        0
8. Design Choices and Lessons Learned Here, we discuss a few of our WAS design choices and relate some of the lessons we have learned thus far.
The XBox GameSaves service was announced at E3 this year and will provide a new feature in Fall 2011 for providing saved game data into the cloud for millions of XBox users. This feature will enable subscribed users to upload their game progress into the
Scaling Computation Separate from Storage – Early on we decided to separate customer VM-based computation from storage for Windows Azure. Therefore, nodes running a customer’s service code are separate from nodes providing their storage. As a result, we can scale our supply of computation cores and storage independently to meet customer demand in a given data center. This separation also provides a layer of isolation between compute and storage given its multi-tenancy usage, and allows both of the systems to load balance independently.
traffic will not). In addition, based on the request history at the AccountName and PartitionName levels, the system determines whether the account has been well-behaving. Load balancing will try to keep the servers within an acceptable load, but when access patterns cannot be load balanced (e.g., high traffic to a single PartitionName, high sequential access traffic, repetitive sequential scanning, etc.), the system throttles requests of such traffic patterns when they are too high.
Given this decision, our goal from the start has been to allow computation to efficiently access storage with high bandwidth without the data being on the same node or even in the same rack. To achieve this goal we are in the process of moving towards our next generation data center networking architecture [10], which flattens the data center networking topology and provides full bisection bandwidth between compute and storage.
Automatic Load Balancing – We found it crucial to have efficient automatic load balancing of partitions that can quickly adapt to various traffic conditions. This enables WAS to maintain high availability in this multi-tenancy environment as well as deal with traffic spikes to a single user’s storage account. Gathering the adaptive profile information, discovering what metrics are most useful under various traffic conditions, and tuning the algorithm to be smart enough to effectively deal with different traffic patterns we see in production were some of the areas we spent a lot of time working on before achieving a system that works well for our multi-tenancy environment.
Range Partitions vs. Hashing – We decided to use range-based partitioning/indexing instead of hash-based indexing (where the objects are assigned to a server based on the hash values of their keys) for the partition layer’s Object Tables. One reason for this decision is that range-based partitioning makes performance isolation easier since a given account’s objects are stored together within a set of RangePartitions (which also provides efficient object enumeration). Hash-based schemes have the simplicity of distributing the load across servers, but lose the locality of objects for isolation and efficient enumeration. The range partitioning allows WAS to keep a customer’s objects together in their own set of partitions to throttle and isolate potentially abusive accounts.
We started with a system that used a single number to quantify “load” on each RangePartition and each server. We first tried the product of request latency and request rate to represent the load on a PS and each RangePartition. This product is easy to compute and reflects the load incurred by the requests on the server and partitions. This design worked well for the majority of the load balancing needs (moving partitions around), but it did not correctly capture high CPU utilization that can occur during scans or high network utilization. Therefore, we now take into consideration request, CPU, and network loads to guide load balancing. However, these metrics are not sufficient to correctly guide splitting decisions.
For these reasons, we took the range-based approach and built an automatic load balancing system (Section 5.5) to spread the load dynamically according to user traffic by splitting and moving partitions among servers. A downside of range partitioning is scaling out access to sequential access patterns. For example, if a customer is writing all of their data to the very end of a table’s key range (e.g., insert key 2011-06-30:12:00:00, then key 2011-06-30:12:00:02, then key 2011-06-30:12:00:10), all of the writes go to the very last RangePartition in the customer’s table. This pattern does not take advantage of the partitioning and load balancing our system provides. In contrast, if the customer distributes their writes across a large number of PartitionNames, the system can quickly split the table into multiple RangePartitions and spread them across different servers to allow performance to scale linearly with load (as shown in Figure 6). To address this sequential access pattern for RangePartitions, a customer can always use hashing or bucketing for the PartitionName, which avoids the above sequential access pattern issue.
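One way a customer might apply the bucketing suggestion is sketched below. The function name `bucketed_partition_name`, the bucket count of 16, and the use of SHA-256 are illustrative assumptions; the source only says that hashing or bucketing the PartitionName avoids the sequential pattern.

```python
import hashlib


def bucketed_partition_name(natural_key, bucket_count=16):
    """Prefix a sequential key with a stable hash bucket.

    The bucket prefix spreads consecutive keys over `bucket_count` key ranges,
    at the cost of needing `bucket_count` range scans to read keys in order.
    """
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    bucket = digest[0] % bucket_count
    return f"{bucket:02d}-{natural_key}"
```

Because the prefix is derived from the key itself, a reader can recompute it for point lookups. The trade-off is that a time-ordered scan now has to merge results from every bucket.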
For splitting, we introduced separate mechanisms to trigger splits of partitions, where we collect hints to find out whether some partitions are reaching their capacity across several metrics. For example, we can trigger partition splits based on request throttling, request timeouts, the size of a partition, etc. Combining split triggers and the load balancing allows the system to quickly split and load balance hot partitions across different servers. From a high level, the algorithm works as follows. Every N seconds (currently 15 seconds) the PM sorts all RangePartitions based on each of the split triggers. The PM then goes through each partition, looking at the detailed statistics to figure out if it needs to be split using the metrics described above (load, throttling, timeouts, CPU usage, size, etc.). During this process, the PM picks a small number to split for this quantum, and performs the split action on those.
Throttling/Isolation – At times, servers become overloaded by customer requests. A difficult problem was identifying which storage accounts should be throttled when this happens and making sure well-behaving accounts are not affected.
After doing the split pass, the PM sorts all of the PSs based on each of the load balancing metrics - request load, CPU load and network load. It then uses this to identify which PSs are overloaded versus lightly loaded. The PM then chooses the PSs that are heavily loaded and, if there was a recent split from the prior split pass, the PM will offload one of those RangePartitions to a lightly loaded server. If there are still highly loaded PSs (without a recent split to offload), the PM offloads RangePartitions from them to the lightly loaded PSs.
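The split pass and the load balancing pass can be sketched together as follows. This is a simplified illustration under stated assumptions, not the PM's actual algorithm: load is a single number per partition, a split halves the load, and the names `split_pass`, `balance_pass`, `max_splits`, and `high_water` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RangePartition:
    name: str
    load: float                 # e.g. request latency * request rate
    needs_split: bool = False   # set by a split trigger (throttling, size, ...)


@dataclass
class PartitionServer:
    name: str
    partitions: list = field(default_factory=list)

    @property
    def load(self):
        return sum(p.load for p in self.partitions)


def split_pass(servers, max_splits=2):
    """Split a small number of the hottest partitions flagged for splitting."""
    candidates = [(p, s) for s in servers for p in s.partitions if p.needs_split]
    candidates.sort(key=lambda pair: pair[0].load, reverse=True)
    recently_split = []
    for partition, server in candidates[:max_splits]:
        server.partitions.remove(partition)
        halves = [RangePartition(f"{partition.name}.{i}", partition.load / 2)
                  for i in (0, 1)]
        server.partitions.extend(halves)
        recently_split.append((halves[1], server))
    return recently_split


def balance_pass(servers, recently_split, high_water=100.0):
    """Offload one partition from each overloaded server to the lightest one."""
    split_on = {id(server): part for part, server in recently_split}
    for server in sorted(servers, key=lambda s: s.load, reverse=True):
        if server.load <= high_water:
            continue
        target = min(servers, key=lambda s: s.load)
        if target is server:
            continue
        # Prefer a partition produced by a recent split; otherwise the hottest.
        victim = split_on.get(id(server)) or max(server.partitions,
                                                 key=lambda p: p.load)
        server.partitions.remove(victim)
        target.partitions.append(victim)
```

In a real deployment both passes would run on a timer (the text mentions every 15 seconds) and would consider request, CPU, and network load separately rather than one combined figure.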
Each partition server keeps track of the request rate for AccountNames and PartitionNames. Because there are a large number of AccountNames and PartitionNames it may not be practical to keep track of them all. The system uses a SampleHold algorithm [7] to track the request rate history of the top N busiest AccountNames and PartitionNames. This information is used to determine whether an account is well-behaving or not (e.g., whether the traffic backs off when it is throttled). If a server is getting overloaded, it uses this information to selectively throttle the incoming traffic, targeting accounts that are causing the issue. For example, a PS computes a throttling probability of the incoming requests for each account based on the request rate history for the account (those with high request rates will have a larger probability of being throttled, whereas accounts with little
The core load balancing algorithm can be dynamically “swapped out” via configuration updates. WAS includes scripting language support that enables customizing the load balancing logic, such as defining how a partition split can be triggered based on different system metrics. This support gives us flexibility to fine-tune the load balancing algorithm at runtime as well as try new algorithms according to various traffic patterns observed.
Separate Log Files per RangePartition – Performance isolation for storage accounts is critical in a multi-tenancy environment. This requirement is one of the reasons we used separate log streams for each RangePartition, whereas BigTable [4] uses a single log file across all partitions on the same server. Having separate log files enables us to isolate the load time of a RangePartition to just the recent object updates in that RangePartition.
spread evenly across different fault and upgrade domains for the storage service. This way, if a fault domain goes down, we lose at most 1/X of the servers for a given layer, where X is the number of fault domains. Similarly, during a service upgrade at most 1/Y of the servers for a given layer are upgraded at a given time, where Y is the number of upgrade domains. To achieve this, we use rolling upgrades, which enable us to maintain high availability when upgrading the storage service, and we upgrade a single upgrade domain at a time. For example, if we have ten upgrade domains, then upgrading a single domain would potentially upgrade ten percent of the servers from each layer at a time.
Journaling – When we originally released WAS, it did not have journaling. As a result, we experienced many hiccups with read/writes contending with each other on the same drive, noticeably affecting performance. We did not want to write to two log files (six replicas) like BigTable [4] due to the increased network traffic. We also wanted a way to optimize small writes, especially since we wanted separate log files per RangePartition. These requirements led us to the journal approach with a single log file per RangePartition. We found this optimization quite effective in reducing the latency and providing consistent performance.
During a service upgrade, storage nodes may go offline for a few minutes before coming back online. We need to maintain availability and ensure that enough replicas are available at any point in time. Even though the system is built to tolerate isolated failures, these planned (massive) upgrade “failures” can be dealt with more efficiently than if they were treated as abrupt massive failures. The upgrade process is automated so that it is tractable to manage a large number of these large-scale deployments. The automated upgrade process goes through each upgrade domain one at a time for a given storage stamp. Before taking down an upgrade domain, the upgrade process notifies the PM to move the partitions out of that upgrade domain and notifies the SM to not allocate new extents in that upgrade domain. Furthermore, before taking down any servers, the upgrade process checks with the SM to ensure that there are sufficient extent replicas available for each extent outside the given upgrade domain. After upgrading a given domain, a set of validation tests is run to make sure the system is healthy before proceeding to the next upgrade domain. This validation is crucial for catching issues during the upgrade process and stopping it early should an error occur.
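The control flow of the rolling upgrade can be sketched as a loop over upgrade domains. The function and parameter names below (`rolling_upgrade`, `replicas_outside`, `min_replicas`, `upgrade`, `validate`) are hypothetical stand-ins for the interactions with the SM and the validation tests, and the sketch leaves out the notifications to the PM and SM that precede each step.

```python
def rolling_upgrade(upgrade_domains, replicas_outside, min_replicas,
                    upgrade, validate):
    """Upgrade one domain at a time and stop at the first failed check.

    upgrade_domains: ordered list of domain ids.
    replicas_outside(domain): smallest number of replicas any extent would keep
        if `domain` went offline (stands in for the check made with the SM).
    upgrade(domain), validate(domain): callbacks supplied by the caller.
    Returns the list of domains that were upgraded and validated.
    """
    done = []
    for domain in upgrade_domains:
        if replicas_outside(domain) < min_replicas:
            break                      # not safe to take this domain down
        upgrade(domain)
        if not validate(domain):
            break                      # stop early so the fault cannot spread
        done.append(domain)
    return done
```

With ten upgrade domains, each iteration touches roughly one tenth of the servers in a layer, so a bad build caught by validation in the first domain never reaches the other nine.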
Append-only System – Having an append-only system and sealing an extent upon failure have greatly simplified the replication protocol and handling of failure scenarios. In this model, the data is never overwritten once committed to a replica, and, upon failures, the extent is immediately sealed. This model allows the consistency to be enforced across all the replicas via their commit lengths. Furthermore, the append-only system has allowed us to keep snapshots of the previous states at virtually no extra cost, which has made it easy to provide snapshot/versioning features. It also has allowed us to efficiently provide optimizations like erasure coding. In addition, append-only has been a tremendous benefit for diagnosing issues as well as repairing/recovering the system in case something goes wrong. Since the history of changes is preserved, tools can easily be built to diagnose issues and to repair or recover the system from a corrupted state back to a prior known consistent state. When operating a system at this scale, we cannot emphasize enough the benefit we have seen from using an append-only system for diagnostics and recovery.
Multiple Data Abstractions from a Single Stack – Our system supports three different data abstractions from the same storage stack: Blobs, Tables and Queues. This design enables all data abstractions to use the same intra-stamp and inter-stamp replication, use the same load balancing system, and realize the benefits from improvements in the stream and partition layers. In addition, because the performance needs of Blobs, Tables, and Queues are different, our single stack approach enables us to reduce costs by running all services on the same set of hardware. Blobs use the massive disk capacity, Tables use the I/O spindles from the many disks on a node (but do not require as much capacity as Blobs), and Queues mainly run in memory. Therefore, we are not only blending different customers’ workloads together on shared resources, we are also blending together Blob, Table, and Queue traffic across the same set of storage nodes.
An append-based system comes with certain costs. An efficient and scalable garbage collection (GC) system is crucial to keep the space overhead low, and GC comes at a cost of extra I/O. In addition, the data layout on disk may not be the same as the virtual address space of the data abstraction stored, which led us to implement prefetching logic for streaming large data sets back to the client.
End-to-end Checksums – We found it crucial to keep checksums for user data end to end. For example, during a blob upload, once the Front-End server receives the user data, it immediately computes the checksum and sends it along with the data to the backend servers. Then at each layer, the partition server and the stream servers verify the checksum before continuing to process it. If a mismatch is detected, the request is failed. This prevents corrupted data from being committed into the system. We have seen cases where a few servers had hardware issues, and our end-to-end checksum caught such issues and helped maintain data integrity. Furthermore, this end-to-end checksum mechanism also helps identify servers that consistently have hardware issues so we can take them out of rotation and mark them for repair.
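The end-to-end checksum flow can be sketched in a few lines. The source does not say which checksum algorithm WAS uses, so CRC32 from Python's standard `zlib` module is an assumption here, as are the names `front_end_receive`, `verify`, `store_blob`, and the `corrupt_at` parameter used to simulate a faulty server.

```python
import zlib


class ChecksumMismatch(Exception):
    """Raised when a layer sees data that no longer matches its checksum."""


def front_end_receive(payload: bytes):
    """Compute the checksum once, as soon as the data arrives."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, checksum: int, layer: str) -> None:
    """Re-check the payload at a layer boundary and fail the request on mismatch."""
    if zlib.crc32(payload) != checksum:
        raise ChecksumMismatch(f"corruption detected at {layer}")


def store_blob(payload: bytes, corrupt_at=None):
    """Pass the data through each layer, verifying the checksum at every hop."""
    data, checksum = front_end_receive(payload)
    for layer in ("partition server", "stream server"):
        if layer == corrupt_at:          # simulate a server with bad hardware
            data = b"\x00" + data[1:]
        verify(data, checksum, layer)
    return checksum
```

The exception message names the layer that detected the problem, which is the kind of signal that would let operators spot servers that repeatedly corrupt data.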
Use of System-defined Object Tables – We chose to use a fixed number of system-defined Object Tables to build Blob, Table, and Queue abstractions instead of exposing the raw Object Table semantics to end users. This decision reduces management by our system to only the small set of schemas of our internal, system-defined Object Tables. It also provides for easy maintenance and upgrade of the internal data structures and isolates changes of these system-defined tables from end user data abstractions.
Offering Storage in Buckets of 100TBs – We currently limit the amount of storage for an account to be no more than 100TB. This constraint allows all of the storage account data to fit within a given storage stamp, especially since our initial storage stamps held only two petabytes of raw data (the new ones hold 20-30PB). To obtain more storage capacity within a single data center, customers use more than one account within that location. This
Upgrades – A rack in a storage stamp is a fault domain. A concept orthogonal to fault domain is what we call an upgrade domain (a set of servers briefly taken offline at the same time during a rolling upgrade). Servers for each of the three layers are
ended up being a reasonable tradeoff for many of our large customers (storing petabytes of data), since they are typically already using multiple accounts to partition their storage across different regions and locations (for local access to data for their customers). Therefore, partitioning their data across accounts within a given location to add more storage often fits into their existing partitioning design. Even so, it does require large services to have account level partitioning logic, which not all customers naturally have as part of their design. Therefore, we plan to increase the amount of storage that can be held within a given storage account in the future.
interactions. The system provides a programmable interface for all of the main operations in our system as well as the points in the system to create faults. Some examples of these pressure point commands are: checkpoint a RangePartition, combine a set of RangePartition checkpoints, garbage collect a RangePartition, split/merge/load balance RangePartitions, erasure code or unerasure code an extent, crash each type of server in a stamp, inject network latencies, inject disk latencies, etc. The pressure point system is used to trigger all of these interactions during a stress run in specific orders or randomly. This system has been instrumental in finding and reproducing issues from complex interactions that might have taken years to naturally occur on their own.
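A pressure point harness of this kind might look like the sketch below. The class `PressurePoints` and its methods are hypothetical, and the use of a seed to make random runs repeatable is an assumption about how issues could be reproduced, not something the source states.

```python
import random


class PressurePoints:
    """Registry of named operations that a stress run can trigger on demand."""

    def __init__(self):
        self._commands = {}

    def register(self, name, action):
        """Expose an operation (e.g. a checkpoint or a crash) under a name."""
        self._commands[name] = action

    def run_sequence(self, names):
        """Trigger commands in a specific order."""
        for name in names:
            self._commands[name]()

    def run_random(self, steps, seed):
        """Trigger commands at random; the seed makes a failing run repeatable."""
        rng = random.Random(seed)
        order = [rng.choice(sorted(self._commands)) for _ in range(steps)]
        self.run_sequence(order)
        return order
```

Returning the order from `run_random` lets a failing random run be replayed later through `run_sequence`.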
CAP Theorem – WAS provides high availability with strong consistency guarantees. This combination seems to violate the CAP theorem [2], which says a distributed system cannot have availability, consistency, and partition tolerance at the same time. However, our system, in practice, provides all three of these properties within a storage stamp. This situation is made possible through layering and designing our system around a specific fault model.
9. Related Work
Prior studies [9] revealed the challenges in achieving strong consistency and high availability in a poorly-connected network environment. Some systems address this by reducing consistency guarantees to achieve high availability [22,14,6]. But this shifts the burden to the applications to deal with conflicting views of data. For instance, Amazon’s SimpleDB was originally introduced with an eventual consistency model and more recently added strongly consistent operations [23]. Van Renesse et al. [20] have shown, via Chain Replication, the feasibility of building large-scale storage systems providing both strong consistency and high availability, which was later extended to allow reading from any replica [21]. Given our customer needs for strong consistency, we set out to provide a system that can provide strong consistency with high availability along with partition tolerance for our fault model.
The stream layer has a simple append-only data model, which provides high availability in the face of network partitioning and other failures, whereas the partition layer, built upon the stream layer, provides strong consistency guarantees. This layering allows us to decouple the nodes responsible for providing strong consistency from the nodes storing the data with availability in the face of network partitioning. This decoupling and targeting a specific set of faults allows our system to provide high availability and strong consistency in the face of various classes of failures we see in practice. For example, the types of network partitioning we have seen within a storage stamp are node failures and top-of-rack (TOR) switch failures. When a TOR switch fails, the given rack will stop being used for traffic: the stream layer will stop using that rack and start using extents on available racks to allow streams to continue writing. In addition, the partition layer will reassign its RangePartitions to partition servers on available racks to allow all of the data to continue to be served with high availability and strong consistency. Therefore, our system is designed to be able to provide strong consistency with high availability for the network partitioning issues that are likely to occur in our system (at the node level as well as TOR failures).
As in many other highly-available distributed storage systems [6,14,1,5], WAS also provides geo-redundancy. Some of these systems put geo-replication on the critical path of the live application requests, whereas we made a design trade-off to take a classical asynchronous geo-replication approach [18] and leave it off the critical path. Performing the geo-replication completely asynchronously allows us to provide better write latency for applications, and allows more optimizations, such as batching and compaction for geo-replication, and efficient use of cross-data center bandwidth. The tradeoff is that if there is a disaster and an abrupt failover needs to occur, then there is unavailability during the failover and a potential loss of recent updates to a customer’s account.
High-performance Debug Logging – We used an extensive debug logging infrastructure throughout the development of WAS. The system writes logs to the local disks of the storage nodes and provides a grep-like utility to do a distributed search across all storage node logs. We do not push these verbose logs off the storage nodes, given the volume of data being logged.
The closest system to ours is GFS [8,15] combined with BigTable [4]. A few differences from these prior publications are: (1) GFS allows relaxed consistency across replicas and does not guarantee that all replicas are bitwise the same, whereas WAS provides that guarantee, (2) BigTable combines multiple tablets into a single commit log and writes them to two GFS files in parallel to avoid GFS hiccups, whereas we found we could work around both of these by using journaling in our stream layer, and (3) we provide a scalable Blob storage system and batch Table transactions integrated into a BigTable-like framework. In addition, we describe how WAS automatically load balances, splits, and merges RangePartitions according to application traffic demands.
When bringing WAS to production, reducing logging for performance reasons was considered. The utility of verbose logging, though, made us wary of reducing the amount of logging in the system. Instead, the logging system was optimized to vastly increase its performance and reduce its disk space overhead by automatically tokenizing and compressing output, achieving a system that can log 100’s of MB/s with little application performance impact per node. This feature allows retention of many days of verbose debug logs across a cluster. The high-performance logging system and associated log search tools are critical for investigating any problems in production in detail without the need to deploy special code or reproduce problems.
10. Conclusions
The Windows Azure Storage platform implements essential services for developers of cloud based solutions. Strong consistency, a global partitioned namespace, and disaster recovery have been important customer features in WAS’s multi-tenancy environment. WAS runs a disparate set of workloads with
Pressure Point Testing – It is not practical to create tests for all combinations of all complex behaviors that can occur in a large scale distributed system. Therefore, we use what we call Pressure Points to aid in capturing these complex behaviors and
various peak usage profiles from many customers on the same set of hardware. This significantly reduces storage cost since the amount of resources to be provisioned is significantly less than the sum of the peak resources required to run all of these workloads on dedicated hardware.
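The saving from sharing hardware can be illustrated with a small worked example. The tenants and demand figures below are invented for illustration only; the point is that shared hardware is sized for the peak of the combined demand, which is at most the sum of the individual peaks.

```python
# Hypothetical hourly demand (arbitrary units) for three tenants whose peaks
# fall at different times of day.
demand = {
    "tenant_a": [90, 20, 10, 15],
    "tenant_b": [10, 80, 20, 10],
    "tenant_c": [15, 10, 70, 25],
}

# Dedicated hardware must cover each tenant's own peak.
sum_of_peaks = sum(max(series) for series in demand.values())

# Shared hardware must cover only the peak of the combined demand.
combined = [sum(hour) for hour in zip(*demand.values())]
peak_of_sum = max(combined)
```

In this made-up case dedicated hardware would need 240 units while shared hardware needs 115, because the three peaks never coincide.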
[4] F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data," in OSDI, 2006.
[5] B. Cooper et al., "PNUTS: Yahoo!'s Hosted Data Serving Platform," VLDB, vol. 1, no. 2, 2008.
[6] G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store," in SOSP, 2007.
As our examples demonstrate, the three storage abstractions, Blobs, Tables, and Queues, provide mechanisms for storage and workflow control for a wide range of applications. Not mentioned, however, is the ease with which the WAS system can be utilized. For example, the initial version of the Facebook/Twitter search ingestion engine took one engineer only two months from the start of development to launching the service. This experience illustrates our service's ability to empower customers to easily develop and deploy their applications to the cloud.
[7] Cristian Estan and George Varghese, "New Directions in Traffic Measurement and Accounting," in SIGCOMM, 2002.
[8] S. Ghemawat, H. Gobioff, and S.T. Leung, "The Google File System," in SOSP, 2003.
[9] J. Gray, P. Helland, P. O'Neil, and D. Shasha, "The Dangers of Replication and a Solution," in SIGMOD, 1996.
Additional information on Windows Azure and Windows Azure Storage is available at http://www.microsoft.com/windowsazure/.
[10] Albert Greenberg et al., "VL2: A Scalable and Flexible Data Center Network," Communications of the ACM, vol. 54, no. 3, pp. 95-104, 2011.
Acknowledgements
[11] Y. Hu and Q. Yang, "DCD—Disk Caching Disk: A New Approach for Boosting I/O Performance," in ISCA, 1996.
We would like to thank Geoff Voelker, Greg Ganger, and anonymous reviewers for providing valuable feedback on this paper.
[12] H.T. Kung and John T. Robinson, "On Optimistic Methods for Concurrency Control," ACM Transactions on Database Systems, vol. 6, no. 2, pp. 213-226, June 1981.
We would like to acknowledge the creators of Cosmos (Bing’s storage system): Darren Shakib, Andrew Kadatch, Sam McKelvie, Jim Walsh and Jonathan Forbes. We started Windows Azure 5 years ago with Cosmos as our intra-stamp replication system. The data abstractions and append-only extent-based replication system presented in Section 4 were created by them. We extended Cosmos to create our stream layer by adding mechanisms to allow us to provide strong consistency in coordination with the partition layer, stream operations to allow us to efficiently split/merge partitions, journaling, erasure coding, spindle anti-starvation, read load-balancing, and other improvements.
[13] Leslie Lamport, "The Part-Time Parliament," ACM Transactions on Computer Systems, vol. 16, no. 2, pp. 133-169, May 1998.
[14] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," SIGOPS Operating System Review, vol. 44, no. 2, 2010.
[15] M. McKusick and S. Quinlan, "GFS: Evolution on Fast-forward," ACM File Systems, vol. 7, no. 7, 2009.
[16] S. Mysore, B. Agrawal, T. Sherwood, N. Shrivastava, and S. Suri, "Profiling over Adaptive Ranges," in Symposium on Code Generation and Optimization, 2006.
We would also like to thank additional contributors to Windows Azure Storage: Maneesh Sah, Matt Hendel, Kavitha Golconda, Jean Ghanem, Joe Giardino, Shuitao Fan, Justin Yu, Dinesh Haridas, Jay Sreedharan, Monilee Atkinson, Harshawardhan Gadgil, Phaneesh Kuppahalli, Nima Hakami, Maxim Mazeev, Andrei Marinescu, Garret Buban, Ioan Oltean, Ritesh Kumar, Richard Liu, Rohit Galwankar, Brihadeeshwar Venkataraman, Jayush Luniya, Serdar Ozler, Karl Hsueh, Ming Fan, David Goebel, Joy Ganguly, Ishai Ben Aroya, Chun Yuan, Philip Taron, Pradeep Gunda, Ryan Zhang, Shyam Antony, Qi Zhang, Madhav Pandya, Li Tan, Manish Chablani, Amar Gadkari, Haiyong Wang, Hakon Verespej, Ramesh Shankar, Surinder Singh, Ryan Wu, Amruta Machetti, Abhishek Singh Baghel, Vineet Sarda, Alex Nagy, Orit Mazor, and Kayla Bunch.
[17] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil, "The Log-Structured Merge-Tree (LSM-tree)," Acta Informatica, vol. 33, no. 4, 1996.
[18] H. Patterson et al., "SnapMirror: File System Based Asynchronous Mirroring for Disaster Recovery," in USENIX-FAST, 2002.
[19] Irving S. Reed and Gustave Solomon, "Polynomial Codes over Certain Finite Fields," Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300-304, 1960.
[20] R. Renesse and F. Schneider, "Chain Replication for Supporting High Throughput and Availability," in USENIX-OSDI, 2004.
Finally we would like to thank Amitabh Srivastava, G.S. Rana, Bill Laing, Satya Nadella, Ray Ozzie, and the rest of the Windows Azure team for their support.
[21] J. Terrace and M. Freedman, "Object Storage on CRAQ: High-throughput chain replication for read-mostly workloads," in USENIX'09, 2009.
References
[22] D. Terry, K. Petersen, M. Theimer, A. Demers, M. Spreitzer, and C. Hauser, "Managing Update Conflicts in Bayou, A Weakly Connected Replicated Storage System," in ACM SOSP, 1995.
[1] J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," in Conf. on Innovative Data Systems Research, 2011.
[2] Eric A. Brewer, "Towards Robust Distributed Systems. (Invited Talk)," in Principles of Distributed Computing, Portland, Oregon, 2000.
[23] W. Vogels, "All Things Distributed - Choosing Consistency," in http://www.allthingsdistributed.com/2010/02/strong_consistency_simpledb.html, 2010.
[3] M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems," in OSDI, 2006.
A Smorgasbord of Embedded and Pervasive Computing Research Kishore Ramachandran
(part of systems group which includes Ada Gavrilovska, Taesoo Kim, Ling Liu, Calton Pu, and Alexey Tumanov)
Current PhD inmates!
Tyler Landle
Manasvini Sethuraman
Anirudh Sarma
Alan Nussbaum
Difei Cao
Jinsun Yoo
Recently escaped!
Hyojun Kim (IBM Almaden then Startup and now Google); Lateef Yusuf (Amazon then Google); Mungyung Ryu (Google); Kirak Hong (Google and now CTRL-labs); Dave Lillethun (Seattle U.);
Dushmanta Mohapatra (Oracle); Wonhee Cho (Microsoft); Beate Ottenwalder (Bosch); Ruben Mayer (TU Munich), Ashish Bijlani (Startup)
Plus a number of MS and UGs
Embedded Pervasive Lab
Pervasive side of the house Embedded devices treated as black boxes System Support for IoT
Fog/Edge computing
Current-Generation Applications
Next-Generation Applications
Next-Generation Applications
• Sense -> Process -> Actuate
• Common Characteristics
• Dealing with real-world data streams
• Real-time interaction among mobile devices
• Wide-area analytics
• Requirements
• Dynamic scalability
• Low-latency communication
• Efficient in-network processing
Cloud Computing
● Good for web apps at human perception speeds
● Throughput oriented web apps with human in the loop
● Not good for many latency-sensitive IoT apps at computational perception speeds
● sense -> process -> actuate
● Other considerations
● Limited by backhaul bandwidth for transporting plethora of 24x7 sensor streams
● Not all sensor streams meaningful => Quench the streams at the source
● Privacy and regulatory requirements
Fog/Edge Computing • Extending the cloud utility computing to the edge • Provide utility computing using resources that are • Hierarchical • Geo-distributed Mobile / Sensor Edge Core
Fog/edge computing today • Edge is slave of the Cloud • Platforms: Azure IoT Edge, Cisco IOx, Intel FRD, …
• Mobile apps beholden to the Cloud
Vision for the future
• Elevate Edge to be a peer of the Cloud • Prior art: Cloudlets (CMU+Microsoft), MAUI (Microsoft)
• In the limit • Make the Edge autonomous even if disconnected from the Cloud
Why
?
• Interacting entities (e.g., connected vehicles) connected to different edge nodes • Horizontal (p2p) interactions among edge nodes essential
Why
?
• Autonomy of edge (disaster recovery)
Challenges for making • Need for powerful frameworks akin to the Cloud at the edge • Programming models, storage abstractions, pub/sub systems, …
• Geo-distributed data replication and consistency models • Heterogeneity of network resources • Resilience to coordinated power failures
• Rapid deployment of application components, multi-tenancy, and elasticity at the edge • Cognizant of limited computational, networking, and storage resources
Thoughts on Meeting the Challenges (https://www.cc.gatech.edu/~rama/recent_pubs.html) Theme: Elevating the Edge to be a peer of the Cloud • Geo-distributed programming model for Edge/Cloud continuum • OneEdge (ACM SoCC 2021) • Foglets (ACM DEBS 2016)
• Geo-distributed data management – replica placement, migration and consistency • EPulsar (ACM SEC 2021) • FogStore (ACM DEBS 2018) • DataFog (HotEdge 2018)
• Efficient Edge runtimes • Serverless functions using WebAssembly (ACM IoTDI 2019)
• Applications using autonomous Edge
• Social Sensing sans Cloud (SocialSens 2017)
• STTR: Space Time Trajectory Registration (ACM DEBS 2018)
• STVT: Space-Time Vehicle Tracking (HotVideoEdge 2019)
• Coral-Pie: Space-Time Vehicle Tracking at the Edge (ACM Middleware 2020)
• Vision: “A case for elevating the edge to be a peer of the cloud”, GetMobile, 2020 • Vision: "eCloud: Vision for the Evolution of Edge-Cloud Continuum", IEEE Computer, 2021.
Ongoing Work • eCloudSim (with Daglis, Chatterjee, Dhekne) • eCloud Simulation Infrastructure
• Prescient video prefetching at the edge for AV infotainment (With Dhekne)
• Use route to JIT prefetching and caching for DASH player on vehicle
• Foresight (ACM MMSys 2021)
• Use mmWAVE (integrated with 5G LTE for edge node selection) to beam to passing vehicle
• ClairvoyantEdge (in submission)
• Edge centric video data management systems for AV (With Arulraj)
• Annotations with video for query processing, multi-tenancy, and sharing
• EVA (To appear in ACM SIGMOD)
• Nimble execution environments for the Edge • Analyze cold start times in containers • Clean slate exec environment for FaaS
• NFSlicer: dataplane optimizations for processing network functions (With Daglis)
• Selective data movement (e.g., header vs. payload) for NF chaining
• NFSlicer (in submission)
• Performance isolation in the Edge (with Sameh Elnikety – Microsoft) • Harvest Container: Mixing latency-sensitive and throughput-oriented apps at the edge
• Hardware accelerators in micro-datacenters (with Tushar Krishna) • Efficiently using accelerators to improve efficiency in edge datacenters.
Embedded side of the house
Infinite storage for mobile devices Optimizing Mobile Video Downloads
Infinite Storage for Mobile Devices
Seamlessly extend the storage on mobile to the Cloud for any app
User space file system APSys 2018, USENIX ATC 2019, Sigmetrics 2021
Use machine learning to build user’s everyday working set and (off)load (un)wanted data Issues
Latency Energy consumption Security and privacy
Optimizing Mobile Video Downloads
Foresight (ACM MMSys 2021)
Bandwidth prediction across space and time for mobile users
ClairvoyantEdge
Short range mmWave augmentation at Edge for high bandwidth video delivery
Recap Infinite storage for mobile devices Optimizing Mobile Video Downloads
Foresight, ClairvoyantEdge
Fog/Edge computing
eCloud Foglets, OneEdge, thin virtualization for FaaS Fogstore, DataFog, ePulsar, NFSlicer STTR, Socialsens
Ongoing Projects
eCloud: Device-Edge-Cloud continuum OneEdge: Device/Edge/Cloud control plane using AV as exemplar
Foresight and ClairvoyantEdge: Prescient video prefetching at the edge for AV infotainment (With Prof. Dhekne)
Use route to JIT prefetching and caching for DASH player on vehicle Use mmWAVE (integrated with 5G LTE for edge node selection) to beam to passing vehicle
Annotations with video for query processing, multi-tenancy, and sharing
Edge centric video data management systems for AV (With Prof. Arulraj) Nimble execution environments for the Edge
Analyze cold start times in containers Clean slate exec environment for FaaS
Selective data movement (e.g., header vs. payload) for NF chaining
Smart information services without WAN connectivity
Scheduling edge resources, monitoring, migration
NFSlicer: dataplane for processing network functions (With Prof. Daglis)
MicroEdge: Low-cost edge architecture for camera processing (With Prof. Krishna) Edge computing solution for underserved communities
Pubs: http://www.cc.gatech.edu/~rama/recent_pubs.html E-mail: [email protected]
Lab: http://wiki.cc.gatech.edu/epl
What does Kishore “really” do in his copious spare time when he is not teaching?
Squash anyone?
Table tennis anyone?
What you should take away?
“Kishore” rhymes with “sea shore” Squash/Table-tennis EPL Fog/Edge computing Infinite storage on mobile/EdgeCaching
https://www.dreamstime.com/namaste-vector-hand-drawn-symbol-yoga-design-image112101574
Extra slides on EPL projects underway Kishore Ramachandran
Camera Processing @ Edge Two topics
Serverless @ Edge
Camera Network Context
Streaming 24 x 7 Used by multiple applications Processing at ingestion time Processing at Edge
Applications Suspicious vehicle tracking Smart traffic light Camera-assisted navigation for people with disabilities
Motivating Application Space-Time Vehicle Tracking at video ingestion time Manual Checking Labor-intensive Error-prone due to lapses in attention Intelligent systems Replace manual labor Aid human decision-making
Outline Recent work on camera applications at Edge Coral-Pie: Space-Time Vehicle Tracking at the Edge (ACM Middleware 2020)
Proposed Work in a nutshell Camera Networks at the Edge • MicroEdge: multi-tenancy edge system architecture • Accelerator orchestration in MicroEdge
Serverless at the Edge • Nimble orchestration for FaaS • Cross-site load balancing • Location-aware auto-scaling
Recent Work Novel Principles "Scalable-by-design" camera processing at ingestion time Per-camera latency-based subtask placement Bounded network communication per camera
Fault tolerance for geo-distributed camera network Automatic camera topology management
Coral-Pie: Device-Edge-Cloud Continuum Cloud Camera Topology Server
Edge Storage Edge
Device: Storage
• Camera + 2 RPis + TPU
Heavy-lifting real-time camera processing
Device
• Deterministic latency bounds • No pressure on the backhaul network bandwidth
Vehicle Identification Communication
Vehicle Re-identification
Pipeline
Pipelined processing to exploit parallelism
Other Devices
...
Coral-Pie: Device-Edge-Cloud Continuum Cloud Camera Topology Server
Edge Storage
Edge:
Storage
• A multi-tenant micro-datacenter housed in a small footprint location • Few network hop(s) away from Devices
Edge
Offloading storage
Device Vehicle Identification Communication
Vehicle Re-identification
Pipeline
• Asynchronously • Helps balance the pipeline of tasks on the Devices • Backhaul wide-area network (WAN) is not pressured
Other Devices
...
Coral-Pie: Device-Edge-Cloud Continuum Cloud Camera Topology Server
Edge Storage
Cloud Edge • A centralized service provider that all devices can connect to • Global knowledge
Storage Device Vehicle Identification Communication
Vehicle Re-identification
Pipeline
Maintains geographical relationship between the cameras
Other Devices
• Infrequently updated (i.e., a new camera is deployed or an old camera is removed)
...
Coral-Pie in Action
Camera Topology Server
C
B
A
Cloud: Management of Camera Topology Camera Topology Server JOIN: [33.778406, -84.401304]
JOIN: [33.778279, -84.399226]
C
JOIN: [33.778254, -84.397794]
B
A
Cloud: Management of Camera Topology Camera Topology Server Topo: {‘⟶’: ‘B’}
Topo: {‘⟵’: ‘C’, ‘⟶’: ‘A’}
C
Topo: {‘⟵’: ‘B’}
B
A
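The JOIN/Topo exchange on these two slides can be sketched as below. This is a minimal illustration, not Coral-Pie's implementation. It assumes the three cameras sit along a single east-west road, so sorting by longitude yields the neighbor order. The pairing of coordinates with camera names (C westmost, A eastmost) is inferred from the Topo replies, not stated on the slides. `TopologyServer`, `join`, and `topo` are hypothetical names.

```python
class TopologyServer:
    """Toy camera topology server: cameras JOIN with (lat, lon) and ask for neighbors."""

    def __init__(self):
        self.cameras = {}  # camera name -> (latitude, longitude)

    def join(self, name, lat, lon):
        self.cameras[name] = (lat, lon)

    def topo(self, name):
        # Assumption: one east-west road, so longitude alone orders the cameras.
        ordered = sorted(self.cameras, key=lambda cam: self.cameras[cam][1])
        i = ordered.index(name)
        neighbors = {}
        if i > 0:
            neighbors["⟵"] = ordered[i - 1]  # next camera to the west
        if i < len(ordered) - 1:
            neighbors["⟶"] = ordered[i + 1]  # next camera to the east
        return neighbors


server = TopologyServer()
server.join("C", 33.778406, -84.401304)  # name-to-coordinate pairing is assumed
server.join("B", 33.778279, -84.399226)
server.join("A", 33.778254, -84.397794)
for cam in ("C", "B", "A"):
    print(cam, server.topo(cam))
```

Because the topology changes only when a camera is added or removed, recomputing the ordering on every request is acceptable in a sketch like this; a real server would likely cache it.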
Device: Vehicle Identification
C
B
A
DetectionEvent_A_0: { camera: A, timestamp: 18:19-07/10/2019, tracklet: [(bbox, frameId),...], features: { moving_direction: 270 (⟵), histogram: np.array([...]) } }
Device: Communication
C
B
Camera B’s Candidate Pool: DetectionEvent_A_0
A
DetectionEvent_A_0: { camera: A, timestamp: 18:19-07/10/2019, tracklet: [(bbox, frameId),...], features: { moving_direction: 270 (⟵), histogram: np.array([...]) } }
Device: Communication
C
B
Camera B’s Candidate Pool: DetectionEvent_A_0, DetectionEvent_A_2
A
DetectionEvent_A_2: { camera: A, timestamp: 18:19:09-07/10/2019, tracklet: [(bbox, frameId),...], features: { moving_direction: 270 (⟵), histogram: np.array([...]) } }
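The communication step above (a camera forwards each detection event to the neighbor the vehicle is heading toward) might look like the following sketch. The `DetectionEvent` fields mirror the slides, minus the tracklet. The heading-to-arrow rule in `arrow_for`, its 45-degree tolerance, and the names `communicate` and `candidate_pools` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionEvent:
    event_id: str
    camera: str
    timestamp: str
    moving_direction: float  # degrees, as on the slides (270 is drawn as ⟵)
    histogram: list = field(default_factory=list)


# Topology replies copied from the slides.
TOPO = {
    "C": {"⟶": "B"},
    "B": {"⟵": "C", "⟶": "A"},
    "A": {"⟵": "B"},
}

candidate_pools = {cam: [] for cam in TOPO}


def arrow_for(direction, tolerance=45.0):
    """Assumed rule: headings near 270 map to ⟵, near 90 map to ⟶, anything else to None."""
    for heading, arrow in ((270.0, "⟵"), (90.0, "⟶")):
        if abs(direction - heading) < tolerance:
            return arrow
    return None


def communicate(event):
    """Append the event to the downstream camera's candidate pool, if there is one."""
    downstream = TOPO[event.camera].get(arrow_for(event.moving_direction))
    if downstream is not None:
        candidate_pools[downstream].append(event)
    return downstream


communicate(DetectionEvent("DetectionEvent_A_0", "A", "18:19-07/10/2019", 270))
communicate(DetectionEvent("DetectionEvent_A_2", "A", "18:19:09-07/10/2019", 270))
print([e.event_id for e in candidate_pools["B"]])
```

Each camera talks only to its immediate neighbors, which is what keeps network communication per camera bounded.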
Device: Vehicle Re-Identification at Camera B C
B
Camera B’s Candidate Pool: DetectionEvent_A_0, DetectionEvent_A_2
Matching!
A
DetectionEvent_B_7: { camera: B, timestamp: 18:19:26-07/10/2019, tracklet: [(bbox, frameId),...], features: { moving_direction: 270 (⟵), histogram: np.array([...]) } }
Device: Vehicle Re-Identification at Camera B C
B
Camera B’s Candidate Pool: DetectionEvent_A_0, DetectionEvent_A_2
Matching!
A
DetectionEvent_B_9: { camera: B, timestamp: 18:19:29-07/10/2019, tracklet: [(bbox, frameId),...], features: { moving_direction: 180(↓), histogram: np.array([...]) } }
Device: Vehicle Re-Identification at Camera C
C
B
Camera C’s Candidate Pool: DetectionEvent_B_7
Matching!
A
DetectionEvent_C_12: { camera: C, timestamp: 18:19:46-07/10/2019, tracklet: [(bbox, frameId),...], features: { moving_direction: 315 (↑), histogram: np.array([...]) } }
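The slides do not say how a new detection is compared against the candidate pool. One plausible sketch, assuming histogram intersection over normalized color histograms and a fixed acceptance threshold (both are assumptions, as are the function names), is shown below. The histogram values are invented so that DetectionEvent_B_7 pairs with DetectionEvent_A_0; the slides do not state which pool entry actually matched.

```python
def histogram_similarity(h1, h2):
    """Histogram intersection of two normalized histograms (1.0 means identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def reidentify(new_event, pool, threshold=0.8):
    """Return the best-matching pool entry and remove it, or None if nothing clears the threshold."""
    best, best_score = None, threshold
    for candidate in pool:
        score = histogram_similarity(new_event["histogram"], candidate["histogram"])
        if score >= best_score:
            best, best_score = candidate, score
    if best is not None:
        pool.remove(best)  # a matched candidate is no longer waiting to be seen
    return best


# Made-up histograms for illustration only.
pool_b = [
    {"id": "DetectionEvent_A_0", "histogram": [0.7, 0.2, 0.1]},
    {"id": "DetectionEvent_A_2", "histogram": [0.1, 0.3, 0.6]},
]
event_b7 = {"id": "DetectionEvent_B_7", "histogram": [0.65, 0.25, 0.1]}

match = reidentify(event_b7, pool_b)
print(match["id"] if match else "no match")
```

Since the pool holds only events forwarded by adjacent cameras, the search stays small no matter how many cameras the network has.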
Edge: Storage
C
DetectionEvent_C_12
B
A
• Trajectory Storage as Graph • Frame Storage - traceability
DetectionEvent_B_7
DetectionEvent_A_0
DetectionEvent_B_9
DetectionEvent_A_2
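A trajectory graph of this kind can be sketched as an adjacency map from each detection event to the events it was matched with downstream, plus a per-event frame list for traceability. The class and method names are hypothetical. The link B_7 to C_12 follows the Camera C slide; the links A_0 to B_7 and A_2 to B_9 are assumed for illustration, and the frame ids are placeholders.

```python
from collections import defaultdict


class TrajectoryStore:
    """Toy edge store: trajectory graph plus frame ids kept for traceability."""

    def __init__(self):
        self.next_hops = defaultdict(list)  # event id -> event ids matched downstream
        self.frames = {}                    # event id -> frame ids backing the event

    def add_event(self, event_id, frame_ids):
        self.frames[event_id] = list(frame_ids)

    def link(self, upstream_id, downstream_id):
        self.next_hops[upstream_id].append(downstream_id)

    def trajectory(self, start_id):
        """Follow the first downstream match from start_id until the chain ends."""
        path = [start_id]
        while self.next_hops.get(path[-1]):
            path.append(self.next_hops[path[-1]][0])
        return path


store = TrajectoryStore()
for event_id in ("DetectionEvent_A_0", "DetectionEvent_A_2", "DetectionEvent_B_7",
                 "DetectionEvent_B_9", "DetectionEvent_C_12"):
    store.add_event(event_id, frame_ids=[])  # placeholder frame lists

store.link("DetectionEvent_A_0", "DetectionEvent_B_7")   # assumed pairing
store.link("DetectionEvent_A_2", "DetectionEvent_B_9")   # assumed pairing
store.link("DetectionEvent_B_7", "DetectionEvent_C_12")  # from the Camera C slide
print(store.trajectory("DetectionEvent_A_0"))
```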
From Coral-Pie to MicroEdge Limitations of Coral-Pie Dedicated hardware resources for each camera processing pipeline
Each pipeline may not use the resources all the time
Camera network intended for multiple applications Opportunity Share TPU, CPU, and memory resources within and across applications
Limitations of the State-of-the-art Prior art focuses on GPU management in resource-rich clusters NVIDIA Docker and NVIDIA MPS Clockwork[1], Heimdall[2], Infaas[3], and Clipper[4]
Lack of support for multiplexing TPU resources across containers Missed opportunity in container-orchestration systems (e.g., Kubernetes, K3s) Under-utilization of TPU resources
[1] Gujarati, Arpan et al. “Serving DNNs like Clockwork: Performance Predictability from the Bottom Up.” OSDI (2020). [2] Yi, Juheon and Youngki Lee. “Heimdall: mobile GPU coordination platform for augmented reality applications.” Proceedings of the 26th Annual International Conference on Mobile Computing and Networking (2020). [3] Romero, Francisco et al. “INFaaS: Automated Model-less Inference Serving.” USENIX Annual Technical Conference (2021). [4] Crankshaw, Daniel et al. “Clipper: A Low-Latency Online Prediction Serving System.” NSDI (2017).
MicroEdge Vision A low-cost edge cluster serves computational resources for a set of geo-local cameras. A performant multi-tenancy architecture. Request: 2 CPU, 1GB memory, 0.5 TPU
Request: 2 CPU, 2GB memory, 1 TPU Request: 1 CPU, 1GB memory, 0.3 TPU
Objectives of MicroEdge System Architecture Share TPU resources across independent application pipelines Performance isolation across independent application pipelines Load-balance incoming inference requests across TPU resources Co-compile different models on TPUs to reduce model swapping overhead
MicroEdge System Architecture Research Questions How do we extend the state of the art?
Extensions to K3s's control plane
• Orchestrate TPU resources on creation and destruction of application pods
• Admission control
• Key components: extended scheduler, TPU co-compiler, resource reclamation
Extensions to K3s's data plane
• Mechanisms to multiplex TPU resources across applications
• Monitoring TPU usage
• Key components: TPU service, TPU service client library, load-balancing service
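To make admission control over fractional TPU requests concrete, here is a minimal sketch. It is not MicroEdge's scheduler: it assumes first-fit placement, tracks only the TPU share (CPU and memory are ignored), and counts capacity in integer "milli-TPU" units to avoid floating-point drift. The class name, method names, and pod names are hypothetical; the 0.5, 1, and 0.3 TPU requests are the ones on the MicroEdge Vision slide.

```python
class TpuAdmissionController:
    """First-fit placement of fractional TPU requests, counted in milli-TPU units."""

    def __init__(self, num_tpus):
        self.free = [1000] * num_tpus  # each TPU starts with 1000 milli-TPU free
        self.placements = {}           # pod name -> (tpu index, milli-TPU granted)

    def admit(self, pod, tpu_fraction):
        need = round(tpu_fraction * 1000)
        for idx, available in enumerate(self.free):
            if available >= need:
                self.free[idx] -= need
                self.placements[pod] = (idx, need)
                return idx
        return None  # rejected: no single TPU has enough spare capacity

    def release(self, pod):
        # Resource reclamation when the application pod is destroyed.
        idx, granted = self.placements.pop(pod)
        self.free[idx] += granted


controller = TpuAdmissionController(num_tpus=2)
first = controller.admit("pipeline-1", 0.5)
second = controller.admit("pipeline-2", 1)
third = controller.admit("pipeline-3", 0.3)
rejected = controller.admit("pipeline-4", 0.5)
print(first, second, third, rejected, controller.free)
```

First-fit is the simplest policy that shows sharing; a scheduler that also wants to limit model swapping would probably prefer placing pods whose models can be co-compiled on the same TPU.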
Evaluation Plan Application Study
Space-time vehicle tracking
Can MicroEdge improve the resource utilization while meeting the SLA of application?
Trace-based Study
Azure Function Traces [1]
Map to ephemeral application pipelines
Can MicroEdge effectively allocate and reclaim TPU resources?
[1] Shahrad, Mohammad, et al. "Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider." 2020 USENIX Annual Technical Conference (USENIX ATC 20). 2020.
Motivation: Workload at the edge is highly variable with time Dedicating resources leads to overprovisioning Scarce resources at the Edge
Original vision for Coral-Pie was to create a space-time track for all vehicles all the time But most of the time, we are interested only in "suspicious" vehicles Increases the chance for resource multiplexing Number of interesting objects detected is sporadic
=> Move toward Serverless at the Edge
Average Objects per Frame
Serverless at the Edge for camera networks
Image source: Zhang, Miao, et al. "Towards cloud-edge collaborative online video analytics with fine-grained serverless pipelines." Proceedings of the 12th ACM Multimedia Systems Conference. 2021.
Limitations of State-of-the-Art Cloud-based FaaS platforms OpenFaaS, KNative (backed by K8s) High Latency for OpenFaaS in MicroEdge ==> High container creation overhead Multiple WAN traversals ==> High WAN overhead Location-agnostic scheduling
[Figure: Clients, Ingress Gateway, Wide-Area Network, Workers at Edge Site 1 and Edge Site 2. WAN latency = 80 ms RTT]
Limitations of State-of-the-Art Cloud-based FaaS platforms
OpenFaaS, KNative (backed by K8s) High Latency for OpenFaaS in MicroEdge Multiple WAN traversals Location-agnostic scheduling
Edge-specific FaaS platforms Mu (SoCC'21)
• Fine-grained queue-size based load-balancing to meet latency requirement
CEVAS (MMSys'21)
• Optimal partitioning of serverless video processing between edge and cloud
Both platforms focus only on resources of 1 edge site Load-balancing and Scaling policies are unaware of load on other sites
[Figure: Edge Site 1 and Edge Site 2, each with its own Ingress Gateway, Workers, and Clients]
Objectives for Serverless @ Edge Cross-site load-balancing of function invocation requests
Location-aware auto-scaling Nimble edge execution environment Handle heterogeneous network latencies between edge sites
Inter-edge-site latencies Image source: Xu, Mengwei, et al. "From cloud to edge: a first look at public edge platforms." Proceedings of the 21st ACM Internet Measurement Conference. 2021.
Initial set of Research Questions Cross-site load balancing policy Whether and to which site to offload?
Global autoscaling policy How to predict the spatio-temporal distribution of client requests? Where to provision function containers to minimize resource wastage and meet application requirements?
Monitoring What metrics are needed to enable above policies? How to efficiently and scalably propagate per-site system state to other sites?
Agility-oriented optimizations in Kubernetes to reduce cold start overhead Network proximity monitoring How to do so accurately, efficiently, and at scale? (Network coordinates?)
Evaluation Plan Load balancing policy Proposed network proximity and load-aware approach Compare with • Coarse-grained monitoring (e.g., uniform load distribution a la OpenFaaS) • Fine-grained monitoring (e.g., fine-grained queue-sizes a la Mu)
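A network-proximity and load-aware policy could take many forms; the slides pose it as an open research question. The sketch below shows one simple possibility under stated assumptions: each site's cost is estimated as the RTT to reach it plus queueing time, where queueing time is (queue length + 1) times a fixed per-request service time. The cost model, the function name `pick_site`, and the sample numbers are all hypothetical.

```python
def pick_site(local_site, queue_len, rtt_ms, service_ms):
    """Choose the site with the lowest estimated completion time for one request.

    queue_len: site name -> number of requests currently queued there
    rtt_ms:    (from_site, to_site) -> round-trip time in ms (local execution costs 0)
    """
    def estimate(site):
        network = 0.0 if site == local_site else rtt_ms[(local_site, site)]
        return network + (queue_len[site] + 1) * service_ms

    return min(queue_len, key=estimate)


rtt = {("edge-1", "edge-2"): 20.0}

# Local site is busy: 0 + 11 * 5 = 55 ms locally vs. 20 + 2 * 5 = 30 ms remotely.
busy = pick_site("edge-1", {"edge-1": 10, "edge-2": 1}, rtt, service_ms=5.0)

# Local site is lightly loaded: 0 + 5 * 5 = 25 ms locally vs. 30 ms remotely.
calm = pick_site("edge-1", {"edge-1": 4, "edge-2": 1}, rtt, service_ms=5.0)
print(busy, calm)
```

The two calls illustrate the "whether and to which site to offload" question: offloading pays off only when the remote queue is short enough to absorb the extra network hop.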
Autoscaling policy Proposed spatio-temporal demand distribution based approach Compare with • Techniques based on CPU utilization/request rate (a la OpenFaaS) and latency-aware (a la Mu)
Network proximity Is the use of network coordinates accurate, efficient, and scalable?
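For readers unfamiliar with network coordinates: each site is assigned a point in a low-dimensional space, and the RTT between two sites is predicted as the distance between their points, so no all-pairs probing is needed. The sketch below follows the basic constant-timestep update of the Vivaldi algorithm; whether the proposed work uses Vivaldi specifically is not stated on the slides. The 80 ms sample RTT, the starting coordinates, and `delta=0.25` are illustrative values.

```python
import math


def predicted_rtt(a, b):
    """Predicted RTT between two sites: Euclidean distance between their coordinates."""
    return math.dist(a, b)


def vivaldi_step(own, peer, measured_rtt, delta=0.25):
    """Move `own` along the line through `peer` so the prediction approaches the measurement."""
    dist = predicted_rtt(own, peer)
    if dist == 0.0:
        raise ValueError("coincident coordinates: nudge one point in a random direction first")
    error = measured_rtt - dist                        # positive means we are too close
    unit = [(o - p) / dist for o, p in zip(own, peer)]  # unit vector pointing away from peer
    return [o + delta * error * u for o, u in zip(own, unit)]


site_a, site_b = [0.0, 0.0], [1.0, 0.0]
for _ in range(50):
    site_a = vivaldi_step(site_a, site_b, measured_rtt=80.0)
print(round(predicted_rtt(site_a, site_b), 3))
```

Only one endpoint is updated here to keep the example deterministic; in a deployment every site would adjust its own coordinate from its own measurements.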
Expected Contributions of the Proposed Research MicroEdge Control and data-plane mechanisms for orchestrating accelerator resources Policies for efficient allocation of accelerator resources Admission Control and Monitoring policies
Serverless @ Edge Cross-site load and latency aware load-balancing policy
Spatio-temporal demand based cross-site autoscaling policy Integrating network coordinates for network proximity monitoring
END