Automated Multi Tenant Data Center Network
An automated multi-tenant data center network is an increasingly desired end goal for large and small organizations including providers. Servers that house the CPU, RAM and Hard Disk resources are serving traffic for applications they host. These servers need connectivity among themselves within the data center and also towards the outside world.
At first an organized set of CPU/RAM/HD Servers are connected to a network device. This happens in a Data Center rack and the network device is a ToR, a Top of Rack switch. Another similar set of servers is connected to another Top of Rack network device. Multiple such sets of servers/network device pods are then linked together. The incumbent way to do this would be to make a leaf-spine Clos fabric. The layer of network devices connecting the servers are the leaf layer and the layer of network devices that is connecting these leaf nodes is the spine layer.
Hardware is thus laid out in a 2-stage or 3-stage Clos fabric and then we need to lay out a logical control plane to pass traffic. Applications on the Server CPU/RAM/HD will talk to each other within the DC which is east-west traffic or to the outside world which can be called north-south traffic.
Depending on the type of application east west traffic could be higher but north south traffic is always present.
Moving bits from a server to any other location is the networks job. These bits could be a compute hosting virtual machine’s bits or a ‘Serverless’ cloud application’s bits but they go somewhere and are moving. They are moved by the network layer regardless of what resides on the servers.
How many layers of protocols and software are required to provide for an automated multi-tenant data center network which can connect servers, host applications and provide east-west/north-south connectivity ?
In the Networking Components blog post some basic networking components were listed out in a different construct: Network Device, Protocols, Protocol Messages, Addresses, Lookup tasks, Identity Tags, Filters & Actions, Network Over Network ( Overlay) Appended Information, Network + Network , Network Inside Network Device, Control and Data Plane.
In the Event-Driven Network Automation blog automation details were described.
The below will make some use of the networking components and event-driven network automation blog posts.
At first you need Addresses appended onto payload bits to ascertain endpoints and exchange traffic. How many layers of addresses will be required to connect the servers to each other over a fabric? In a full mesh structure the networking layer is small/direct and less addresses are required. In a Clos Leaf-Spine-Leaf fabric there needs to be multiple layers of addresses required.
A packet/frame structured bits data structure is switched across multiple nodes. In terms of Addresses Ethernet MACs are used for Layer 2 connectivity between servers NICs and ToR ports. The server could also have an IP Address of its own and be performing Layer 3 communications.
One server connected with one leaf could send an IP packet to another server connected with another leaf (Server<>Leaf<>Spine<>Leaf<>Server). As parts of the Control Plane of laying out the fabric the leaf and spine network devices will have IP addresses of their own which will speak to each other and send Control Plane Protocol Messages. What this infers is that there will be present 2 layers of IP communications. One between the network nodes themselves and one between the servers. This infers the requirement to have an IP address pushed on to another IP address in a tunnel type structure where from one network device to another (e.g. leaf to leaf via spine) the packet is routed based on Outer IP Addresses and the inner address is used by the server. Therefore some packets will require an addressing structure such as IP|Eth|IP|Eth. The IP Tunnel will span from a Leaf to another Leaf via the Spine, therefore the tunnel endpoints are at the Leaf switches. (Server-IP<encapsulation>Leaf-IP<>via Spine <>Leaf-IP<decapsulation>Server-IP)
We have multiple combinations of communications to deal with in multiple layers of the networking stack. Leaf-Local L2, Leaf-Local L3, Leaf-Spine, Leaf-Spine-Leaf L2, Leaf-Spine-Leaf L3. All this calls for multiple domains. A ‘Local’ Link Layer Domain, A Local Network Layer Domain, A Distant Network Layer Domain, A relatively distant Link Layer Domain. A link layer domain could be an L2 VLAN/broadcast domain or a bridge domain and a network layer domain could be a local VRF or a wider-spanning IP-in-IP domain level routing instance.
… A routed layer has IP addresses at two endpoints and an Ethernet link has MAC addresses at two end points. A virtual machine of a tenant in a server can have both an IP address and a MAC address. There could also be a single virtual machine having multiple subnets IPs behind the same MAC address ethernet link. This virtual machine is an endpoint and is this is what the network layer needs to provide connectivity to. Therefore we could say that an endpoint requires at least 2 tables at the network device it is connecting to. An IP Routing table and a MAC table. An ARP table is also required for Inter-Layer discovery. There is also the Leaf-Spine-Leaf IP-in-IP tunnel we spoke about which adds another layer of overlay Routing Table. In addition an outer IP to inner IP socket-style mapping function will be required which is another table (L4 Socket of Outer IP to Inner IP).
Discovering the places of destination-address lookup-actions happening in a network always helps discover the kind of networking happening.
So a Leaf-Local L2 frame (a server sends to another server connected to the same leaf) would be switched locally with the local bridge domain/mac table. A Leaf-Local L3 packet would be routed by the local VRF. A Leaf-Spine-Leaf Packet would be mapped to the relevant far-end leaf tunnel endpoint and a tunnel endpoint IP would be pushed on it; it would then be tunneled/IP routed across the spine to the destination leaf; the destination leaf would then look at the socket-style mapping table of the destination endpoint; it would then pass onto the final destination endpoint.
While the Leaf-Local communications can be handled within the network device by tables, mappings and local lookups, it is obvious that when crossing the spines and reaching for a far end leaf there is a need for a control plane to communicate the far end addresses and mappings. A Protocol to exchange the distant leafs addresses and mappings which establishes the control plane for traffic to be switched and routed between leafs across the spines. There is a spine in the middle and the leafs are not directly connected. A Control Plane to distribute addresses and mappings.
There is a choice here.
For this Leaf-Spine-Leaf addresses exchange & inner/outer mappings population we could use a distributed, nuke-tolerant, internet style packet layer protocol OR instead use an SDN style central controller to do the thinking and push/program the network devices with all the addresses and mappings. The devices need to be populated with far end addresses and mappings and both will achieve this goal.
Our topic is an Automated Multi-Tenant Data Center Network and the automation part of the name is supported by the SDN style.
Why ?
The reason is that any distributed, nuke-tolerant, internet style protocol inherently requires independent configurations on all networking nodes which then enable the devices to start communicating. While an SDN controller is a single configurations point which pushes the configs onto the devices. This means that from an automation standpoint you will be either automating the configurations of hundreds of devices or an SDN controller. Configuring all devices in a large data center fabric independently is difficult to automate while managing the automation of an SDN controller or even levels of SDN controllers is easier.
This Data Center will need to speak to the outside world too.
This means that there will be a border functionality which will provide L3 and L2 reachability to the outside world. i.e. Ethernet L2 connectivity; VLANS or bridge domains, extended from a server in a leaf to a border node leaf and onwards to an outside world L2 construct, say an MPLS L2VPN.
Similarly an L3 VRF extension where a set of routes of an endpoint/server/tenant are stretched onto a border node leaf’s VRF via say MP-BGP style RD/RT mechanism where they are further extended onto an outside world MPLS L3VPN via a PE-CE routing protocol. (Tenant-Routes|VRF|MP-BGP|VRF| VRF-PE <> CE|Outside World).
Our topic also contains the word Multi-Tenant which means that in the case of L3 Multi-tenancy a legacy MPLS L3VPN style VRF/MP-BGP mechanism will be needed per tenant per VRF.
A similar mechanism is required for connecting two or multiple such large data centers between themselves. So “Endpoint<>Leaf<>Spine<>Border-Leaf<>|Infra-Link|<>Border-Leaf<>Spine<>Leaf<>Endpoint” communications are then possible. For L3 VRFs/MP-BGP can provide separation and ensure multi-tenancy for this Inter-DC comms.
When required some border leafs will obviously connect to routers which speak eBGP to the outside world. Other Autonomous Systems over Transit and Peering connections. These routers will have the global routing table and will gateways to the rest of the world. The PE-CE communications mentioned above for VRF stretching can be Static/OSPF/EIGRP/BGP.
To run this Automated Multi-Tenant Data Center lets not forget an overarching Orchestration software residing atop it providing a GUI mechanism into the wide array of options, tools, clogs and combinations to enable tenants Intra-DC, Inter-DC and outside-world communications.