Archive

Network Design

This post is not in English. 🙂

L2, ETH, M ADD, SRC/DST, M TABLE, ARP TABLE, GARP, VLAN, BCAST DOM, MTU, PMTUD. AppPMTUD, JUM FRAMES, 1500, 9000.

STP, ROOT Port, BPDUs, block, listen, learn, forward, disable, RSTP, MSTP, TCN BPDUs,

Swi, Hub, Rtr, Gway,

IP, SUBNET, ETH TY, SRC ADD/DST ADD, TTL, IP Add(Net:Host), CIDR

MPLS, LDP, LSP, LBL, EXP, PHP, FRR

TE, RSVP, LBL ST, LSP. Backup LSP via MPLS TE.

L3VPN, ROUTE, PE-CE , LBL Stack, VPNV4, MPBGP, AFI/SAFI L2VPN, RD,RT, VRF

VLL,L2VPN, VPLS, ELAN, TunnMTU,

OSPF, NBMA, MCAST, AREA, DR, BDR, Areas, LSDB, Rte Sum/Rte Fil b/w areas, Stub (No AS Ext LSA) , TotStub(No Sum, No Ext) , NSSA (Type 7 Ext Transit) , LSA, NET LSA, RTR LSA, EXT LSA, ASBR, ABR, SUMM LSA, ASBR LSA, ASBR SUMM LSA, SPT, HELLO, DEAD TIME, RTR ID,

BGP, TCP 179, TELN 179, PING, FW 179, NHRI, ROUTE, COMMUNITY, B PATH, LNG PREFIX M, WEIGHT, LPEF, ASPATH, IGP/EGP/LOCAL, PREPEND, BlaHol COMM, NHOP, PIC, PREFIX FIL, AS PATH FILT, COMMUN RPL LPREF Actions,

RR, CLUSTER, IBGP, B PATH, vRR

RIB,FIB, AD DIST,

ACL, DENY, ALLOW,

ASIC, CEF, RP, MEM, LC, CBAR FABRIC, LINE RATE, CUT THROUGH.

DC, CLOS, LEAF, SPINE, IGP, IP, LOOPBACK, BGP, ROUTE, VXLAN, MACinUDP, VRF,
MPBGP, EVPN, MAC-ROUTE, MAC VRF, EVI, ESI, SDN, BGP LS, PCEP, BDR LEAF, L3OUT, VRF, MPBGP, PE-CE, DC L2OUT, BD, VLAN, MTenant, BRI DOM, END POINT, 1-MAC/+IPSubnets.

TMETRY, PULL, GRPC L4, SNMP PUSH.

ANSIBLE, NETCONF/YANG, APIs, YAML, WSPACE, HUMreadable, XML/HTML, JSON, JINJA2, PYTHON.

TCP, SYN, ACK, SYN ACK, ECN echo bit, FIN, RST, SL WDOW, PORT, SEQ, P BIT/ImmSend, MSS, RELIA, Ordered.

UDP, CLESS, SPORT, DPORT, Length, checksum. DNS, DHCP, SNMP, Jitter, latency VoIP, unordered, mcast, bcast.

DNS LB, GSS, GLB, ANYCAST GWAY.

DHCP DORA, IP, BCast, Unicast.

DCI, L2/L3, EoMPLSoGREoIP, L2VPN, VPLS, EVPN, MP-BGP VRF.

SAN, SYN/ASYN, latency, DWDM/CWDM, iSCSI, FCoIP.

Workload Mobility, MobIP, LISP, ProxyIP.

QoS, DiffServ, MPLS EXP, E-LSP, L-LSP, per Class DiffServ Aware MPLS-TE vi RSVP Sig.

Linux, Expect, BASH, Python, AWK, SED, GREP, CRON, VIM, NANO.

Scri, D-Types:No, Stri,List (mutable),Tuple (immutable), Dict (key,value), Variables, Arrays(C), List-Stack (LIFO), List-Queue (FIFO).

Cond Prg, If, Ifelse, ElseIf, NestIf, While (Condition True) , For (Iterations known), Break, Continue, For Else, built-in functions, User-def-functions. Library, Framework, local vari, global var.

 

Layer 1, Layer 2, Layer 3 and Layer 4. Physical, Link Layer/MAC Layer, Network Layer, Transport Layer.

The physical layer is physics, and as it improves it keeps raising the bandwidth limits.

Layer 2, sitting atop the physical layer, involves link/medium access control mechanisms between devices. It provides the transfer of data bits over the physical connectivity and involves payloads and addresses.

Layer 3 involves connecting multiple networks and forming an internetwork which further provides network level end-to-end connectivity.

Layer 4 involves end to end host level connectivity.

Within Layer 2 we have software enabled Virtual LANs, we have loop avoidance via Spanning Tree, we have Link aggregation via LACP etc.

Within Layer 3 we have IGP,EGP,VPN,SP,DC,WAN,TE, QoS and what not.

Within Layer 4 we have TCP,UDP; Connection-oriented/Connection-less; Flow-control, windowing; Reliability, acknowledgements, sequencing; Error control, checksum; Port numbers, etc.

Layer 2 and Layer 4 are relatively ‘localized’. Layer 2 due to its physical/link level vicinity and layer 4 due to its in-host & between-host proximity. While there is science in these layers it is somewhat local.

Layer 3 involves much geography. It is the domain which deals with providing end-to-end connectivity spanning much space and area. With this comes much management. It entails reachability across multiplexed systems via addressing, reliability via multi-pathing, reachability status communication, path preferencing in multipath options, path avoidance, virtual privacy and isolation across multiplexed systems, time management for timely fault tolerance and fault bypass, geolocation-based path selection and load balancing, etc.

Hence the birth of large-scale Internetworking Protocols.

Protocols which are engineered to have some mechanisms built-in & agreed upon, while some options require configuration.

Autonomous Networks which have all the mechanisms built-in and require no configurations are not present at the moment, except perhaps somewhere inside Google et al.

For now we have to sift and select between options and configurations for making data flow.

An automated multi-tenant data center network is an increasingly desired end goal for large and small organizations including providers. Servers that house the CPU, RAM and Hard Disk resources are serving traffic for applications they host. These servers need connectivity among themselves within the data center and also towards the outside world.

At first an organized set of CPU/RAM/HD servers is connected to a network device. This happens in a Data Center rack and the network device is a ToR, a Top of Rack switch. Another similar set of servers is connected to another Top of Rack network device. Multiple such server/network device pods are then linked together. The incumbent way to do this would be to make a leaf-spine Clos fabric. The layer of network devices connecting the servers is the leaf layer and the layer of network devices connecting these leaf nodes is the spine layer.

Hardware is thus laid out in a 2-stage or 3-stage Clos fabric and then we need to lay out a logical control plane to pass traffic. Applications on the Server CPU/RAM/HD will talk to each other within the DC which is east-west traffic or to the outside world which can be called north-south traffic.

Depending on the type of application east west traffic could be higher but north south traffic is always present.

Moving bits from a server to any other location is the network's job. These bits could be a compute hosting virtual machine’s bits or a ‘Serverless’ cloud application’s bits but they go somewhere and are moving. They are moved by the network layer regardless of what resides on the servers.

How many layers of protocols and software are required to provide for an automated multi-tenant data center network which can connect servers, host applications and provide east-west/north-south connectivity ?

In the Networking Components blog post some basic networking components were listed out in a different construct: Network Device, Protocols, Protocol Messages, Addresses, Lookup tasks, Identity Tags, Filters & Actions, Network Over Network ( Overlay) Appended Information, Network + Network , Network Inside Network Device, Control and Data Plane.

In the Event-Driven Network Automation blog automation details were described.

The below will make some use of the networking components and event-driven network automation blog posts.

At first you need Addresses appended onto payload bits to ascertain endpoints and exchange traffic. How many layers of addresses will be required to connect the servers to each other over a fabric? In a full mesh structure the networking layer is small/direct and fewer addresses are required. In a Clos Leaf-Spine-Leaf fabric multiple layers of addresses are required.

A packet/frame structured bits data structure is switched across multiple nodes. In terms of Addresses Ethernet MACs are used for Layer 2 connectivity between servers NICs and ToR ports. The server could also have an IP Address of its own and be performing Layer 3 communications.

One server connected to one leaf could send an IP packet to another server connected to another leaf (Server<>Leaf<>Spine<>Leaf<>Server). As part of the Control Plane that lays out the fabric, the leaf and spine network devices will have IP addresses of their own which will speak to each other and send Control Plane Protocol Messages. What this implies is that there will be 2 layers of IP communications: one between the network nodes themselves and one between the servers. This in turn implies the requirement to have an IP address pushed on to another IP address in a tunnel type structure, where from one network device to another (e.g. leaf to leaf via spine) the packet is routed based on the Outer IP Addresses and the inner address is used by the server. Therefore some packets will require an addressing structure such as IP|Eth|IP|Eth. The IP Tunnel will span from a Leaf to another Leaf via the Spine, therefore the tunnel endpoints are at the Leaf switches. (Server-IP<encapsulation>Leaf-IP<>via Spine<>Leaf-IP<decapsulation>Server-IP)
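The encapsulation described above can be sketched roughly as follows. This is only an illustration of the inner/outer header idea, not any specific tunnel protocol; the `Frame` class, its field names and all addresses are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    """One Ethernet + IP header pair plus a payload (illustrative field names)."""
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    payload: object  # application bytes, or another Frame when tunneled


def encapsulate(inner: Frame, local_leaf_ip: str, remote_leaf_ip: str,
                local_mac: str, next_hop_mac: str) -> Frame:
    """Ingress leaf: wrap the server's frame in an outer header addressed leaf to leaf."""
    return Frame(local_mac, next_hop_mac, local_leaf_ip, remote_leaf_ip, inner)


def decapsulate(outer: Frame, my_leaf_ip: str) -> Optional[Frame]:
    """Egress leaf: strip the outer header only if the tunnel terminates here."""
    if outer.dst_ip == my_leaf_ip and isinstance(outer.payload, Frame):
        return outer.payload
    return None


# Hypothetical addresses: servers in 192.0.2.0/24, leaf tunnel endpoints in 10.0.0.0/24.
server_frame = Frame("server1-mac", "server2-mac", "192.0.2.10", "192.0.2.20", b"hello")
tunneled = encapsulate(server_frame, "10.0.0.1", "10.0.0.2", "leaf1-mac", "spine1-mac")
# The spine routes on tunneled.dst_ip only; the inner addresses are never inspected.
delivered = decapsulate(tunneled, "10.0.0.2")
```

The spine only ever sees the outer pair of addresses, which is why the tenant addressing can stay independent of the fabric addressing.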

We have multiple combinations of communications to deal with in multiple layers of the networking stack.  Leaf-Local L2, Leaf-Local L3, Leaf-Spine, Leaf-Spine-Leaf L2, Leaf-Spine-Leaf L3.  All this calls for multiple domains. A ‘Local’ Link Layer Domain, A Local Network Layer Domain, A Distant Network Layer Domain, A relatively distant Link Layer Domain. A link layer domain could be an L2 VLAN/broadcast domain or a bridge domain and a network layer domain could be a local VRF or a wider-spanning IP-in-IP domain level routing instance.

… A routed layer has IP addresses at two endpoints and an Ethernet link has MAC addresses at two endpoints. A virtual machine of a tenant in a server can have both an IP address and a MAC address. There could also be a single virtual machine having IPs from multiple subnets behind the same MAC address on one Ethernet link. This virtual machine is an endpoint and this is what the network layer needs to provide connectivity to. Therefore we could say that an endpoint requires at least 2 tables at the network device it is connecting to: an IP Routing table and a MAC table. An ARP table is also required for inter-layer discovery. There is also the Leaf-Spine-Leaf IP-in-IP tunnel we spoke about, which adds another layer of overlay Routing Table. In addition an outer IP to inner IP socket-style mapping function will be required, which is another table (L4 Socket of Outer IP to Inner IP).

Discovering the places of destination-address lookup-actions happening in a network always helps discover the kind of networking happening.

So a Leaf-Local L2 frame (a server sends to another server connected to the same leaf) would be switched locally with the local bridge domain/mac table. A Leaf-Local L3 packet would be routed by the local VRF. A Leaf-Spine-Leaf Packet would be mapped to the relevant far-end leaf tunnel endpoint and a tunnel endpoint IP would be pushed on it; it would then be tunneled/IP routed across the spine to the destination leaf; the destination leaf would then look at the socket-style mapping table of the destination endpoint; it would then pass onto the final destination endpoint.
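The three lookup cases above can be sketched as one forwarding function. This is a simplified model under several assumptions: the `Leaf` class, table layouts, port names and addresses are all hypothetical, and real devices key these lookups differently (for example on the gateway MAC for routed traffic).

```python
import ipaddress


class Leaf:
    """Toy leaf switch holding the tables discussed in the text (all names hypothetical)."""

    def __init__(self, name, tunnel_ip):
        self.name = name
        self.tunnel_ip = tunnel_ip
        self.mac_table = {}         # (bridge_domain, mac) -> local port
        self.vrf_routes = {}        # vrf -> {ip_network: local port}
        self.remote_endpoints = {}  # (vrf, endpoint ip) -> far-end leaf tunnel IP

    def forward(self, vrf, bridge_domain, dst_mac, dst_ip):
        # 1. Leaf-local L2: the destination MAC is known in the local bridge domain.
        port = self.mac_table.get((bridge_domain, dst_mac))
        if port is not None:
            return ("switch-local", port)
        # 2. Leaf-local L3: longest-prefix match in the tenant VRF.
        addr = ipaddress.ip_address(dst_ip)
        matches = [(net, p) for net, p in self.vrf_routes.get(vrf, {}).items()
                   if addr in net]
        if matches:
            _, port = max(matches, key=lambda m: m[0].prefixlen)
            return ("route-local", port)
        # 3. Leaf-spine-leaf: map the endpoint to the far-end tunnel endpoint.
        remote = self.remote_endpoints.get((vrf, dst_ip))
        if remote is not None:
            return ("tunnel", remote)
        return ("drop", None)


leaf1 = Leaf("leaf1", "10.0.0.1")
leaf1.mac_table[("bd-blue", "server2-mac")] = "eth2"
leaf1.vrf_routes["tenant-a"] = {ipaddress.ip_network("192.0.2.11/32"): "eth3"}
leaf1.remote_endpoints[("tenant-a", "192.0.2.20")] = "10.0.0.2"

local_l2 = leaf1.forward("tenant-a", "bd-blue", "server2-mac", "192.0.2.12")
local_l3 = leaf1.forward("tenant-a", "bd-blue", "gateway-mac", "192.0.2.11")
remote = leaf1.forward("tenant-a", "bd-blue", "gateway-mac", "192.0.2.20")
unknown = leaf1.forward("tenant-a", "bd-blue", "gateway-mac", "198.51.100.1")
```

The `remote_endpoints` table is the part that cannot be learned locally, which is what motivates the control plane discussed next.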

While the Leaf-Local communications can be handled within the network device by tables, mappings and local lookups, it is obvious that when crossing the spines and reaching for a far end leaf there is a need for a control plane to communicate the far end addresses and mappings. There is a spine in the middle and the leafs are not directly connected, so a protocol is needed to exchange the distant leafs' addresses and mappings. This establishes the control plane for traffic to be switched and routed between leafs across the spines: a Control Plane to distribute addresses and mappings.

There is a choice here.

For this Leaf-Spine-Leaf addresses exchange & inner/outer mappings population we could use a distributed, nuke-tolerant, internet style packet layer protocol OR instead use an SDN style central controller to do the thinking and push/program the network devices with all the addresses and mappings. The devices need to be populated with far end addresses and mappings and both will achieve this goal.

Our topic is an Automated Multi-Tenant Data Center Network and the automation part of the name is supported by the SDN style.

Why ?

The reason is that any distributed, nuke-tolerant, internet style protocol inherently requires independent configurations on all networking nodes, which then enable the devices to start communicating, while an SDN controller is a single configuration point which pushes the configs onto the devices. This means that from an automation standpoint you will be automating either the configurations of hundreds of devices or an SDN controller. Configuring all devices in a large data center fabric independently is difficult to automate, while managing the automation of an SDN controller, or even levels of SDN controllers, is easier.

This Data Center will need to speak to the outside world too.

This means that there will be a border functionality which will provide L3 and L2 reachability to the outside world. For L2 this is Ethernet connectivity: VLANs or bridge domains extended from a server on a leaf to a border leaf node and onwards to an outside world L2 construct, say an MPLS L2VPN.

Similarly for L3 there is a VRF extension, where a set of routes of an endpoint/server/tenant is stretched onto a border leaf node's VRF via, say, an MP-BGP style RD/RT mechanism, and from there further extended onto an outside world MPLS L3VPN via a PE-CE routing protocol. (Tenant-Routes|VRF|MP-BGP|VRF|VRF-PE<>CE|Outside World).

Our topic also contains the word Multi-Tenant which means that in the case of L3 Multi-tenancy a legacy MPLS L3VPN style VRF/MP-BGP mechanism will be needed per tenant per VRF.
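As a rough illustration of the RD/RT mechanism mentioned above: a VRF exports routes tagged with route targets, and another VRF imports a route only if one of its import targets matches. The class names, RD/RT values and prefixes below are invented for the sketch and do not model real MP-BGP encoding.

```python
from dataclasses import dataclass, field


@dataclass
class VpnRoute:
    rd: str                  # route distinguisher keeps overlapping prefixes unique
    prefix: str
    route_targets: frozenset


@dataclass
class Vrf:
    name: str
    rd: str
    export_rts: frozenset
    import_rts: frozenset
    table: dict = field(default_factory=dict)  # prefix -> RD of the originating VRF

    def export(self, prefix: str) -> VpnRoute:
        """Tag a local prefix with this VRF's RD and export route targets."""
        return VpnRoute(self.rd, prefix, self.export_rts)

    def maybe_import(self, route: VpnRoute) -> bool:
        """Install the route only when an import target matches."""
        if self.import_rts & route.route_targets:
            self.table[route.prefix] = route.rd
            return True
        return False


# Hypothetical tenants: A uses RT 65000:1, B uses RT 65000:2.
leaf_a = Vrf("tenant-a@leaf", "10.0.0.1:1", frozenset({"65000:1"}), frozenset({"65000:1"}))
border_a = Vrf("tenant-a@border", "10.0.0.9:1", frozenset({"65000:1"}), frozenset({"65000:1"}))
border_b = Vrf("tenant-b@border", "10.0.0.9:2", frozenset({"65000:2"}), frozenset({"65000:2"}))

route = leaf_a.export("192.0.2.0/24")
imported_by = [v.name for v in (border_a, border_b) if v.maybe_import(route)]
```

Only the tenant A VRF on the border leaf installs the route, which is the per-tenant separation the text relies on.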

A similar mechanism is required for connecting two or multiple such large data centers between themselves. So “Endpoint<>Leaf<>Spine<>Border-Leaf<>|Infra-Link|<>Border-Leaf<>Spine<>Leaf<>Endpoint” communications are then possible. For L3 VRFs/MP-BGP can provide separation and ensure multi-tenancy for this Inter-DC comms.

When required, some border leafs will connect to routers which speak eBGP to the outside world, i.e. to other Autonomous Systems over Transit and Peering connections. These routers will have the global routing table and will be gateways to the rest of the world. The PE-CE communications mentioned above for VRF stretching can be Static/OSPF/EIGRP/BGP.

To run this Automated Multi-Tenant Data Center, let's not forget an overarching Orchestration software residing atop it, providing a GUI into the wide array of options, tools, cogs and combinations that enable tenants' Intra-DC, Inter-DC and outside-world communications.

Enabling application endpoints to communicate via a network requires a whole bunch of protocols in the networking layer. Different protocols providing different functionality each providing a brick making a wall which is achieving the end goal of endpoints communication.

Physically after transceivers have delivered ordered bits in a memory location in a network device they are digested. It could be any of a number of control plane or data plane datagrams that the network device needs to digest.

It could be a Layer 2 MAC / Ethernet layer frame aimed at information transfer within the local area. It could be an ARP control plane frame. It could be IP address reachability information like an OSPF or IS-IS control plane packet. It could be a TCP handshake packet or a TCP payload packet. It could be a UDP packet. It could be a BGP Update message providing next layer (IP) reachability information. It could be an MPLS labeled packet being switched across an IP core network.

It depends.

Regarding Reachability Wikipedia states:

“ In graph theory, reachability refers to the ability to get from one vertex to another within a graph. A vertex s can reach a vertex t (and t is reachable from s) if there exists a sequence of adjacent vertices (i.e. a path) which starts with s and ends with t.

In an undirected graph, reachability between all pairs of vertices can be determined by identifying the connected components of the graph. Any pair of vertices in such a graph can reach each other if and only if they belong to the same connected component. The connected components of an undirected graph can be identified in linear time. The remainder of this article focuses on the more difficult problem of determining pairwise reachability in a directed graph. ”
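For an undirected graph, the reachability check in the quote can be sketched with a breadth-first search. The topology below is a made-up example (two leafs joined by a spine, plus one leaf with no links), and `reachable` is an illustrative helper name.

```python
from collections import deque


def reachable(graph, s, t):
    """Return True if a path exists from vertex s to vertex t (breadth-first search)."""
    seen = {s}
    queue = deque([s])
    while queue:
        vertex = queue.popleft()
        if vertex == t:
            return True
        for neighbor in graph.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical fabric as an adjacency list; leaf3 has no uplink.
fabric = {
    "leaf1": ["spine1"],
    "leaf2": ["spine1"],
    "spine1": ["leaf1", "leaf2"],
    "leaf3": [],
}

leaf1_to_leaf2 = reachable(fabric, "leaf1", "leaf2")
leaf1_to_leaf3 = reachable(fabric, "leaf1", "leaf3")
```

Here `leaf1` and `leaf2` sit in the same connected component, while `leaf3` forms a component of its own.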

It’s interesting that mathematically a network is a Graph and a networking device is a Vertex but we’re blogging on networks and not on math.

BGP

BGP neighbors are manually configured to utilize a TCP connection at port 179 to exchange IP address routing information. This is the most common use on the wider Internet where transit providers use BGP to exchange IP routes of connected networks. A large service provider which sells internet transit uses BGP to peer with similar other service provider networks and with server hosting providers.  BGP can also be leveraged to advertise information other than IP e.g. MAC routes in EVPN.

Practically speaking, any two routers with an established BGP connection send Update messages to add and withdraw IP prefixes (routes) and the routes' attributes (AS Path, Community etc.).

BGP has a full finite state machine where a session transitions from the Idle state to the Established state. Initially Idle, it transitions to Connect while the TCP connection is attempted (falling back to Active if the attempt fails), to OpenSent when the Open message is sent, to OpenConfirm once the peer's Open message has been received, and then to Established when a Keepalive message confirms acceptance; thereafter Keepalive messages are exchanged periodically. A Notification message, by contrast, signals an error and tears the session down. In the OpenConfirm state the two BGP ends have both sent Open messages to each other and are checking the information to see if a BGP session with this peer should be established. The primary information in the Open message includes the BGP version number, the AS number, the hold timer, the BGP router ID and the optional parameters. The optional parameters contain TLVs which negotiate capabilities such as the multiprotocol (MP-BGP) extensions to be used between the peers.
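The session states can be sketched as a small transition table. This is a heavily simplified model: the event names are invented for readability, timers and collision handling are left out, and any unexpected event simply returns the session to Idle.

```python
# (current state, event) -> next state. Event names are illustrative, not protocol terms.
TRANSITIONS = {
    ("Idle", "start"): "Connect",
    ("Connect", "tcp_established"): "OpenSent",      # Open message is sent here
    ("Connect", "tcp_failed"): "Active",
    ("Active", "tcp_established"): "OpenSent",
    ("Active", "connect_retry_expired"): "Connect",
    ("OpenSent", "open_received"): "OpenConfirm",    # Keepalive is sent in reply
    ("OpenConfirm", "keepalive_received"): "Established",
    ("Established", "keepalive_received"): "Established",
    ("Established", "update_received"): "Established",
}


def next_state(state, event):
    """Advance the session; a Notification or an unexpected event resets to Idle."""
    if event == "notification_received":
        return "Idle"
    return TRANSITIONS.get((state, event), "Idle")


def run(events, state="Idle"):
    """Feed a sequence of events through the state machine and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


happy_path = run(["start", "tcp_established", "open_received", "keepalive_received"])
via_active = run(["start", "tcp_failed", "tcp_established",
                  "open_received", "keepalive_received"])
torn_down = run(["start", "tcp_established", "open_received",
                 "keepalive_received", "notification_received"])
```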

Once the session is established, Update messages are sent with the routing information and route attributes. An Update that changes the BGP table causes the table version number to increment. An Update message contains withdrawn (unfeasible) routes, path attributes and NLRI, which are the IP prefixes being advertised. Path attributes such as AS_Path, LocalPref and MED are carried in the Update message.

iBGP, as opposed to eBGP, is used to communicate routes within an Autonomous System. The AS_Path is treated differently in the case of iBGP: a router adds its own AS number to the path only when it is speaking to an eBGP peer and does not add it when speaking to an iBGP peer. On receipt, a BGP speaker which sees its own AS in the AS_Path drops the route, assuming a loop. Because the AS_Path does not change inside the AS, this check cannot detect loops between iBGP peers, so routes learned from one iBGP peer are not re-advertised to another. Either a full mesh is therefore required for iBGP so that every router knows every destination, or Route Reflectors can be used to peer with the iBGP speakers and reflect routes. Routes received from a client in an RR setup are reflected to other clients and to non-client neighbors.
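The AS_Path handling described above can be sketched in a few lines. The function names are illustrative and the AS numbers are taken from the documentation range.

```python
def advertise(as_path, local_as, peer_as):
    """Return the AS_PATH as sent to a peer: the local AS is prepended only on eBGP."""
    if peer_as != local_as:              # eBGP session
        return [local_as] + list(as_path)
    return list(as_path)                 # iBGP session: path is left unchanged


def accept(as_path, local_as):
    """Loop prevention: reject any route whose AS_PATH already contains our own AS."""
    return local_as not in as_path


learned = [64497, 64498]                        # path as received by AS 64496
to_ibgp_peer = advertise(learned, 64496, 64496)
to_ebgp_peer = advertise(learned, 64496, 64499)
looped_back = accept(to_ebgp_peer, 64496)       # the same route returning to AS 64496
```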

One of the mechanisms in BGP is the best path selection methodology. If an IP prefix is reachable from multiple paths BGP has a list of if else steps through which it transitions to select one best path and advertise that.

The best path selection criteria are given below.
1) Weight (Cisco locally assigned – higher weight preferred)
2) Local Preference – Prefer path with higher local pref
3) Locally originated – prefer routes originated on this router (Cisco network or aggregate commands) over routes learned from a peer
4) Shortest AS_PATH  (Prefer path with shorter as path)
5) Lowest origin type (IGP < EGP < Incomplete)
6) Lowest multi-exit discriminator
7) eBGP over iBGP
8) Lowest IGP metric
9) …
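The first eight steps can be sketched as a sort key, where the path with the smallest key wins. This is a simplification: the `Path` fields and defaults are assumptions for the example, MED is compared unconditionally here although routers normally compare it only between paths from the same neighboring AS, and the tie-breakers after step 8 are not modeled.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    next_hop: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0


def preference_key(p: Path):
    """Tuple ordered like steps 1 to 8; a lower tuple means a more preferred path."""
    return (
        -p.weight,                 # 1) higher weight preferred
        -p.local_pref,             # 2) higher local preference preferred
        not p.locally_originated,  # 3) locally originated preferred
        len(p.as_path),            # 4) shorter AS_PATH preferred
        ORIGIN_RANK[p.origin],     # 5) lowest origin type
        p.med,                     # 6) lowest MED
        not p.ebgp,                # 7) eBGP over iBGP
        p.igp_metric,              # 8) lowest IGP metric to the next hop
    )


def best_path(paths):
    return min(paths, key=preference_key)


short_path = Path("A", as_path=(64496,))
high_pref = Path("B", local_pref=200, as_path=(64496, 64497, 64498))
winner = best_path([short_path, high_pref])
```

In this example the higher Local Preference wins even though the other path has a shorter AS_PATH, because step 2 is evaluated before step 4.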

Another aspect of BGP is route filtering and route manipulation via Community attributes. A community attribute is sent in a numbered format, e.g. 6939:400, to trigger an impact on the far end neighbor's path selection. For example, if one neighbor sends the 6939:400 community to another neighbor, the receiving side will set the Local Pref of the route to 400 based on a previously agreed upon understanding. This is achieved by if-then-else route policies at the receiver end. Commonly used communities include Local_Pref setting communities and blackhole communities.
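A receiver-side policy of this kind can be sketched as a plain if-then-else function. The 6939:400 mapping is the example from the text; the route dictionary layout, the `"discard"` next hop and the function name are assumptions for illustration, and 65535:666 is the well-known BLACKHOLE community value.

```python
# Agreed-upon mapping between communities and Local Pref (example value from the text).
LOCAL_PREF_BY_COMMUNITY = {"6939:400": 400}
BLACKHOLE = "65535:666"  # well-known BLACKHOLE community


def apply_import_policy(route):
    """Return a copy of the route after the receiver's community-based policy."""
    result = dict(route)
    communities = route.get("communities", ())
    if BLACKHOLE in communities:
        result["next_hop"] = "discard"   # hypothetical marker for a null route
        return result
    for community in communities:
        if community in LOCAL_PREF_BY_COMMUNITY:
            result["local_pref"] = LOCAL_PREF_BY_COMMUNITY[community]
            break
    return result


tagged = apply_import_policy(
    {"prefix": "192.0.2.0/24", "local_pref": 100, "communities": ("6939:400",)})
untagged = apply_import_policy(
    {"prefix": "198.51.100.0/24", "local_pref": 100, "communities": ()})
dropped = apply_import_policy(
    {"prefix": "203.0.113.7/32", "next_hop": "10.0.0.5", "communities": (BLACKHOLE,)})
```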

Another aspect of BGP is Multihoming and traffic load balancing. If one autonomous system is multihomed to another autonomous system it will use LPref, Communities and AS Path prepending to influence traffic.

BGP has also been used as an IGP alternative in Massive Scale Data Center deployments using Clos fabrics.

BGP is flexible, scalable, stable and reliable, but it is slow in convergence, has limitations in terms of load balancing and requires large CPU/TCAM resources in the case of large routing table sizes.

 

References

https://learningnetwork.cisco.com/blogs/vip-perspectives/2017/12/14/demystifying-bgp-session-establishment

https://www.inetdaemon.com/tutorials/internet/ip/routing/bgp/operation/messages/update/

https://clnv.s3.amazonaws.com/2018/usa/pdf/BRKRST-3320.pdf

https://blog.ipspace.net/2017/11/bgp-as-better-igp-when-and-where.html

http://huzeifabhai.blogspot.com/2011/08/eigrp-ospf-bgp-strengths-weakness.html

 

 

 

 

The difference between the shortest path in a network and the path that traffic between two points actually takes is defined as network stretch. It can also refer to the difference between the shortest physical path and the shortest logical path a packet being forwarded must travel. There could be a difference between the shortest physical path and logical path if the link cost between a set of hops is higher resulting in logical path being different.

Network stretch can be calculated based on comparing hop counts through a network, the metric along two paths and/or the delay along two paths among other things.
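As a minimal sketch of the metric-based variant, stretch can be computed as the cost of the path actually taken minus the cost of the shortest path. The topology and link costs below are invented, and with every link cost set to 1 the result is also the hop-count stretch.

```python
import heapq


def shortest_cost(graph, src, dst):
    """Dijkstra's algorithm: lowest total link cost from src to dst, or None if unreachable."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            return cost
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, link_cost in graph.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return None


def path_cost(graph, path):
    """Sum of link costs along an explicit hop-by-hop path."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


def stretch(graph, path):
    """Cost of the path actually taken minus the cost of the shortest path."""
    return path_cost(graph, path) - shortest_cost(graph, path[0], path[-1])


# Hypothetical topology: A-B-D is the short way round, A-C-E-D the long way.
topology = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "E": 1},
    "D": {"B": 1, "E": 1},
    "E": {"C": 1, "D": 1},
}

engineered = stretch(topology, ["A", "C", "E", "D"])
direct = stretch(topology, ["A", "B", "D"])
```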

Stretch is not always bad, and increasing stretch via Policy Based Routing or Traffic Engineering to push traffic off the shortest physical path onto a desired logical path can be a desired outcome. In this case the post-TE network stretch, the difference between the physical path and the logical path, is desirable, required and a policy decision.

Defining and calculating network stretch can aid in finding the complexity of a network.

References: Computer Networking Problems and Solutions (2017)

 

 

To gain an understanding of components that make up networks we’ll start by stating that a network is a combination of tools working together to provide connectivity to endpoints.

Let’s list the tools.

Network Device (Switch and Router and others) – This is a device which terminates multiple cables into itself with the other end of the cable being other devices. The network device interconnects multiple endpoints via its ports on which cables terminate.

Protocols – These are tools which provide for a coordination mechanism. This coordination mechanism is an exchange of information which makes possible the exchange of traffic.

Protocol Messages – These are messages exchanged between Protocols while they coordinate the laying of the network foundations for exchange of traffic.

Addresses – These come in many flavours and are intended to identify the source and destination of a data payload which is traversing the network.  They can be layered/structured for aggregation and division pools.

Lookup – This is done on the various addresses to find the next hop. Lookup is done to find the next point to which the data payload should be sent so that it reaches its ultimate destination after traversing the network.

Appended Information – This is a general term which encompasses information traversing the network which is other than payload and addresses. These are information and tools which are put into packets for protocol operations. This is information inside headers other than the addresses.

Identity Tags – This is a specific class of Appended Information which provides for identity functionality during a lookup and for identification and separation of protocol functions.

Filters & Actions – These are deployed on the network devices to provide intelligent selection and resulting actions over the traversing data payload. They utilize the addresses and appended information inside the data payloads and also the headers.

Network Over Network – This is a general term for a network on top of a network for provision of separate connectivity. A combination of another layer of protocols and addresses results in a network over a network.

Network + Network – This is a term identifying the interconnection of 2 or more separate networks resulting in a larger network. Also called internetwork it signifies one domain interconnected to another domain.

Control and Data Plane – Control Plane is the network protocols laying the network foundations and data plane is the traffic traversing the network. Control Plane enables Data Plane.

Network Inside Network Device – This is a term signifying the division of a network device to facilitate a software separation in networks. It creates separate networks inside a network device via operating system software constructs.

We can put brands on these:

OSPF/ISIS/BGP are Protocols to lay the Control Plane for IP addresses

LDP is the Protocol to lay the control plane for MPLS addresses (labels)

MAC Address / IP Address / MPLS Labels are addresses and Lookups are done on them during Data Plane operation

MPLS L2 VPN / MPLS L3 VPN are a Network Over Network function based on labels.

MP-BGP is a protocol to lay Control Plane for Network over Network (MPLS L2VPN & EVPN)

AS to AS BGP connectivity is a Network + Network function

Route Maps / Prefix Lists / AS Path Lists are part of Filters and Actions

OSPF Areas and ISIS Levels are a Network domain + Network domain layering type function

QoS Diffserv and CoS are appended information for actions and functionalities

EVPN, OTV & VXLAN are Network over a Network options. These provide a network over a network Control Plane and network over a network Data Plane.

VXLAN VNID / VLAN TAG / Route Target / Route Distinguishers / BGP Communities are Identity tags for protocol operations where they aid the control plane or data plane.

VDC / VRF / EVPN EVI are Network inside Network Device features primarily being operating system software constructs.

This is a rough approach with much simplification but is intended to view the various network components as tools providing functionality working in unison for connectivity provision. This view aids looking at the components from a Design perspective.

Whether it is a Service Provider, Enterprise or Data Center / Cloud IaaS network the components interact and provide functionality.

I passed the Brocade Certified Network Designer certification yesterday. It was a good learning experience preparing for the exam and learning about Network Design. My study path for successfully achieving the certification included (among other things) reading the CCDA 4th Edition book (hard copy straight from CiscoPress) and studying the BCND in a Nutshell guide (free, online).

Rubbing shoulders with a few network architects at their blogs, podcasts & webinars also helps. If you’re interested you can look at @packetpushers, @etherealmind & @ioshints.
