Infrastructure as Code has two main parts. The first is running the code itself to execute the change on the cloud platform. The second is maintaining version control of the code, i.e. coordinating multiple changes made to the code by multiple people.

For executing the change one could use Ansible or Terraform.

From a high level you simply run an Ansible playbook after changing the variables files to make changes to the environment. One could do this without going too deep into how Ansible works. Inside Ansible there are roles and tasks, which divide the execution of playbooks into a structured format so that writing complex playbooks is easier.

The second part, version control of the code, is required because multiple people in a team make multiple changes to the Infra as Code variables. For example, one person could be adding a firewall rule subnet to one firewall while another person is adding a firewall rule to another firewall. If all the firewall rule subnets for all the firewalls are held in one variables file, then you need version control to coordinate these two changes to the file. This version control is mostly done with Git, with the repository hosted on a service such as Bitbucket. These are two well-known tools for maintaining software code versions.
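The coordination problem described above can be sketched in a few lines of Python. This only illustrates the idea of a three-way merge and is not how Git works internally. The firewall names, the subnets and the `merge_rules` helper are all hypothetical.

```python
# Hypothetical variables file content: firewall name -> list of allowed subnets.
base = {
    "fw-a": ["10.0.1.0/24"],
    "fw-b": ["10.0.2.0/24"],
}

# Person 1 adds a subnet to fw-a; person 2 adds a subnet to fw-b.
change_1 = {"fw-a": ["10.0.1.0/24", "10.0.3.0/24"], "fw-b": ["10.0.2.0/24"]}
change_2 = {"fw-a": ["10.0.1.0/24"], "fw-b": ["10.0.2.0/24", "10.0.4.0/24"]}


def merge_rules(base, ours, theirs):
    """Three-way merge per firewall: keep one side's edit if the other side left it untouched."""
    merged = {}
    for fw in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(fw), ours.get(fw), theirs.get(fw)
        if o == t:
            value = o      # both sides agree
        elif o == b:
            value = t      # only "theirs" changed this firewall
        elif t == b:
            value = o      # only "ours" changed this firewall
        else:
            # Both people edited the same firewall differently: a human must decide.
            raise ValueError(f"conflict on {fw}: needs manual review")
        if value is not None:
            merged[fw] = value
    return merged


merged = merge_rules(base, change_1, change_2)
print(merged)
```

Because the two people touched different firewalls, both changes land in the merged file. If both had edited the same firewall in different ways, the sketch raises a conflict, which is the point where peer review comes in.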

This is similar to what the build and maintenance of any large software system requires: multiple software developers write and change code in all sorts of code files at the same time, so a version control system is needed to maintain consistency. These systems have push and pull mechanisms, where you make a change locally and push it to the shared repository, and others pull the change from the main branch. They also have peer review mechanisms where other team members can review your code differences before they allow your code to enter the main repository.

To conclude, imagine you have 30 Azure VNETs (Network VRFs) and 30 Azure Firewalls in your product deployment. As people regularly ask you to make firewall rule changes and to add and delete subnets, you can either go to the Azure web portal and make the changes there manually, or you can use Infrastructure as Code and make the changes via Ansible playbooks and variable files kept in Git/Bitbucket.

This post will cover multicloud networking integration between multiple public clouds and the on-prem network. Imagine four clouds: three being AWS, Azure and GCP, and the fourth being the on-prem private cloud, which is basically a Data Center network.

All these four clouds will be glued together somehow and that glueing will be the multicloud scenario. The basic requirements would be to have switching, routing, firewalling and load balancing equipment present within the glueing network between the four clouds.

Switching would be present to trunk layer 2 between IP endpoints. Routing and routing protocols like BGP would be there to exchange the reachability information of the IP endpoints, populate the routing tables and provide the next hops.

IP planning would be involved to ensure that the On-Prem network and the three public clouds don’t have duplicate, conflicting IP address spaces and that there aren’t two endpoints in the network generating packets with the same source IP address.
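A quick way to sanity check such an address plan is with Python’s standard `ipaddress` module. The sketch below is only an illustration: the prefixes and the `find_overlaps` helper are hypothetical, and one overlap has been planted on purpose.

```python
import ipaddress
from itertools import combinations

# Hypothetical address plan; names and prefixes are illustrative only.
address_plan = {
    "on-prem": "10.0.0.0/16",
    "aws": "10.1.0.0/16",
    "azure": "10.2.0.0/16",
    "gcp": "10.1.128.0/17",  # deliberately overlaps with "aws"
}


def find_overlaps(plan):
    """Return every pair of environments whose address spaces overlap."""
    nets = {name: ipaddress.ip_network(prefix) for name, prefix in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


overlaps = find_overlaps(address_plan)
print(overlaps)
```

An empty list means the four address spaces can be glued into one routing table without conflicts.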

In essence, if there is a single routing table present in your environment which has routes for all three public cloud endpoint subnets and also the routes for the on-prem DC network, then you have multicloud established.

Wherever this routing table exists, from that location there will be Layer 2 switching links and trunks into the three clouds and On-Prem until the trunks reach the other routing tables within the clouds, be it the Azure VNET routing table, the AWS/GCP VPC routing tables or the On-Prem DC routing tables.
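To make the idea of one routing table covering all four clouds concrete, here is a minimal longest-prefix-match sketch. The prefixes, the next-hop labels and the `lookup` helper are hypothetical and stand in for whatever router or firewall actually holds the table.

```python
import ipaddress

# Hypothetical glue routing table: prefix -> next hop (which trunk to use).
routes = {
    "10.0.0.0/16": "trunk-to-on-prem-dc",
    "10.1.0.0/16": "trunk-to-aws",
    "10.2.0.0/16": "trunk-to-azure",
    "10.3.0.0/16": "trunk-to-gcp",
    "10.1.5.0/24": "trunk-to-aws-second-vpc",  # more specific route
}


def lookup(table, destination):
    """Return the next hop of the most specific route containing the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in table
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no route: the packet would be dropped
    best = max(matches, key=lambda net: net.prefixlen)
    return table[str(best)]


print(lookup(routes, "10.1.5.9"))   # matched by the /24
print(lookup(routes, "10.2.7.1"))   # matched by the Azure /16
```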

This multi-cloud environment is somewhat similar to the large Service Provider public internet networks we are all familiar with, where each large SP can be considered a cloud in itself with routes being exchanged with the other large SPs over BGP, similar to the cloud routes.

SP environments are mostly used for traffic passing through, whereas in the multi-cloud enterprise environment there are Data Sources and Data Sinks either On-Prem or in the Public Clouds. Another difference is that the glueing network in the middle will have firewalling too.

Let’s say a new connection is required to a VPC subnet in an AWS region. Firstly the layer 2 would be provisioned over AWS Direct Connect, either directly with AWS or with partners like Megaport. In the majority of cases the on-prem device which connects to the Direct Connect service will be provisioned with a new VLAN.

Once this is done this layer 2 will be trunked to the on-prem device where the IP endpoint is provisioned and the routing table exists. This could be a firewall or a router. This is where the next hops for the packets are decided.

On-Prem firewall filtering is in the path, where the different DMZ regions, IP subnets and L4 ports are allowed or disallowed to communicate with each other. If the On-Prem device with the routing table containing the multi-cloud routes is a firewall, things are simpler in the sense that the firewall filters are present on the same device and the different clouds are treated as different DMZ zones.

This multicloud networking scenario is a routing environment which has multiple routing domains as spokes linked via a hub site. This hub site is the on-prem glueing routing table. There would be the addition of firewalling capability within this environment so as to be able to govern and allow/disallow traffic between these environments. Another addition could be a load balancer within the glueing on-prem environment.

This load balancer would spray traffic onto either the server IP endpoints in the on-prem DC subnets or the public cloud subnets housing cloud servers. This would mean that there will be public facing IPs which receive the traffic, which is NATed onto private IPs and then load balanced onto the multiple server endpoints, be it in the Public clouds or in the On-Prem DC.

So the load balancer would have the load balanced front end IP to Server IP bindings going towards either a public cloud endpoint or an on-prem endpoint. This would mean that the load balancer connects to the glueing routing table entity as well to send/receive traffic to server IPs.
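The front end IP to server IP bindings can be pictured with a small sketch. Round robin is assumed here purely for illustration, since the post does not say which algorithm is used. The addresses and the `RoundRobinBalancer` class are hypothetical.

```python
from itertools import cycle

# Hypothetical bindings: public front-end IP -> private server IPs,
# one on-prem and two in public clouds.
bindings = {
    "203.0.113.10": ["10.0.1.11", "10.1.1.11", "10.2.1.11"],
}


class RoundRobinBalancer:
    """Hands out the bound server IPs for a front-end IP in rotation."""

    def __init__(self, bindings):
        self._pools = {vip: cycle(servers) for vip, servers in bindings.items()}

    def pick(self, vip):
        # Raises KeyError if the front-end IP has no binding.
        return next(self._pools[vip])


lb = RoundRobinBalancer(bindings)
picks = [lb.pick("203.0.113.10") for _ in range(4)]
print(picks)
```

Whichever server is picked, the load balancer still relies on the glueing routing table to reach it.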

This mix of route, switch, firewall, load balancer is an example of a typical multicloud network connecting multiple public clouds.

Habib

As fresh Pakistani engineers start leaving their country on Washington Accord visas one wonders whether back home Digital Policies are being framed which could be sealing their jobless fates.

Let’s check the numbers. If half of Pakistanis generate only 5 MB of data in one day on a government run Digital Pakistan then it would amount to 500 million MB in a day. This is half a petabyte per day. This will only keep growing. All this data, its processing and its related networking will possibly be run on equipment which will only add to the import bill if Pakistan doesn’t manufacture its own servers. It would also traverse imported networking routers and switches, which would add to the import bill if Pakistan doesn’t manufacture its own network equipment. All of these would also be put in Data centers which could be using racks and cabling that are possibly all imported.
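The arithmetic can be checked in a few lines. The figure of 100 million users is an assumption implied by the 500 million MB total above, and decimal units (1 PB = 1,000,000,000 MB) are assumed.

```python
# Assumption: "half of Pakistanis" is taken as 100 million people,
# which is what the 500 million MB total in the text implies.
users = 100_000_000
mb_per_user_per_day = 5

total_mb_per_day = users * mb_per_user_per_day

# Decimal units assumed: 1 PB = 1,000,000,000 MB.
MB_PER_PB = 1_000_000_000
total_pb_per_day = total_mb_per_day / MB_PER_PB
total_pb_per_year = total_pb_per_day * 365

print(total_mb_per_day, "MB per day")
print(total_pb_per_day, "PB per day")
print(total_pb_per_year, "PB per year")
```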

How many jobs will imported servers, imported switches, imported routers, imported racks, imported DC HVAC and imported Data Center cabling produce? And what will be the import bill of these Digital Pakistan backend items?

Another aspect of these imported items is their lack of Cybersecurity from a National Security perspective. If equipment is imported and is plug and configure only, with unknown hardware and unknown software, it will be considered a black box and totally insecure in terms of Cybersecurity.

A further aspect of these imported items is that each item comes with support contracts in case they fail and have a problem. These are very expensive support agreements with their manufacturers and will add to running cost and yearly import bills.

Now consider that a while back the aeronautical complex in Risalpur launched its own tablet, the PAC PAD Takhti 7. https://en.m.wikipedia.org/wiki/PAC-PAD_Takhti_7. How did that happen and why can’t we make our own Digital Pakistan equipment? How is it possible that Pakistan can make parts of the JF-17 Thunder, indigenously manufacture multiple types of missiles and also make a nuclear bomb, but not make its own servers, routers, switches, DC HVAC and DC cabling?

Much of this IT equipment is now open source. Servers, routers and switches are covered by OCP, and there is MIPSOpen and multiple open source Network Operating Systems. Positive results are really possible if a solid effort is made towards local manufacturing. At the very least, Cybersecurity mandates that the hardware assembly, the software assembly and their System Integration are carried out within Pakistan. This will create jobs and reduce the import bill too.

Let’s hope for the best.

This post seeks to distinguish between the multiple aspects and phases of networking projects. Network Architecture and Network Design are the phases of a networking project carried out first. Then comes the Project Implementation phase along with configurations by Network Engineers.

Some experts have included an Analysis phase as part of, or before, the Network Architecture phase. The concept is that an analysis first needs to be done on the flows expected from the new network.

Before Network Architecture, the Analysis phase consists of gathering the User Requirements, Application Requirements, Application Types, Performance Requirements, Bandwidth Requirements, Delay Requirements etc. After gathering these requirements a Customer Requirements Document (CRD) can be made consisting of all the expectations and requirements from the network. This document will assist with project management throughout the network life cycle and for sufficiently large projects it’s a good exercise.

Once the requirements are gathered a Flow Analysis can be done to identify the flows required from the network. Data Source and Data Sinks, Critical Flows and per Application flows etc. are analyzed as part of Flow Analysis exercise.

Once the requirements are known and flows are known this can lead to decisions regarding the Network Architecture. The Network Architecture term is generally used with the Network Design term as one but according to one definition it is distinguished from Network Design such that the Architecture consists of the technological architecture while the design consists of specific networking devices selected and vendors selected for the architecture to be implemented on ground. This means, for example, that the Network Architecture will deal with whether to use OSPF or ISIS and how to use them and the Network Design will cover which specific vendor router to use. They are closely linked.

Once the flows are known it can be discussed what the architecture can be. This will primarily consist of deciding the protocols, the addressing and the routing architecture which can be used to facilitate the required flows. Once it is decided which network technologies to use for the flows (such as OSPF, ISIS, MPLS, L2VPN, L3VPN, IPSec, BGP, Public Internet, VXLAN, EVPN, Ethernet etc.) a diagram can be made of the architecture. Multiple iterations and permutations of the various architectures will come forward from the discussions over what the architecture could be to facilitate all the flows and provide a resilient network. For each of the protocols listed above, and any other to be used, the tools available in each can be discussed in detail. It can be discussed and decided how the combinations of multiple protocols will be used to serve all the flows and meet the requirements from the network. If there are cloud connectivity requirements it will be discussed how (which protocol) and where to connect to the cloud. Once an architecture is decided, the protocols are selected and the tools within the protocols which are to be used are listed, they can be summed up in a document and in diagrams.

After this phase comes the Design decisions phase. This is close to the architecture phase but this is where the vendor of that OSPF router is selected. This is where the specific router is selected from the multiple router offerings available from the selected vendor. Device vendor selection and specific device selection is a task of its own and is a separate effort in networking projects.

As part of the Design it will also be decided which Service Provider to use for Internet and WAN links, and which service offering will be used from the SP vendor. If the application and system involve Public Cloud use (including Hybrid On-Prem) then it will be decided which specific connectivity mechanism will be used and at which location the cloud will connect. Will it be IPSec over Internet or over Direct Connect, and where and how? Will it be the biggest MPLS VPN provider on the market or a smaller one? Will it be the biggest BGP Internet Transit provider or a smaller one?

Once the requirements are known, the flows are known, the protocols and architecture are known, and the device vendors, device types and SP offerings are known and selected, then comes the implementation phase.

Engineering is a broad term which can encompass all of the above and more, but as things stand here we can say that a Network Engineer, as part of the engineering phase, will configure and deploy the devices, the WAN links, the Internet links, the cloud connectivity VPNs and the interconnections in the network. This network engineering implementation effort comes after the Requirements/Flows/Arch/Design phases, as it is an effort on ground and on site to implement the network and make things run. Up until this phase all the previous phases were on paper, and this one is practical work on the ground.

The previous Requirements/Flow/Protocols Architecture/Design phases and even the initial aspects of the engineering phase can be done in the office in meeting rooms. The initial aspects of the engineering phase, consisting of the configurations and parameters to be used, can also be decided before going out into the field. Once on ground and on site implementation starts, this is an effort of its own and can be considered Project Deployment and Project Implementation. It entails device delivery, WAN link delivery, device power on, WAN link testing, Internet link testing, Cloud VPN delivery, configurations and testing etc. This phase is more akin to technical project management, as it is largely an on ground project coordination effort. This is because of its physical, geographical and on site implementation aspects.

Depending on the type of project the implementation phase can consist of outage windows and maintenance windows and a lot of coordination to implement the new devices and new links.

Hence we can say that a networking project consists of separate requirements gathering, flows analysis, architecture, design and implementation phases. This means that a networking project can be divided into smaller multiple projects each consisting of these above phases. Each phase also requires a skill of its own. For example the Requirements, Flow Analysis, Architecture and Design phases are generally handled by Network Architects, Solution Architects and Network Design Engineers. The configuration and deployments aspect is handled more by Network Engineers and the Project implementation and coordination efforts are handled by Project Managers.

Multiple such large scale projects running simultaneously, with all these phases going on at various levels, would be run under a Program, given that the organization is sufficiently large and that there are multiple streams of such projects being carried out.

I hope you enjoyed the good read.

Happy networking.

Habib

In Networking, Cybersecurity should now be Distributed Cybersecurity. Distributed Cybersecurity is where each organization and each home is independently capable of monitoring its traffic and inspecting it on demand.

This paradigm is where the Wifi AP at home is capable of sending data about its inbound and outbound traffic to somewhere within the home, and where the APs and routers in an office setup are capable of sending their in-out information to somewhere within the local office.

This requires NetFlow capable Wifi Access Points and NetFlow capable CE routers. It also requires the setup of a NetFlow receiving station where the information sent by the Wifi Access Points and Cisco routers in NetFlow format will be received and displayed.
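As a rough idea of what such a receiving station does, the sketch below parses the 24-byte header of a NetFlow version 5 export datagram. The post does not say which NetFlow version is meant, so version 5 is an assumption, as are the `parse_v5_header` and `receive_forever` names and the UDP port 2055 used as a default.

```python
import socket
import struct

# NetFlow v5 header: version, count, sysUptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (24 bytes, network order).
V5_HEADER = struct.Struct("!HHIIIIBBH")


def parse_v5_header(datagram):
    """Decode the header of one NetFlow v5 export datagram."""
    if len(datagram) < V5_HEADER.size:
        raise ValueError("datagram too short for a NetFlow v5 header")
    (version, count, sys_uptime_ms, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id, sampling) = V5_HEADER.unpack_from(datagram)
    if version != 5:
        raise ValueError(f"not a NetFlow v5 export (version={version})")
    return {
        "version": version,
        "flow_count": count,
        "unix_secs": unix_secs,
        "flow_sequence": flow_sequence,
    }


def receive_forever(bind_addr="0.0.0.0", port=2055):
    """Listen for exports from APs and routers and print each header (port is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        while True:
            data, (sender, _) = sock.recvfrom(65535)
            print(sender, parse_v5_header(data))


# Self-made sample header, so the parser can be tried without a real exporter.
sample = V5_HEADER.pack(5, 2, 1000, 1_700_000_000, 0, 42, 0, 0, 0)
header = parse_v5_header(sample)
print(header)
```

Calling `receive_forever()` on a machine inside the home or office would turn it into a very basic collector. Decoding and displaying the flow records themselves is left out of this sketch.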

In Segment Routing the concept of source routing is present:


In computer networking, source routing, also called path addressing, allows a sender of a packet to partially or completely specify the route the packet takes through the network. In contrast, in conventional routing, routers in the network determine the path incrementally based on the packet’s destination

Wikipedia

In prevalent IP networks a per-hop lookup is performed based on the single primary destination address in the packet. Consider a situation where a stack of IP addresses is present per packet and needs to be processed by the intermediate routers. This would place a requirement on the hardware in the line cards. In this hypothetical situation, how deep a stack of addresses can be processed by the router chipsets and hardware?

Similarly the SID Depth or the Maximum SID Depth is a parameter in segment routing enabled network devices. To route from an ingress to an egress, the path selected by the source should be entirely capable of handling the number of SIDs (MPLS labels in SR MPLS) that are pushed onto the packet. Because the path selected by the source is in effect translated into a stack of labels (in SR MPLS), the number of labels that each device in the path can handle is an important design consideration.
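This design consideration can be expressed as a small check. The sketch treats the Maximum SID Depth as a single number per node, which is a simplification, and follows the conservative reading above that every device on the path must handle the full stack. The node names, MSD values, label values and the `path_supports_stack` helper are hypothetical.

```python
# Hypothetical Maximum SID Depth per node along a candidate path.
msd_by_node = {"PE1": 10, "P1": 6, "P2": 4, "PE2": 10}


def path_supports_stack(path, label_stack, msd_by_node):
    """True if every node on the path can handle the whole label stack."""
    limit = min(msd_by_node[node] for node in path)
    return len(label_stack) <= limit


path = ["PE1", "P1", "P2", "PE2"]
short_stack = [16001, 16002, 16003]                 # 3 labels
long_stack = [16001, 16002, 16003, 16004, 16005]    # 5 labels

print(path_supports_stack(path, short_stack, msd_by_node))
print(path_supports_stack(path, long_stack, msd_by_node))
```

Here the node with the smallest depth (P2, with 4) decides what the source is allowed to push, so the five-label stack would need a different path or a shorter encoding.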

Also, in Segment Routing MPLS the SIDs, i.e. the labels, are distributed via the IGP. So an end to end path label stack is supposed to be within a single IGP area, or if multiple routing areas or domains are required then some tricks are needed to push and handle a label that is not distributed by the IGP. Let’s see: an external entity will need to program the ingress source node to push a stack of labels which includes a label not distributed by the IGP. With this as the source, there will be a resultant intermediate destination where at some hop a label will be popped and the next label will not have been learnt via the IGP.

In some way the burden of end-to-end connectivity over multiple hops is being shifted from the distributed IP routing control plane into a central label stack distribution authority.

I wonder if where we had IP Planning and IP configurations we will have label planning and label configurations.

This shifts a portion of the intelligence present on distributed nodes to a central authority.


Information is present in computing platforms in two forms.

– Bits that are stored
– Bits that are traveling and transitioning

Securing bits that are stored and bits that are traveling and transitioning is a task.

These two forms present their own challenges but the bits that are traveling and transitioning i.e. changing forms within the computing platforms have acquired special attention. This is due to the prevalent pervasive communications using information technology computing platforms within society and businesses. When bits transition and travel they are also stored and retrieved from storage so securing both is important.

The only mystery surrounding the field of security is the presence of so many interaction surfaces between hardware layers and software layers through which the transitioning and traveling of bits occurs. From seeing text on the screen with one’s eyes, to thinking about and considering it, to thereafter editing it by hand, there are whole industries at work within the human body which operate without us contemplating them. There are interaction surfaces within the body as well, with the muscular, neural and skeletal systems working together, to name a few.

Within computing platforms as the bits transition back and forth within one component i.e. one isolated CPU, RAM, HardDisk, Operating System and Application Software they present their own security challenge. When instead of isolation the bits travel between 2 such computing systems they present a different set of challenges. When there exists industrial scale, constant, consistent, ongoing back and forth travel and transitioning within milliseconds over large geographies between hundreds and thousands of components of various types it presents a completely different set of challenges.

Interaction surfaces are where bits change hands between subsystems. For example, bits change hands between the operating system and an application running on it, or between one PC and another PC over a network. An interaction surface is where one subsystem’s surface interacts with another subsystem’s surface within the larger system and bits flow. As the field of information technology and computing has evolved and progressed, the number and types of subsystems, their surfaces and their interactions have increased a lot, so much so that securing them has become complicated. Wholesome security is therefore achieved when every time bits change hands, i.e. transition and travel, the interaction is secure. It is secure in the sense that the storage at each end of the change of hands is secure and the medium of exchange is secure.

Now it is simple to state in general English that when one subsystem interacts with another subsystem and bits change hands, the storage points at each end and the medium used for the interaction and travel should be secure. In reality, given the timescale and geographical scale, the sheer number and types of subsystems, storage locations and exchange mediums is so large that encompassing all of them becomes difficult.

Another incision into the security domain is cut deep into the system when the human computer interaction surface appears at various locations and in various forms. This increases the complexity of the whole security domain. Bit to Human interaction surface also needs to be kept secure at each interaction, at each geographical location and every time.

Furthermore another aspect is when one secure system under the ownership of one entity interacts with another system owned by another entity. This is therefore a time when bits are changing hands amongst different owners of them. The time and location of such an interaction surface presented between two separate ownerships also increases complexity. As your bits are stored under the ownership of another entity and accessed and retrieved by other people a whole system of management is required for such inter-ownership bit storage and bit travel interaction surfaces.

I guess a chart showing the whole variety of interaction surfaces within computing would demystify security. The reason for this is that each entry in the chart i.e. each interaction surface would be simply mapped to the precaution and action required for securing it. Each type of interaction surface would require a security precaution and actionable item within the security framework.

Be it an interaction surface where bits are:
– stored in hardware
– being processed by one set of software
– within one computer
– on a server
– in an application
– traveling over a network
– interacting with humans
– being exchanged between different humans
– being exchanged between different entities
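Such a chart could start life as a simple lookup table. The surfaces below are the ones listed above. The three precautions filled in are illustrative placeholders of my own and not a complete or authoritative security framework, and the `unmapped_surfaces` helper is hypothetical.

```python
# The interaction surfaces from the list above.
surfaces = [
    "stored in hardware",
    "being processed by one set of software",
    "within one computer",
    "on a server",
    "in an application",
    "traveling over a network",
    "interacting with humans",
    "being exchanged between different humans",
    "being exchanged between different entities",
]

# Illustrative placeholder entries only: surface -> precaution.
chart = {
    "stored in hardware": "encrypt data at rest",
    "traveling over a network": "encrypt data in transit",
    "interacting with humans": "authenticate the user",
}


def unmapped_surfaces(surfaces, chart):
    """List the interaction surfaces that still have no precaution mapped."""
    return [surface for surface in surfaces if surface not in chart]


gaps = unmapped_surfaces(surfaces, chart)
print(len(gaps), "surfaces still need a precaution")
```

The value of the chart comes from driving the list of gaps down to zero, one interaction surface at a time.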

This is a copy of a previous Linkedin Post Dated June 7 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/opnfv-brahmaputra-systems-integration-nfv-vnfs-lutfullah-kakakhel/

OPNFV Brahmaputra is a Lab ready release of OPNFV. One statement is that community driven Systems Integration really is a hard task to accomplish. This becomes especially true if the systems being integrated to form a larger system are actually multiple large open source projects themselves.

To start with OPNFV aims to integrate systems upon which VNFs can be run.

The caption above is heavy. On the one side there is the requirements generating block of standards bodies, which produce specifications and define how the system is to run. On the other side there are the code producing development projects, which produce open source software. OPNFV stands in the middle and intends to integrate these individual code projects according to the requirements laid out by the standards bodies and provide a system on top of which VNFs can be run and tested. The reason this task is being run under an umbrella membership based organization such as OPNFV is that it is a repetitive task which every organization would need to do over and over again as soon as new releases of code are made available for the individual projects.

It might be difficult to picture this to start with but imagine you want to have a lab ready to run and test VNFs. What is the lab composed of? It will have Infrastructure on top of which VNFs can be run. What is this Infrastructure composed of? This Infrastructure will be composed of hardware and a virtualisation layer and hypervisors and networking projects such as OpenDaylight and Openstack and KVM and Ceph all running together to provide a block of Infrastructure virtual compute network and storage (An NFVI Point of Presence) on which VNFs can be run.

Every organization which wants to reach the level of testing VNFs will need such a lab. And then what happens when a new version of OpenDaylight is released or a new version of Openstack is released or KVM or Ceph? Everybody needs to update their labs. OPNFV is a Linux Foundation project which intends to be the focal point of these activities and perform them jointly instead of everybody doing them individually.

It also helps make the system work. A patch to OpenDaylight could work well within OpenDaylight but could break things at the System layer when integrating with the rest of the components which make up an NFV lab (to be used to run VNFs). OPNFV aims to be the first systems layer at which such patches can be spotlighted and returned to the project they came from, informing them that at the system level things get disjointed.

OPNFV, according to its initial white paper, aims to make this systems testing environment in line with the NFV Architecture Reference Points Vi-Ha, Vn-Nf, Nf-Vi, Vi-VNFM & Or-Vi.

After the above is clear the figure below can be understood to be a larger system composed of individual projects integrated together with the aim of running VNFs. In the figure below OpenDaylight is one piece (in network), KVM is another piece (in compute), Openvswitch is another piece and Openstack is also one piece. All these when put together provide the infrastructure to run VNFs. Also to be noted is that in the case of OPNFV there are community labs (Pharos Labs) which provide the hardware.

The presence of this combined effort also means that for Network Operators the differentiation in the market is in Service Orchestration. The Virtual Network Functions and the Network Services run on top of them.

 

References:

http://www.etsi.org/technologies-clusters/technologies/nfv

http://www.slideshare.net/CiscoDevNet/devnet-1162-opnfv-the-foundation-for-running-your-virtual-network-functions

https://www.opnfv.org/brahmaputra

http://www.slideshare.net/OPNFV/opnf-brahmaputra-an-early-look

https://www.opnfv.org/sites/opnfv/files/pages/files/opnfv_whitepaper_092914.pdf

https://www.youtube.com/watch?v=Dh55McgHGQ8

This is a copy of a previous Linkedin Post Dated June 7 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-mwc-2016-syed-habib-lutfullah-kakakhel/

ETSI showcased a practical implementation of NFV at the Mobile World Congress 2016. They showed the whole NFV Architecture being implemented and run to provide a SIP voice call. An end to end communication service of a SIP call was made based on a vIMS platform. This vIMS is an NFV VNF orchestrated by an NFV Orchestrator and run on top of Infrastructure controlled by an Openstack based VIM. Let’s see the components and how they made the NFV based SIP voice call.

There are two NFVI PoPs (Points of Presence) or two VIMs. One is Openstack controlled and the other is controlled by openvim (part of the OpenMano package). Both are controlled by the Open Mano NFVO for resource orchestration. The Service Orchestration is performed by Ubuntu’s Juju. The launchpad of Rift.io is used as the triggering mechanism for resource orchestration and service orchestration. 6wind provides the PEs showcasing corporate VPN interconnectivity. Telefonica provides the traffic generator to test the bandwidth capacity of the PE links and Metaswitch provides the VNF vIMS Clearwater to be run atop the infrastructure.

The figure below shows details:

A multi-site corporation’s network is shown running connected via 3 PEs. One site which is connecting to PE 3 has the VNF deployed in VIM 2, which is another Data Center. One NFVI PoP labelled VIM 1 is hosting the 6wind PEs while the second NFVI PoP labelled VIM 2 is hosting the VNF. There is inter-DC communication going on between the two NFVI PoPs. The figure below shows the logical communication path of the SIP voice call. The IMS SIP signaling is implemented in VIM 2 in the Metaswitch Clearwater vIMS.

More details can be seen here.

ETSI’s new initiative is delivering an open source NFV Management and Orchestration software stack which is set to take attention away from the MANO and turn it into a given piece of software. This puts more focus on the VNFs. The message could be that Service Orchestration using VNFs is therefore to be the focus of attention for Telco organizations.

References:

https://osm.etsi.org/

http://www.etsi.org/technologies-clusters/technologies/nfv

https://networkbuilders.intel.com/docs/E2E-Service-Instantiation-with-Open-Source-MANO.pdf

https://www.youtube.com/watch?v=JJlxwJStkTk

This is a copy of a previous Linkedin Post Dated June 7 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-telco-vepc-solutions-syed-habib-lutfullah-kakakhel/

In telecom networks the option to place an LTE vEPC stands out as an exemplary demonstration of NFV’s application. The figure below gives the generic NFV Architecture. It can be divided into 3 main sections:

  1. The Management and Orchestration
    1. Consists of NFV Orchestrator, VNF Manager and Virtualized Infrastructure Manager.
  2. The NFVI – NFV Infrastructure
    1. Consists of Hardware, Virtualization Layer and Compute, Storage, Network Virtualization Software
  3. The VNFs – i.e. the virtual network functions.

The type of function the VNF provides shows what this NFV network delivers. That is if the NFV network delivers as a Network Service an LTE Core end to end communication then there will be EPC functionality implemented and provided by the VNF part of the NFV network.

See the figure below from an ETSI Proof of Concept work.

It shows the vendor CYAN providing the NFV Orchestrator (NFVO) and VNF Manager (VNFM). Redhat and Openstack provide the Virtualized Infrastructure Manager (VIM). The figure also shows the relevant Infrastructure hardware and hypervisor software solutions. Finally it shows the VNFs as being Connectem’s vEPC. If the VNF was implementing a different functionality, say a vIMS, then the rest of the components in the figure could be the same and the end to end Network Service being provided by the NFV network would be different. Therefore the function implemented inside the VNF defines what service the NFV network provides, and the work done by the VNF decides whether your NFV network is Telco or Enterprise; LTE or WiMAX; LTE or 3G etc.
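The point that only the VNF changes the delivered service can be modelled in a few lines. This is a toy data model and not any ETSI or vendor API. The `NfvDeployment` class, the `network_service` helper and the service names in the lookup table are hypothetical, while the component names come from the Proof of Concept described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NfvDeployment:
    nfvo: str   # NFV Orchestrator
    vnfm: str   # VNF Manager
    vim: str    # Virtualized Infrastructure Manager
    vnf: str    # the function actually being run


# Hypothetical lookup: which Network Service a given VNF delivers.
SERVICE_BY_VNF = {"vEPC": "LTE core", "vIMS": "IMS voice"}


def network_service(deployment):
    """The delivered service depends only on the VNF, not on the MANO or VIM."""
    return SERVICE_BY_VNF.get(deployment.vnf, "unknown")


poc = NfvDeployment(nfvo="CYAN", vnfm="CYAN", vim="Openstack", vnf="vEPC")
ims = replace(poc, vnf="vIMS")  # same orchestration and infrastructure, different VNF

print(network_service(poc))
print(network_service(ims))
```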

For a list of possible Telco VNFs see the figure below.

Every chunk of functional blocks could be implemented together as a VNF. So what Connectem is doing is implementing the LTE MME, SGW, PGW, HSS and PCRF functionality and packaging it as a VNF which can be run atop virtualized infrastructure. In the ETSI Proof of Concept, Connectem’s solution does this according to the NFV specifications, so that the VNF can be managed by a VNFM, its infrastructure can be managed by a VIM, and all of this can be controlled and coordinated by an NFVO, i.e. the NFV Orchestrator. Therefore you get LTE EPC functionality inside the virtualized NFV environment.

The ETSI definition of a VNF is that a VNF is “a Network Function capable of running on NFV Infrastructure (NFVI) and being orchestrated by a NFV Orchestrator (NFVO) and VNF Manager”.

Coming back to our vEPC example the VNF has components. These VNF Components (VNFCs) can logically be pictured as below:

ETSI specifies that a vendor can choose to implement components as they wish inside the VNF environment, as long as the VNF speaks to the other NFV architecture components over the defined VNF interfaces. This means that the different components can utilize efficient compute, storage and networking procedures instead of the communication methodology defined by the standards body. An example is that inside the vEPC software the MME will communicate with the SGW but will utilize an efficient computational methodology instead of the 3GPP defined interfaces. If for some reason (say a lab environment) a vendor chooses to implement the 3GPP interfaces inside their vEPC, it won’t be as fast and as efficient but it can be used to showcase 3GPP communications inside NFV.

Good VNFC software design is what will distinguish different providers of vEPC software solutions.

References:

http://www.etsi.org/technologies-clusters/technologies/nfv

https://www.opennetworking.org/images/stories/sdn-solution-showcase/germany2015/CENGN%20-%20NFV-based%20LTE%20Core%20in%20the%20Cloud.pdf

http://nfvwiki.etsi.org/images/NFVPER%2814%29000010r2_NFV_ISG_PoC_Proposal-E2E_vEPC_Orchestration.pdf

This is a copy of a previous Linkedin Post Dated May 16 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-mano-management-orchestration-syed-habib-lutfullah-kakakhel/

MANO is the brain of the NFV Network. It is the part of the network through which control operations are performed on virtual network functions and virtual network functions infrastructure.

One set of v-eNB, vMME, vSGW, vPGW and vPCRF can be assumed to be a Network Service. Each of these provides a distinct Network Function which, with the ‘v’, is deployed as a Virtual Network Function on the Virtual Network Functions Infrastructure. The Virtual Network Functions Infrastructure is hardware with a virtual abstraction layer providing virtualization. These are the acronyms.

Multiple virtual network functions are connected together, or chained together, to provide a network service. The physical links are in the infrastructure which is the compute/storage hardware equipment while the logical links are among the VNFs. The endpoint is the Network Service endpoint which is providing service to the end devices.  Between the physical links and the logical links sits the virtualization layer.
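The chaining idea above can be pictured with a small Python sketch. The class names `VNF` and `NetworkService` are made up for illustration and are not taken from any ETSI data model; the chain used is the LTE user plane path (v-eNB to vSGW to vPGW) purely as an example.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class VNF:
    name: str  # illustrative label such as "vSGW"


@dataclass
class NetworkService:
    name: str
    chain: list[VNF] = field(default_factory=list)

    def logical_links(self) -> list[tuple[str, str]]:
        """Return the logical links between consecutive VNFs in the chain."""
        return [(a.name, b.name) for a, b in zip(self.chain, self.chain[1:])]


# Hypothetical service: the user plane part of an LTE network service.
lte_user_plane = NetworkService("LTE user plane", [VNF("v-eNB"), VNF("vSGW"), VNF("vPGW")])
print(lte_user_plane.logical_links())
```

The logical links exist only between the VNFs; which physical servers host each VNF is left to the virtualization layer and does not appear in the model.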

The NFVO i.e. the NFV Orchestrator is the part of the network which controls the deployment and operations of virtual network functions.

This is a copy of a previous Linkedin Post Dated May 24 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-independance-from-hardware-lock-in-lutfullah-kakakhel/

NFV is simple. Its most simplistic distinction is that it is the Telecom Operators’ name for hardware independence and software dependence. Hardware is locked in while software is more easily changed (a project manager would say: relative to hardware that is).

We can try to see what problem NFV seeks to solve.

Telecom operators faced a dilemma about hardware. “To launch a new network service often requires yet another variety (of hardware) and finding the space and power to accommodate these boxes is becoming increasingly difficult; compounded by the increasing costs of energy, capital investment challenges and the rarity of skills necessary to design, integrate and operate increasingly complex hardware-based appliances.”

The sentence starts with “to launch a new network service often requires yet another variety” (of hardware). Remember they want to compete with the Whatsapp’s and Viber’s of tomorrow and need agility of deployment.

NFV seeks to provide that ‘Agility of Deployment’ of new network services to Network Operators by taking away dependency on proprietary and vendor locked in hardware. That is the high level purpose.

The rest is architecture. Hardware can be any compute(r) node with associated storage (types) and an accompanied (inter)network of such devices.  Then it follows to make virtual services; Virtual Network Functions with Virtual Network Infrastructure.

To roll out software or new software for a new service is easier than to roll out hardware.

Another primary benefit is elasticity in energy consumption. Energy consumption according to demand. With more control of hardware, which is the energy consuming physical device, via dependence on software this is made possible.

Providing Layer 2 VPN and Layer 3 VPN services has been a requirement of enterprises from Service Providers. Similarly Data Center networks need to provide Layer 2/3 Overlay facility to applications being hosted.

EVPN is a new control plane protocol to achieve the above. This means it coordinates the distribution of IP and MAC addresses of endpoints over another network, and it has its own protocol messages to provide this endpoint address distribution mechanism. In the Data Plane traffic will be switched via MPLS label next hop lookups or IP next hop lookups.

To provide for a new control plane with new protocol messages providing new features BGP has been used. So it is BGP Update messages which are used as the carrier for EVPN messages. BGP connectivity is first established and messages are exchanged. The messages exchanged will be using BGP and in them EVPN specific information will be exchanged.
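As a rough picture of "EVPN information carried inside BGP", here is a minimal Python sketch. The AFI/SAFI values (25 for L2VPN, 70 for EVPN) and route type 2 (MAC/IP advertisement) are the registered values; everything else, including the class name, the dict layout and the addresses, is a simplified assumption and not the wire format.

```python
from dataclasses import dataclass

AFI_L2VPN = 25  # address family used for EVPN
SAFI_EVPN = 70  # subsequent address family for EVPN


@dataclass(frozen=True)
class EvpnMacIpRoute:
    """Simplified stand-in for an EVPN MAC/IP advertisement (route type 2)."""
    rd: str
    mac: str
    ip: str
    next_hop: str
    route_type: int = 2


def build_update(routes):
    """Wrap EVPN routes in a dict shaped loosely like a BGP UPDATE (illustrative only)."""
    return {"afi": AFI_L2VPN, "safi": SAFI_EVPN, "nlri": list(routes)}


# Hypothetical endpoint learned behind a PE with loopback 192.0.2.1.
route = EvpnMacIpRoute(rd="192.0.2.1:100", mac="00:11:22:33:44:55",
                       ip="10.0.0.10", next_hop="192.0.2.1")
update = build_update([route])
print(update["afi"], update["safi"], update["nlri"][0].mac)
```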

The Physical layer topology can be a leaf spine DC Clos fabric or a simple Distribution/Core setup. The links between the nodes will be Ethernet links.

One aspect of EVPN is that the terms Underlay and Overlay are now used. Underlay represents the underlying protocols on top of which EVPN runs. These are the IGP (OSPF, ISIS or BGP) and MPLS (LDP/SR). The underlay also includes the Physical Clos or Core/Distribution topology which has high redundancy built into it using fabric links and LACP/LAGs. The Overlay is the BGP EVPN virtual topology itself, which uses the underlay network to build a virtual network on top. It is the part of the network which relates to providing tenant or VPN endpoint reachability, i.e. MAC address or VPN IP distribution.

It’s a new protocol and if you look at the previous protocols there is little mechanism to provide all-active multihoming capability. This refers to one CE being connected via two links to two PEs, with both links active and providing a traffic path to the far end via ECMP and multipathing. Two-chassis multichassis LAG has been one option, but it is proprietary per vendor and brings particular virtual chassis link requirements and limits. Per-flow load balancing from an ingress PE to multiple egress PEs using BGP multipathing is also newly enabled by EVPN.
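Per-flow load balancing can be sketched as hashing a flow's 5-tuple and using the result to pick one of the egress PEs. This is only an illustration of the general idea; real devices use their own hardware hash functions, and the function name and addresses below are invented.

```python
import hashlib


def pick_egress_pe(flow, egress_pes):
    """Pick an egress PE for a flow by hashing its 5-tuple.

    The same flow always maps to the same PE, so packets of one flow stay in order,
    while different flows spread across all available PEs.
    """
    key = "|".join(str(part) for part in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(egress_pes)
    return egress_pes[index]


# Hypothetical multihomed site reachable through two PEs.
pes = ["PE1", "PE2"]
flow = ("10.0.0.10", "10.0.1.20", 6, 49152, 443)  # src IP, dst IP, protocol, src port, dst port
print(pick_egress_pe(flow, pes))
```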

There is also little mechanism in previous generation protocols to provide efficient fabric bandwidth utilization for tenant/private networks over meshed-style links. Previous protocols provide single active and single paths and required LDP sessions and tunnels for full mesh over a fabric. MAC learning in BGP over underlay provides this in EVPN.

Similarly there is no mechanism to provide workload (VM) placement flexibility and mobility across a fabric. EVPN provides this via Distributed Anycast Gateway.

 

Two edge computing specifications are present.

Facebook OCP’s CG-OpenRack-19 and LinkedIn’s Open19.

They provide for Rack Layouts, Compute, Storage and Networking.

Networking for CG-OpenRack-19 is copied below. The server sleds in the pictures appear to be single homed as per the colors. It would be interesting to find out which protocol handles the Active Active state of the multi homed Compute and Storage Sleds, if that is at all present.

OpenRack-19

Given here

Open19 gives 100G bandwidth capabilities and some details are on its website.

5G’s edge ultra low latency requirements could require edge solutions and it would be interesting to see how things play out ahead.

This also brings to mind SD-WAN because these edge racks will be at least connected in a large WAN.

Google’s B4 is its software defined inter data center WAN solution. Google’s Espresso is its peering edge solution. Espresso links into the B4 domain via B2. This link has the details of Espresso as shared by the Google team.

 

Google-Espresso-B4.JPG

Google is not employing an army of networking engineers to run these because they are software defined and programmed bots will probably be doing operational tasks. To operate this network there are Site Reliability Engineers though.

Here is one public job advertisement that relates as to what an SRE is expected to be like:

We have reliable infrastructure and can spin up new environments in a couple of hours. Automate everything so there is more time for exploring and learning. Foster the DevOps mindset

What are our goals?
  Internationalisation
  Deploying multiple data centers
  Deploying every 5 minutes
Requirements
  Experience with Java or JavaScript in a Dockerised environment
  Linux Engineering/Administration
  Desire for improving processes
  Have a passion and most importantly, a sense of humour
Tech Stack (you DO NOT need experience in all of these)
  Kubernetes + Docker
  Terraform + Ansible
  Linux
  Kotlin + NodeJS
  ELK stack
  AWS

This is obviously an SRE for the servers side and the application enablement side of things. If there is a large software defined edge network like Espresso and a large Edge-to-DC network like B2 and a large software defined inter-DC network like B4 you will need a different SRE.

Here is Google’s version of a Site Reliability Engineer Job.

Job description
Minimum Qualifications

BS degree in Computer Science or related technical field involving coding (e.g. physics or mathematics), or equivalent practical experience.
3 years of experience working with algorithms, data structures, complexity analysis and software design.
Experience in one or more of the following: C, C++, Java, Python, Go, Perl or Ruby.

Preferred Qualifications

Systematic problem-solving approach, coupled with effective communication skills and a sense of ownership and drive.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Ability to debug and optimize code and automate routine tasks.

About The Job

Hope is not a strategy. Engineering solutions to design, build, and maintain efficient large-scale systems is a true strategy, and a good one.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google’s services—both our internally critical and our externally-visible systems—have reliability and uptime appropriate to users’ needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.

SRE is also a mindset and a set of engineering approaches to running better production systems—we build our own creative engineering solutions to operations problems. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.

We can see that Google’s SRE Job Ad is all software along with large scale distributed systems requirements.

Now if we note this extract from the Wikipedia SD-WAN article:

“With a global view of network status, a controller that manages SD-WAN can perform careful and adaptive traffic engineering by assigning new transfer requests according to current usage of resources (links). For example, this can be achieved by performing central calculation of transmission rates at the controller and rate-limiting at the senders (end-points) according to such rates”

and we also note this extract:

“As there is no standard algorithm for SD-WAN controllers, device manufacturers each use their own proprietary algorithm in the transmission of data. These algorithms determine which traffic to direct over which link and when to switch traffic from one link to another. Given the breadth of options available in relation to both software and hardware SD-WAN control solutions, it’s imperative they be tested and validated under real-world conditions within a lab setting prior to deployment.”

We see Algorithms.
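To make "central calculation of transmission rates" a little more concrete, here is a minimal sketch of one possible allocation rule: max-min fair sharing of a single link. The quoted text does not say which algorithm any controller uses, so this rule, the function name and the numbers are all assumptions chosen for illustration.

```python
def max_min_fair(capacity, demands):
    """Split link capacity among senders using max-min fairness (water-filling).

    Senders asking for less than an equal share get what they ask for;
    the leftover is shared equally among the remaining senders.
    """
    allocation = {}
    remaining = float(capacity)
    pending = dict(demands)
    while pending:
        share = remaining / len(pending)
        satisfied = {name: d for name, d in pending.items() if d <= share}
        if not satisfied:
            # Everyone left wants more than the equal share: cap them all at it.
            for name in pending:
                allocation[name] = share
            break
        for name, demand in satisfied.items():
            allocation[name] = demand
            remaining -= demand
            del pending[name]
    return allocation


# Hypothetical 10 Gbps link shared by three senders (demands in Gbps).
rates = max_min_fair(10, {"a": 2, "b": 4, "c": 8})
print(rates)
```

In this picture the controller computes `rates` centrally and each sender rate-limits itself to its own entry.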

It’s clear that there are different algorithms running these Software Defined networks (Google’s software defined Espresso, B2, B4 and Jupiter). These algorithms automate, kick in and optimize. Google becomes a large scale distributed system with various algorithms here and there. While Software Architects and Software Engineers will have developed these algorithmic nodes and programmed them into network devices/servers, an SRE is the human who will operate the system. A team of SREs.

One aspect of Networking protocols is that they are for multi-vendor, multi-enterprise and multi-domain environments. They provide simple consensus to connect two or more different network devices.

To take a merchant silicon network device like OCP’s Wedge and OCP style servers and make one large network like Google’s out of them will require software engineering to remake at least the NOS (Network Operating System) part. There will be at least a Meta-NOS, running somewhat on top of a typical NOS, which would handle the SDN (software defined) algorithms, in addition to the SDN controllers talking to this Meta-NOS. Multiple layers of SDN controllers will be talking to each other and you can call this a network protocol or an SDN algorithm, but it will be part of a distributed systems software architecture and it will be programmed in place by software engineers.

Large Scale Distributed System on Merchant Silicon Hardware – Software Defined Meta-NOS – SDN Controllers – Hierarchical SDN Controllers – Algorithms.

Sounds like a Program Management task instead of a PMP scale Engineering Project Management task. You will need Mathematicians to sit with Network Architects, Distributed Systems Architects and Software Architects. The Mathematicians will give the algorithms. They will be important too.

Fun times.

Terabit scale networking requires better Consensus.

Autonomous Networks and Autonomic Networking can be renamed as solving Consensus Dynamics.

Wikipedia States (Nov’ 2018):

“Consensus dynamics or agreement dynamics is an area of research lying at the intersection of systems theory and graph theory. A major topic of investigation is the agreement or consensus problem in multi-agent systems that concerns processes by which a collection of interacting agents achieve a common goal. ”

To note again it is an ‘intersection of systems theory and graph theory’.

Let’s not forget that mathematically communications networks are Graphs. An OSPF/ISIS network is a weighted directed graph where the costs & metrics are the weights, the network devices are vertices and the ethernet L2 links are directed edges.
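The graph view can be made concrete with a short Python sketch: routers are vertices, link costs are edge weights, and a shortest path computation (Dijkstra's algorithm, the kind of calculation link-state protocols such as OSPF and IS-IS perform) gives the cost to every destination. The four-router topology and its costs are invented for illustration.

```python
import heapq


def shortest_path_costs(graph, source):
    """Dijkstra's algorithm over a weighted directed graph given as {node: {neighbor: cost}}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist


# Hypothetical topology: each directed edge carries the cost of the outgoing interface.
topology = {
    "R1": {"R2": 10, "R3": 1},
    "R2": {"R1": 10, "R4": 1},
    "R3": {"R1": 1, "R4": 5},
    "R4": {"R2": 1, "R3": 5},
}
costs = shortest_path_costs(topology, "R1")
print(costs)
```

Here R1 reaches R2 more cheaply through R3 and R4 (cost 7) than over the direct link (cost 10), which is exactly the kind of decision the weights exist to drive.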

Furthermore, note again that ‘a collection of interacting agents achieve a common goal’. In networks the common goal can be to enable end to end, host to host connectivity over a vast network. TCP and UDP.

Interesting times ahead for Terabit scale networks. Keeps the fun alive in network engineering.

References:

https://en.wikipedia.org/wiki/Consensus_dynamics

EncyclopAI07.final.pdf

 

 

 

Simple NAPALM Use – A Python based abstraction layer capable of multivendor support. Part of a screen scraping solution.

Ansible ConfigMgmt / Jinja2 Templates – Part of a CLI automation system solution which can be called sophisticated screen scraping. via SSH. No on-device agent or service.

Salt / NAPALM Logs – Event driven network automation. This is also a CLI automation solution and can be part of screen scraping from network device perspective. Event driven by NAPALM logs.

Netconf or Restconf with YANG – Connectivity is via Netconf/Restconf (JSON/XML) while configuration is via YANG data modeling available on device (which is a service on device).  Not screen scraping or CLI automation as YANG is a data modeling language providing service and can be used to extract and push state at device.

SDN based Cisco ACI like: Northbound Rest API on APIC controller and Southbound OpFlex with an OpFlex agent on device. On device Policy Element abstraction service.
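Of the options above, the template driven one is the easiest to sketch. The snippet below uses Python's standard library `string.Template` purely as a stand-in for Jinja2, to show the idea of keeping device variables separate from the config template. The template text, variable names and interface values are all hypothetical and not tied to any vendor.

```python
from string import Template

# Stand-in for a Jinja2 template: placeholders are filled from per-interface variables.
INTERFACE_TEMPLATE = Template(
    "interface $name\n"
    " description $description\n"
    " ip address $ip $mask\n"
)


def render_config(interfaces):
    """Render one config stanza per interface dictionary."""
    return "".join(INTERFACE_TEMPLATE.substitute(item) for item in interfaces)


# Hypothetical variables, the kind of data that would live in a YAML vars file.
interfaces = [
    {"name": "Ethernet1", "description": "uplink-to-spine1", "ip": "10.0.0.1", "mask": "255.255.255.252"},
]
config = render_config(interfaces)
print(config)
```

The rendered text would then be pushed to the device over SSH by the automation tool, which is why this family of options counts as CLI automation.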

Good reference links for exploring the above:

Ansible + Jinja2 Option:

https://networkotaku.wordpress.com/2017/10/24/network-configuration-templates-with-ansible-and-jinja2/

Salt + NAPALM Abstraction Option:

17-RIPE76_-Event-driven-network-automation-and-orchestration.pdf

https://mirceaulinic.net/2017-10-19-event-driven-network-automation/

https://my.ipspace.net/bin/list?id=xNetAut181#SALT

Links for Restconf + YANG Option:

https://networkop.co.uk/blog/2017/02/15/restconf-yang/

https://networkop.co.uk/tags/ansible-yang/

https://packetpushers.net/podcast/pq-show-116-practical-yang-network-automation/

Links for SDN Style Cisco ACI – Like:

https://wiki.opendaylight.org/view/OpFlex:Opflex_Architecture

https://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-731302.html

https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/2-x/rest_cfg/2_1_x/b_Cisco_APIC_REST_API_Configuration_Guide/b_Cisco_APIC_REST_API_Configuration_Guide_chapter_01011.html

 

This post is not in English. 🙂

L2, ETH, M ADD, SRC/DST, M TABLE, ARP TABLE, GARP, VLAN, BCAST DOM, MTU, PMTUD. AppPMTUD, JUM FRAMES, 1500, 9000.

STP, ROOT Port, BPDUs, block, listen, learn, forward, disable, RSTP, MSTP, TCN BPDUs,

Swi, Hub, Rtr, Gway,

IP, SUBNET, ETH TY, SRC ADD/DST ADD, TTL, IP Add(Net:Host), CIDR

MPLS, LDP, LSP, LBL, EXP, PHP, FRR

TE, RSVP, LBL ST, LSP. Backup LSP via MPLS TE.

L3VPN, ROUTE, PE-CE , LBL Stack, VPNV4, MPBGP, AFI/SAFI L2VPN, RD,RT, VRF

VLL,L2VPN, VPLS, ELAN, TunnMTU,

OSPF, NBMA, MCAST, AREA, DR, BDR, Areas, LSDB, Rte Sum/Rte Fil b/w areas, Stub (No AS Ext LSA) , TotStub(No Sum, No Ext) , NSSA (Type 7 Ext Transit) , LSA, NET LSA, RTR LSA, EXT LSA, ASBR, ABR, SUMM LSA, ASBR LSA, ASBR SUMM LSA, SPT, HELLO, DEAD TIME, RTR ID,

BGP, TCP 179, TELN 179, PING, FW 179, NHRI, ROUTE, COMMUNITY, B PATH, LNG PREFIX M, WEIGHT, LPEF, ASPATH, IGP/EGP/LOCAL, PREPEND, BlaHol COMM, NHOP, PIC, PREFIX FIL, AS PATH FILT, COMMUN RPL LPREF Actions,

RR, CLUSTER, IBGP, B PATH, vRR

RIB,FIB, AD DIST,

ACL, DENY, ALLOW,

ASIC, CEF, RP, MEM, LC, CBAR FABRIC, LINE RATE, CUT THROUGH.

DC, CLOS, LEAF, SPINE, IGP, IP, LOOPBACK, BGP, ROUTE, VXLAN, MACinUDP, VRF,
MPBGP, EVPN, MAC-ROUTE, MAC VRF, EVI, ESI, SDN, BGP LS, PCEP, BDR LEAF, L3OUT, VRF, MPBGP, PE-CE, DC L2OUT, BD, VLAN, MTenant, BRI DOM, END POINT, 1-MAC/+IPSubnets.

TMETRY, PULL, GRPC L4, SNMP PUSH.

ANSIBLE, NETCONF/YANG, APIs, YAML, WSPACE, HUMreadable, XML/HTML, JSON, JINJA2, PYTHON.

TCP, SYN, ACK, SYN ACK, ECN echo bit, FIN, RST, SL WDOW, PORT, SEQ, P BIT/ImmSend, MSS, RELIA, Ordered.

UDP, CLESS, SPORT, DPORT, Length, checksum. DNS, DHCP, SNMP, Jitter, latency VoIP, unordered, mcast, bcast.

DNS LB, GSS, GLB, ANYCAST GWAY.

DHCP DORA, IP, BCast, Unicast.

DCI, L2/L3, EoMPLSoGREoIP, L2VPN, VPLS, EVPN, MP-BGP VRF.

SAN, SYN/ASYN, latency, DWDM/CWDM, iSCSI, FCoIP.

Workload Mobility, MobIP, LISP, ProxyIP.

QoS, DiffServ, MPLS EXP, E-LSP, L-LSP, per Class DiffServ Aware MPLS-TE vi RSVP Sig.

Linux, Expect, BASH, Python, AWK, SED, GREP, CRON, VIM, NANO.

Scri, D-Types:No, Stri,List (mutable),Tuple (immutable), Dict (key,value), Variables, Arrays(C), List-Stack (LIFO), List-Queue (FIFO).

Cond Prg, If, Ifelse, ElseIf, NestIf, While (Condition True) , For (Iterations known), Break, Continue, For Else, built-in functions, User-def-functions. Library, Framework, local vari, global var.

 

Layer 1, Layer 2, Layer 3 and Layer 4. Physical, Link Layer/MAC Layer, Network Layer, Transport Layer.

Physical is Physics, which keeps improving and allowing higher bandwidth limits.

Layer 2 sitting atop physical involves Links/Mediums access control mechanisms between devices. It provides bits data transfer over the physical connectivity and involves payloads/addresses.

Layer 3 involves connecting multiple networks and forming an internetwork which further provides network level end-to-end connectivity.

Layer 4 involves end to end host level connectivity.

Within Layer 2 we have software enabled Virtual LANs, we have loop avoidance via Spanning Tree, we have Link aggregation via LACP etc.

Within Layer 3 we have IGP,EGP,VPN,SP,DC,WAN,TE, QoS and what not.

Within Layer 4 we have TCP,UDP; Connection-oriented/Connection-less; Flow-control, windowing; Reliability, acknowledgements, sequencing; Error control, checksum; Port numbers, etc.

Layer 2 and Layer 4 are relatively ‘localized’. Layer 2 due to its physical/link level vicinity and layer 4 due to its in-host & between-host proximity. While there is science in these layers it is somewhat local.

Layer 3 involves much geography. It is the domain which deals with providing end-to-end connectivity spanning much space and area. With this comes much management. It entails reachability across multiplexed systems via addressing, reliability via multi-pathing, reachability status communication, path preferencing in multipath options, path avoidance, virtual privacy and isolation across multiplexed systems, time management for timely fault tolerance and fault bypass, geolocation based path selection and load balancing, etc.

Hence the birth of large-scale Internetworking Protocols.

Protocols which are engineered to have some mechanisms built-in & agreed upon, while some options require configurations.

Autonomous Networks which have all the mechanisms built-in and require no configurations are not present at the moment, except perhaps somewhere inside Google et al.

For now we have to sift and select between options and configurations for making data flow.

An automated multi-tenant data center network is an increasingly desired end goal for large and small organizations including providers. Servers that house the CPU, RAM and Hard Disk resources are serving traffic for applications they host. These servers need connectivity among themselves within the data center and also towards the outside world.

At first an organized set of CPU/RAM/HD Servers is connected to a network device. This happens in a Data Center rack and the network device is a ToR, a Top of Rack switch. Another similar set of servers is connected to another Top of Rack network device. Multiple such sets of servers/network device pods are then linked together. The incumbent way to do this would be to make a leaf-spine Clos fabric. The layer of network devices connecting the servers is the leaf layer and the layer of network devices connecting these leaf nodes is the spine layer.
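A small Python sketch of the leaf-spine layout described above, assuming the usual arrangement where every leaf connects to every spine. The device names and counts are made up.

```python
from itertools import product


def build_leaf_spine(num_leaves, num_spines):
    """Return leaf names, spine names and the leaf-to-spine links of a 2-tier fabric."""
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    spines = [f"spine{i}" for i in range(1, num_spines + 1)]
    links = list(product(leaves, spines))  # every leaf connects to every spine
    return leaves, spines, links


def leaf_to_leaf_paths(src_leaf, dst_leaf, spines):
    """Each spine offers one equal-length path between two different leaves."""
    return [(src_leaf, spine, dst_leaf) for spine in spines]


leaves, spines, links = build_leaf_spine(num_leaves=4, num_spines=2)
paths = leaf_to_leaf_paths("leaf1", "leaf3", spines)
print(len(links), "fabric links;", len(paths), "paths from leaf1 to leaf3")
```

Adding a spine adds one more equal-length path between every pair of leaves, which is where the fabric's bandwidth and redundancy come from.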

Hardware is thus laid out in a 2-stage or 3-stage Clos fabric and then we need to lay out a logical control plane to pass traffic. Applications on the Server CPU/RAM/HD will talk to each other within the DC which is east-west traffic or to the outside world which can be called north-south traffic.

Depending on the type of application east west traffic could be higher but north south traffic is always present.

Moving bits from a server to any other location is the network’s job. These bits could be a compute hosting virtual machine’s bits or a ‘Serverless’ cloud application’s bits, but they go somewhere and are moving. They are moved by the network layer regardless of what resides on the servers.

How many layers of protocols and software are required to provide for an automated multi-tenant data center network which can connect servers, host applications and provide east-west/north-south connectivity ?

In the Networking Components blog post some basic networking components were listed out in a different construct: Network Device, Protocols, Protocol Messages, Addresses, Lookup tasks, Identity Tags, Filters & Actions, Network Over Network ( Overlay) Appended Information, Network + Network , Network Inside Network Device, Control and Data Plane.

In the Event-Driven Network Automation blog automation details were described.

The below will make some use of the networking components and event-driven network automation blog posts.

At first you need Addresses appended onto payload bits to ascertain endpoints and exchange traffic. How many layers of addresses will be required to connect the servers to each other over a fabric? In a full mesh structure the networking layer is small/direct and fewer addresses are required. In a Clos Leaf-Spine-Leaf fabric multiple layers of addresses are required.

A packet/frame structured bits data structure is switched across multiple nodes. In terms of Addresses, Ethernet MACs are used for Layer 2 connectivity between server NICs and ToR ports. The server could also have an IP Address of its own and be performing Layer 3 communications.

One server connected with one leaf could send an IP packet to another server connected with another leaf (Server<>Leaf<>Spine<>Leaf<>Server). As part of the Control Plane of laying out the fabric, the leaf and spine network devices will have IP addresses of their own which will speak to each other and send Control Plane Protocol Messages. What this implies is that there will be 2 layers of IP communications: one between the network nodes themselves and one between the servers. This in turn implies the requirement to have one IP address pushed on to another IP address in a tunnel type structure, where from one network device to another (e.g. leaf to leaf via spine) the packet is routed based on the Outer IP Addresses and the inner address is used by the server. Therefore some packets will require an addressing structure such as IP|Eth|IP|Eth. The IP Tunnel will span from a Leaf to another Leaf via the Spine, therefore the tunnel endpoints are at the Leaf switches. (Server-IP<encapsulation>Leaf-IP<>via Spine <>Leaf-IP<decapsulation>Server-IP)
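The outer/inner addressing can be sketched in a few lines of Python. This is a toy model of the tunnel idea only: the `Packet` class, the function names and the addresses are invented, and no real encapsulation format (VXLAN or otherwise) is being reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    payload: object  # application data, or another Packet when tunneled


def encapsulate(inner, local_leaf_ip, remote_leaf_ip):
    """Ingress leaf: wrap the server's packet in an outer header addressed leaf to leaf."""
    return Packet(src_ip=local_leaf_ip, dst_ip=remote_leaf_ip, payload=inner)


def decapsulate(outer):
    """Egress leaf: strip the outer header and recover the server's original packet."""
    return outer.payload


# Hypothetical servers behind two different leaves.
server_packet = Packet("10.1.1.10", "10.2.2.20", "hello")
tunneled = encapsulate(server_packet, "192.0.2.1", "192.0.2.2")
print(tunneled.dst_ip, "->", decapsulate(tunneled).dst_ip)
```

The spines only ever look at the outer addresses (the leaf tunnel endpoints); the server addresses travel untouched inside the payload.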

We have multiple combinations of communications to deal with in multiple layers of the networking stack.  Leaf-Local L2, Leaf-Local L3, Leaf-Spine, Leaf-Spine-Leaf L2, Leaf-Spine-Leaf L3.  All this calls for multiple domains. A ‘Local’ Link Layer Domain, A Local Network Layer Domain, A Distant Network Layer Domain, A relatively distant Link Layer Domain. A link layer domain could be an L2 VLAN/broadcast domain or a bridge domain and a network layer domain could be a local VRF or a wider-spanning IP-in-IP domain level routing instance.

… A routed layer has IP addresses at two endpoints and an Ethernet link has MAC addresses at two end points. A virtual machine of a tenant in a server can have both an IP address and a MAC address. There could also be a single virtual machine having IPs from multiple subnets behind the same MAC address ethernet link. This virtual machine is an endpoint and this is what the network layer needs to provide connectivity to. Therefore we could say that an endpoint requires at least 2 tables at the network device it is connecting to: an IP Routing table and a MAC table. An ARP table is also required for Inter-Layer discovery. There is also the Leaf-Spine-Leaf IP-in-IP tunnel we spoke about, which adds another layer of overlay Routing Table. In addition an outer IP to inner IP socket-style mapping function will be required, which is another table (L4 Socket of Outer IP to Inner IP).

Discovering the places of destination-address lookup-actions happening in a network always helps discover the kind of networking happening.

So a Leaf-Local L2 frame (a server sends to another server connected to the same leaf) would be switched locally with the local bridge domain/mac table. A Leaf-Local L3 packet would be routed by the local VRF. A Leaf-Spine-Leaf Packet would be mapped to the relevant far-end leaf tunnel endpoint and a tunnel endpoint IP would be pushed on it; it would then be tunneled/IP routed across the spine to the destination leaf; the destination leaf would then look at the socket-style mapping table of the destination endpoint; it would then pass onto the final destination endpoint.
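The three cases above can be sketched as one lookup function on a leaf. It is deliberately simplified: lookups are exact matches instead of longest prefix matches, the MAC table is assumed to hold only locally attached endpoints, and the table contents and names are hypothetical.

```python
def forward(dst_mac, dst_ip, mac_table, vrf_table, tunnel_map):
    """Decide how a leaf handles a frame, mirroring the three cases in the text."""
    if dst_mac in mac_table:
        return ("switch", mac_table[dst_mac])  # Leaf-Local L2: local bridge domain / MAC table
    if dst_ip in vrf_table:
        return ("route", vrf_table[dst_ip])    # Leaf-Local L3: routed by the local VRF
    if dst_ip in tunnel_map:
        return ("tunnel", tunnel_map[dst_ip])  # Leaf-Spine-Leaf: push the far-end leaf tunnel IP
    return ("drop", None)


# Hypothetical tables on one leaf.
mac_table = {"00:00:00:00:00:0a": "port1"}
vrf_table = {"10.1.2.20": "port2"}
tunnel_map = {"10.2.2.20": "192.0.2.2"}  # remote endpoint IP -> far-end leaf tunnel endpoint

print(forward("00:00:00:00:00:0a", "10.1.1.10", mac_table, vrf_table, tunnel_map))
print(forward("00:00:00:00:00:ff", "10.2.2.20", mac_table, vrf_table, tunnel_map))
```

Only the third table needs information about other leaves, which is the part a control plane has to populate.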

While the Leaf-Local communications can be handled within the network device by tables, mappings and local lookups, it is obvious that when crossing the spines and reaching for a far end leaf there is a need for a control plane to communicate the far end addresses and mappings. A Protocol to exchange the distant leafs addresses and mappings which establishes the control plane for traffic to be switched and routed between leafs across the spines. There is a spine in the middle and the leafs are not directly connected. A Control Plane to distribute addresses and mappings.

There is a choice here.

For this Leaf-Spine-Leaf addresses exchange & inner/outer mappings population we could use a distributed, nuke-tolerant, internet style packet layer protocol OR instead use an SDN style central controller to do the thinking and push/program the network devices with all the addresses and mappings. The devices need to be populated with far end addresses and mappings and both will achieve this goal.

Our topic is an Automated Multi-Tenant Data Center Network and the automation part of the name is supported by the SDN style.

Why ?

The reason is that any distributed, nuke-tolerant, internet style protocol inherently requires independent configurations on all networking nodes, which then enable the devices to start communicating, while an SDN controller is a single configuration point which pushes the configs onto the devices. This means that from an automation standpoint you will be automating the configurations of either hundreds of devices or one SDN controller. Configuring all devices in a large data center fabric independently is difficult to automate, while managing the automation of an SDN controller, or even levels of SDN controllers, is easier.
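The "single configuration point" argument can be sketched as a toy controller that programs every leaf with endpoint-to-tunnel mappings. The classes `Leaf` and `Controller` and their methods are invented for illustration and do not represent any real controller's API.

```python
class Leaf:
    def __init__(self, name):
        self.name = name
        self.tunnel_map = {}  # remote endpoint IP -> far-end leaf tunnel IP

    def program(self, mappings):
        """Accept mappings pushed down by the controller."""
        self.tunnel_map.update(mappings)


class Controller:
    """Single point that knows every endpoint and programs all leaves."""

    def __init__(self, leaves):
        self.leaves = list(leaves)
        self.endpoints = {}

    def register_endpoint(self, endpoint_ip, leaf_tunnel_ip):
        self.endpoints[endpoint_ip] = leaf_tunnel_ip
        for leaf in self.leaves:
            leaf.program({endpoint_ip: leaf_tunnel_ip})


# Hypothetical fabric with three leaves; one API call updates all of them.
fabric = [Leaf("leaf1"), Leaf("leaf2"), Leaf("leaf3")]
controller = Controller(fabric)
controller.register_endpoint("10.2.2.20", "192.0.2.2")
print([leaf.tunnel_map for leaf in fabric])
```

From the automation side there is one call to script (`register_endpoint` here) no matter how many leaves the fabric has.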

This Data Center will need to speak to the outside world too.

This means that there will be a border functionality which will provide L3 and L2 reachability to the outside world. i.e. Ethernet L2 connectivity; VLANS or bridge domains, extended from a server in a leaf to a border node leaf and onwards to an outside world L2 construct, say an MPLS L2VPN.

Similarly an L3 VRF extension  where a set of routes of an endpoint/server/tenant are stretched onto a border node leaf’s VRF via say MP-BGP style RD/RT mechanism where they are further extended onto an outside world MPLS L3VPN via a PE-CE routing protocol. (Tenant-Routes|VRF|MP-BGP|VRF| VRF-PE <> CE|Outside World).
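The RD/RT mechanism mentioned above boils down to a filter: a VRF imports a route only if the route carries at least one route target that the VRF is configured to import. A minimal sketch, with made-up RD/RT values and prefixes:

```python
def import_routes(import_rts, advertised_routes):
    """Return the prefixes a VRF accepts, based on route target matching."""
    wanted = set(import_rts)
    return [
        route["prefix"]
        for route in advertised_routes
        if wanted & set(route["route_targets"])
    ]


# Hypothetical MP-BGP advertisements from two tenants.
advertised = [
    {"rd": "65000:1", "prefix": "10.1.0.0/24", "route_targets": ["65000:100"]},
    {"rd": "65000:2", "prefix": "10.2.0.0/24", "route_targets": ["65000:200"]},
]
tenant_a_routes = import_routes(["65000:100"], advertised)
print(tenant_a_routes)
```

The route distinguisher keeps overlapping tenant prefixes unique inside MP-BGP, while the route target alone decides which VRF a route lands in.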

Our topic also contains the word Multi-Tenant which means that in the case of L3 Multi-tenancy a legacy MPLS L3VPN style VRF/MP-BGP mechanism will be needed per tenant per VRF.

A similar mechanism is required for connecting two or more such large data centers to each other, so that “Endpoint<>Leaf<>Spine<>Border-Leaf<>|Infra-Link|<>Border-Leaf<>Spine<>Leaf<>Endpoint” communications become possible. For L3, VRFs and MP-BGP can provide separation and ensure multi-tenancy for this Inter-DC communication.

When required, some border leafs will connect to routers which speak eBGP to the outside world, i.e. to other Autonomous Systems over Transit and Peering connections. These routers will hold the global routing table and will be the gateways to the rest of the world. The PE-CE routing mentioned above for VRF stretching can be Static/OSPF/EIGRP/BGP.

To run this Automated Multi-Tenant Data Center, let’s not forget an overarching Orchestration software residing atop it, providing a GUI into the wide array of options, tools, cogs and combinations that enable tenants’ Intra-DC, Inter-DC and outside-world communications.

Enabling application endpoints to communicate via a network requires a whole bunch of protocols in the networking layer. Different protocols provide different functionality, each one a brick in the wall that achieves the end goal of endpoint communication.

Physically, after the transceivers have delivered ordered bits into a memory location in a network device, those bits are digested. What the network device needs to digest could be any of a number of control plane or data plane datagrams.

It could be a Layer 2 MAC / Ethernet frame aimed at information transfer within the local area. It could be an ARP control plane frame. It could carry IP reachability information, like an OSPF or IS-IS control plane packet. It could be a TCP handshake packet or a TCP payload packet. It could be a UDP packet. It could be a BGP Update message providing next layer (IP) reachability information. It could be an MPLS labeled packet being switched across an IP core network.
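To make the digestion step a little more concrete, here is a minimal Python sketch that labels a received frame by looking at its EtherType, then the IP protocol number, then the TCP port. It assumes an untagged Ethernet II frame and only knows a handful of types; the classify and sample_frame helpers are illustrative names, not part of any library.

```python
import struct

ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x8847: "MPLS unicast"}
IP_PROTOCOLS = {6: "TCP", 17: "UDP", 89: "OSPF"}


def classify(frame: bytes) -> str:
    """Return a rough label for an untagged Ethernet II frame."""
    if len(frame) < 14:
        return "truncated"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    kind = ETHERTYPES.get(ethertype, "other L2 payload")
    if kind != "IPv4":
        return kind
    ip = frame[14:]
    if len(ip) < 20:
        return "IPv4 (truncated)"
    header_len = (ip[0] & 0x0F) * 4              # IHL field, in 32-bit words
    proto = IP_PROTOCOLS.get(ip[9], "other")     # byte 9 is the protocol number
    if proto == "TCP" and len(ip) >= header_len + 4:
        src_port, dst_port = struct.unpack("!HH", ip[header_len:header_len + 4])
        if 179 in (src_port, dst_port):          # BGP runs over TCP port 179
            return "IPv4/TCP/BGP"
    return f"IPv4/{proto}"


def sample_frame(ethertype, payload=b""):
    """Build a test frame with zeroed MAC addresses (illustration only)."""
    return bytes(6) + bytes(6) + struct.pack("!H", ethertype) + payload


def ipv4_header(protocol):
    """A zeroed 20-byte IPv4 header with only version/IHL and protocol set."""
    return bytes([0x45, 0]) + bytes(7) + bytes([protocol]) + bytes(10)


bgp_packet = ipv4_header(6) + struct.pack("!HH", 50000, 179)
print(classify(sample_frame(0x0806)))              # ARP
print(classify(sample_frame(0x0800, bgp_packet)))  # IPv4/TCP/BGP
```

A real device does this in hardware or in a fast-path parser, but the layered lookup is the same idea.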

It depends.

Regarding reachability, Wikipedia states:

“In graph theory, reachability refers to the ability to get from one vertex to another within a graph. A vertex s can reach a vertex t (and t is reachable from s) if there exists a sequence of adjacent vertices (i.e. a path) which starts with s and ends with t.

In an undirected graph, reachability between all pairs of vertices can be determined by identifying the connected components of the graph. Any pair of vertices in such a graph can reach each other if and only if they belong to the same connected component. The connected components of an undirected graph can be identified in linear time. The remainder of this article focuses on the more difficult problem of determining pairwise reachability in a directed graph. ”

It’s interesting that mathematically a network is a Graph and a networking device is a Vertex but we’re blogging on networks and not on math.
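Still, the undirected case from the quote is short enough to sketch. The Python below treats devices as vertices and links as edges, labels connected components with a breadth-first search (linear in the number of devices plus links), and then answers reachability by comparing labels. The device names and the links list are invented for the example.

```python
from collections import deque


def connected_components(links):
    """Map every device to a representative of its connected component."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    component_of = {}
    for start in adjacency:
        if start in component_of:
            continue
        component_of[start] = start
        queue = deque([start])
        while queue:                      # breadth-first search from 'start'
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component_of:
                    component_of[neighbor] = start
                    queue.append(neighbor)
    return component_of


def reachable(component_of, a, b):
    """Two devices can reach each other iff they share a component."""
    return (
        a in component_of
        and b in component_of
        and component_of[a] == component_of[b]
    )


links = [("leaf1", "spine1"), ("leaf2", "spine1"), ("leaf3", "spine2")]
components = connected_components(links)
print(reachable(components, "leaf1", "leaf2"))   # True
print(reachable(components, "leaf1", "leaf3"))   # False
```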

BGP

BGP neighbors are manually configured and use a TCP connection on port 179 to exchange IP routing information. The most common use is on the wider Internet, where transit providers use BGP to exchange the IP routes of connected networks. A large service provider which sells internet transit uses BGP to peer with other similar service provider networks and with server hosting providers. BGP can also be leveraged to advertise information other than IP routes, e.g. MAC routes in EVPN.

Practically speaking, any two routers with an established BGP session send Update messages to add and withdraw IP prefixes (routes) together with the routes’ attributes (AS Path, Community etc.).

BGP has a full finite state machine in which a session transitions from the Idle state to the Established state. Starting in Idle, it moves to Connect while the TCP connection is being set up (or to Active, where it listens for the peer to connect), then to OpenSent once the TCP connection is up and an Open message has been sent, then to OpenConfirm once the peer’s Open message has been received and accepted, and finally to Established when a Keepalive message confirming the Open arrives; thereafter Keepalive messages are exchanged periodically. A Notification message, by contrast, reports an error and closes the session. When an Open message is received its contents are checked to decide whether a BGP session with this peer should be established. The primary information in the Open message includes the BGP version number, the AS number, the hold time, the BGP router ID and the optional parameters. The optional parameters contain TLVs which advertise capabilities, such as the multiprotocol (MP-BGP) extensions to be used between the peers.
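The state machine can be sketched as a small transition table in Python. This is a heavily simplified reading of the BGP FSM (timers, collision handling and most error events are left out), and the event names are invented labels. It follows the standard behaviour in which a Keepalive confirms the Open and moves the session to Established, while anything unexpected, including a Notification, drops it back to Idle.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.IDLE, "start_passive"): State.ACTIVE,
    (State.ACTIVE, "connect_retry_expired"): State.CONNECT,
    (State.CONNECT, "tcp_established"): State.OPEN_SENT,      # Open is sent here
    (State.ACTIVE, "tcp_established"): State.OPEN_SENT,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,   # Keepalive is sent here
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}


def next_state(state, event):
    # Any event not listed (e.g. a Notification or an error) resets to Idle.
    return TRANSITIONS.get((state, event), State.IDLE)


def run(events, state=State.IDLE):
    for event in events:
        state = next_state(state, event)
    return state


happy_path = ["start", "tcp_established", "open_received", "keepalive_received"]
print(run(happy_path).value)                               # Established
print(run(happy_path + ["notification_received"]).value)   # Idle
```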

Once the session is established, Update messages are sent carrying the routing information and route attributes. An Update message can cause the BGP route table to be updated and the route table version number to increment. An Update message contains withdrawn (unfeasible) routes, path attributes and NLRI, which are the IP prefixes being advertised. Path attributes such as AS_PATH, LOCAL_PREF and MED are carried in the Update message.
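A minimal Python sketch of how one Update is digested might look like the following. It models only the three parts named above (withdrawn routes, path attributes, NLRI) as plain Python values rather than the wire encoding, and the version counter is just an illustrative change counter, not a protocol field. The Update and PeerRib names and the prefixes are assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Update:
    withdrawn: list = field(default_factory=list)    # prefixes no longer reachable
    attributes: dict = field(default_factory=dict)   # e.g. AS_PATH, MED
    nlri: list = field(default_factory=list)         # prefixes sharing those attributes


class PeerRib:
    """Routes learned from one neighbor (a simplified per-peer table)."""

    def __init__(self):
        self.routes = {}    # prefix -> attributes
        self.version = 0    # bumped on every change, for illustration

    def apply(self, update):
        for prefix in update.withdrawn:
            if self.routes.pop(prefix, None) is not None:
                self.version += 1
        for prefix in update.nlri:
            # Every prefix in the NLRI gets its own copy of the attributes.
            self.routes[prefix] = dict(update.attributes)
            self.version += 1


rib = PeerRib()
rib.apply(Update(
    attributes={"AS_PATH": [65001, 65002], "MED": 0},
    nlri=["203.0.113.0/24", "198.51.100.0/24"],
))
rib.apply(Update(withdrawn=["198.51.100.0/24"]))
print(sorted(rib.routes), rib.version)   # ['203.0.113.0/24'] 3
```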

iBGP, as opposed to eBGP, is used to communicate routes within an Autonomous System. The AS_PATH is treated differently in the case of iBGP: a router adds its own AS number to the path only when it is speaking to an eBGP peer, and does not add it when speaking to an iBGP peer. Otherwise the receiving iBGP router would see its own AS in the path and drop the route, assuming a loop. Because the AS_PATH therefore cannot detect loops inside the AS, a route learned from one iBGP peer is not advertised on to other iBGP peers. So either a full mesh of iBGP sessions is required, so that every router learns every destination, or Route Reflectors can be used to peer with the iBGP speakers and reflect routes between them. Routes received from a client in an RR setup are reflected to other clients and to non-client neighbors.
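The AS_PATH rule is easy to sketch in Python. The AS numbers are private-range values picked for the example, and the advertise and accept helpers are hypothetical names: the path is prepended only towards eBGP peers, and a received route is rejected when the local AS already appears in it.

```python
LOCAL_AS = 65000  # hypothetical local AS number


def advertise(as_path, peer_as, local_as=LOCAL_AS):
    """Return the AS_PATH to send to a peer: prepend only towards eBGP peers."""
    if peer_as == local_as:           # iBGP peer: AS_PATH is left untouched
        return list(as_path)
    return [local_as] + list(as_path)  # eBGP peer: prepend our own AS


def accept(as_path, local_as=LOCAL_AS):
    """Loop prevention: reject a route whose AS_PATH already contains our AS."""
    return local_as not in as_path


print(advertise([65010], peer_as=65000))   # [65010]         (iBGP)
print(advertise([65010], peer_as=65020))   # [65000, 65010]  (eBGP)
print(accept([65020, 65000, 65010]))       # False, our AS is already in the path
```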

One of the mechanisms in BGP is best path selection. If an IP prefix is reachable via multiple paths, BGP walks through an ordered list of if-else steps to select one best path and advertises that.

The best path selection criteria are given below.
1) Weight (Cisco locally assigned – higher weight preferred)
2) Local Preference – prefer path with higher local pref
3) Locally originated (Cisco – network or aggregate routes originated on this router preferred)
4) Shortest AS_PATH (prefer path with shorter AS path)
5) Lowest origin type (IGP < EGP < Incomplete)
6) Lowest multi-exit discriminator (MED)
7) eBGP over iBGP
8) Lowest IGP metric to the BGP next hop
9) …
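The first eight steps can be sketched as a single sort key in Python, where the tuple order mirrors the list order and the path with the smallest key wins. This is a simplification and not any vendor’s actual implementation: for instance, MED is compared here between all paths, whereas routers by default compare it only between paths from the same neighboring AS. The Path fields and sample values are assumptions.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}


@dataclass
class Path:
    next_hop: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "IGP"
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0


def preference_key(path):
    """Smaller tuple means more preferred; order follows steps 1 to 8."""
    return (
        -path.weight,                 # 1) higher weight wins
        -path.local_pref,             # 2) higher local preference wins
        not path.locally_originated,  # 3) locally originated wins
        len(path.as_path),            # 4) shorter AS_PATH wins
        ORIGIN_RANK[path.origin],     # 5) IGP < EGP < Incomplete
        path.med,                     # 6) lower MED wins
        not path.is_ebgp,             # 7) eBGP wins over iBGP
        path.igp_metric,              # 8) lower IGP metric to next hop wins
    )


def best_path(paths):
    return min(paths, key=preference_key)


paths = [
    Path("192.0.2.1", local_pref=100, as_path=(65001,)),
    Path("192.0.2.2", local_pref=200, as_path=(65002, 65003, 65004)),
]
print(best_path(paths).next_hop)   # 192.0.2.2, local pref beats the shorter AS path
```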

Another aspect of BGP is route filtering and route manipulation via Community attributes. A community is sent in a numbered format, e.g. 6939:400, to influence path selection on the far-end neighbor. For example, if one neighbor sends the 6939:400 community to another neighbor, the receiving side will set the Local Pref of the route to 400, based on a previously agreed understanding. This is achieved by if-then-else route policies at the receiver end. Commonly used communities include Local Pref setting communities and blackhole communities.
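A receiver-side policy of this kind could be sketched in Python as below. The mapping from 6939:400 to Local Pref 400 stands in for the previously agreed understanding between the two neighbors, and the route dictionary layout and function name are assumptions; 65535:666 is the well-known BLACKHOLE community from RFC 7999.

```python
BLACKHOLE = "65535:666"  # well-known BLACKHOLE community (RFC 7999)

# Hypothetical agreement between the two neighbors: community -> local preference.
LOCAL_PREF_COMMUNITIES = {"6939:400": 400}


def apply_inbound_policy(route):
    """Return a modified copy of the route, based on the communities it carries."""
    result = dict(route)
    communities = set(route.get("communities", ()))
    if BLACKHOLE in communities:
        # Traffic towards this prefix should be dropped by the receiver.
        result["action"] = "blackhole"
        return result
    result["action"] = "accept"
    for community, local_pref in LOCAL_PREF_COMMUNITIES.items():
        if community in communities:
            result["local_pref"] = local_pref
            break
    return result


tagged = apply_inbound_policy(
    {"prefix": "203.0.113.0/24", "local_pref": 100, "communities": ["6939:400"]}
)
print(tagged["action"], tagged["local_pref"])   # accept 400
```

Real route policies are written in the router’s own policy language, but the if-then-else shape is the same.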

Another aspect of BGP is multihoming and traffic load balancing. If one autonomous system is multihomed to another autonomous system, it will use Local Pref, communities and AS Path prepending to influence traffic.

BGP has also been used as an IGP alternative in massive-scale data center deployments using Clos fabrics.

BGP is flexible, scalable, stable and reliable, but it is slow to converge, has limitations in terms of load balancing, and requires large CPU/TCAM resources when routing tables are large.

 
