In IT there are different types of works. Not everyone realises this and not everyone knows which field of work they are in and which they are not in. For example an operations engineer who has worked in operations for a considerable time of his career might find it difficult to adjust in a deployment role or an architecture position. If you are in one line generally the hiring party knows your line and try to select a person from within that line. There is a difference between running an already built system and setting up a system. There is a difference between setting up a system already designed and designing of a new system. The design of the system is dependant on the requirements from that system and the tools and protocols you have at hand. The requirements of the application dictate the design. This work of requirement analysis and design is different from installing the system. Furthermore design and install are both different from running a system continuosly. These are all different skills and some people know all three and settle in one while some people know only one and work within that one. Generally these are titled Operations which is running a system and Deployment which is installing a system and Architecture which is designing a system.

Installation and Deployment fall under project execution work and Architecture and Design fall under project planning. Running a system and operations is generally considered non-project. It is good for an engineer to know the line of business he is in and choose to either intelligently acquire further skills within his line or acquire the skills of another line and move into that one. An Operations engineer might work in Ops for a few years and learn design skills and try to move into Deployment Project work. He may then move onto Design and Architecture work. Operations work are normally 24/7 all week 365 days of the year and this require weekend shift and night shifts and oncall work. Project and Deployment work are generally day time office hours work but the site installation work is somtimes done after hours in a planned maintenance window. Architecture and Design is mostly 9-5 business hours work. Some operations roles are now done internationally in a follow-the-sun manner across countries and across timezones. This means that in one country when it is daytime their engineers are on call and are running the system and after sunset another country wakes up and engineers in that timezone are handling the operations in their daytime. This is called follow-the-sun operations and in organisations running like this ops work also is in daytime only. Normally these are large organisation spanning the globe with presence in multiple countries.

Habib

I came across a job ad titled Systems Reliability Engineer which turns out to be a sort of a hybrid engineer skillset. Its details are copied below. Bear with me while I break things down.

The hybrid part in this is that it requires a combination of:

   – Linux/Unix/Virtualisation which falls generally under SysAdmin roles.

   – Networking which falls under Network Engineer roles

   – Storage and Server which generally falls under Storage/Backup Engineer

   – Kubernetes which is a container orchestrator and will provide a platform for a distributed application. This is a new field but I think its safe to say that Devops Engineer or Platform Engineer role titles handle this responsibility.

   – AWS/Azure/GCP Cloud which are Public cloud IaaS, PaaS or FaaS services.  This falls under Cloud Engineer or Devops Engineer.

A combination of the above knowledge bank is required to function as a Systems Reliability Engineer here.

And so we can say that a Systems Reliability Engineer is composed of a SysAdmin, Network Engineer, Storage Engineer, Devops Engineer, Platform Engineer and Cloud Engineer.

Can we break this down a bit more?

Starting with the Application workload, suffice to assume that the heavy weight applications which this guy will support require a networked distributed system to run. They are cloud native microservices based applications requiring a networked distributed system to run. The application needs CPU cores, RAM, Storage, IOPS, Bandwidth at such a scale.

Digging in further it can be observed that the individual components require an OS and Virtualisation (Linux/Unix/KVM etc – Sysadmin). Networking these individual components require L2/L3 networks (Routers, Switches – NetEng) and further on what can be called a Distributed System OS is required which will present not individual components but the servers/OS/router/switches/vswitches/ storage combo to the application. Kubernetes can be said to be the Distributed System OS providing orchestration and management of namespaces/containers. A distributed file system and storage servers will also be present. Certain parts of the application may be interacting with public clouds (AWS/Azure/GCP) to run certain workloads on public cloud instead of on the local infrastructure.

Oh dear, what a combination of knowledge bank and skillset this person needs!

In Computer Science we work on the principle of Abstraction Layers where there are layers which have science and phenomenon within themselves and then they provide a function or service to another layer. And so the whole system is composed of multiple Abstraction Layers interacting with each other. In this case this Systems Reliability Engineer requires knowledge spanning multiple Abstraction Layers. Traditional engineers have been functioning within their own Abstraction Layer. Their specific jobs have been complicated enough to require tips and tricks of that same abstraction layer to make things work. An engineer working in the networking abstraction layer knows how to troubleshoot links, routing, SFPs etc and an engineer working the SysAdmin layer knows what to do with the Linux OS, KVM etc etc. Similarly an engineer working on the Public Cloud may actually know the tips and tricks of 1 or 2 public clouds and not all 3. Kubernetes and container management is itself now an Abstraction Layer.

This job advertisement not only lists multiple abstraction layers but even within them it lists multiple tools. For example within Virtualisation it lists KVM, ESXi and HyperV all 3 famous hypervisors and within Public Cloud it lists GCP, Azure and AWS all three. So not only does it span abstraction layers but even within abstraction layers it is asking for familiarity with multiple versions of software.

In IT Operations knowing the right command or the right place to click sometimes matters a lot. Things dont proceed if you dont know the command or dont know where to click or what parameter to enter. Spanning Abstraction Layers and multiple tools within Abstraction Layers is a tricky job for IT Operations. I am guessing they will have a team and will manage the skillset of the team and not individual engineers. Multiple engineers with basic knowledge of the system and specific knowledge of 1 or 2 Abstraction Layers and 2 or 3 tools. The team level skill set management would be an important aspect here.

The rest of the job description suggests this is an operations job as they required full work week availability and troubleshooting skills as well. So this new hybrid engineer will be tasked with on shift troubleshooting work supporting customers and speaking to vendors etc. It is important to note that this is not a Project Deployment or Professional Services job where you are reviewing designs, testing solutions, submitting BOMs, reviewing equiptment lists, counting item, installing systems and configuring systems from scratch. This is an Ops tshooting break-fix role. As such it requires a troubleshooting mindset and will require sufficient knowledge of the systems functions and the individual components to identify which part of the system is causing a bug or service impact. Once you identify which part is broken (eg networking or virtualisation) then you might need to dig a bit deeper and review some logs within that component to a certain level. Thereafter they will make an intelligent decision on either actions to fix the component or whom next to contact to fix the problem. Each individual component will have their own level 3 support structure and vendor and this Systems Reliability Engineer will identify whether networking is broken or virtualisation is broken or storage is broken etc etc. He will then attempt a certain level of fix and if not then consult the right team or vendor.

As such when we look at the multiple skill sets required it looks very very complicated for one person to know all this. From my experience of 13 years IT still ongoing we are still in a siloed world where possibly a network engineer with a ccnp is progressing towards senior network engineer and CCIE or maybe only diversifieng with an AWS or Azure skill. A comprehensive non-siloed cross abstraction layer engineer with kubernetes, storage, public cloud, networking, virtualization, linux knowledge will probably be difficult to find because from what I see a lot of people are comfortable within their abstraction layer and such diversity is not necessary and is a big headache. Within networking which is my field I feel that network engineers are probably proceeding with deeper design knowledge or AWS/Azure diversification or Python Network Automation knowledge as a career path. Same might be true for say engineers within the Virtualization / Sysadmin layer who might be developing inside that abstraction layer. Further tricky is the part that you need this cross abstraction layer engineer to have ops and troubleshooting mindset willing to do shifts on weekends. There will be few people out there. Perhaps some incentives might be required to find the right diverse engineer working weekends. Incentives like permanent work from home or any nearby country accepted working the right timezone etc.

These are the new Hybrid Engineers.

Update: I later came to know that they have mentioned that they require 2 or 3 out of the skill set. So it appears they aee dividing skillset on a team level.

Habib

Privacy Engineering has become a new subject. Recent implementation of the GDPR law by Europe and similar laws following up elsewhere are affecting the internet. The internet is a beast which is difficult to control. There are forums and platforms on it where collaboration is fast and speedy which is difficult to dictate and control. A persons otherwise private information is stored on servers in distant places.

Recent concerns include that of children and their online exposure. Children and youth become members at various platforms on the internet and their online activity is stored on the servers of the service they use. This means that their messages, writings, exchanges, search queries and any and all activity is present in some form on the internet. Whichever service they use will have networks spanning the globe and this data might well be present in a different country. This means that the childrens and youths data of one country is present and stored (and possibly analyzed) in another country. This will be objectionable to the elders of that country who might want to retain the data of their citizens within their country. To this end it appears that Data Centers might need to be established in various countries which enact laws to keep their citizens data in-country.

One more concern is that large corporations of another country might have the data of your citizens. This means that they will have to comply to the laws of the country they belong to. This therefore means that any legal request for Data will have to be complied with and so your countries citizens data could well be legally shared to another country’s state agencies. Its a mixed bag of items. You have access to a free online service which shows you ads and provides free connectivity and free collaboration services but as a result it has your data.

One thought process is ‘Who Cares’. As long as you are a normal peace loving citizen of any country who goes online to express ordinary acceptable views according to acceptable standards then who cares if your Data is here, there or elsewhere? Another thought process is ‘I do care’ and don’t want my Data to be present anywhere at all and don’t want my data to be accessible by any entity of any state.
The fundamental question arises whether one considers the state as a friend or as a foe and whether one trusts the state or doesn’t trust the state. The fundamental question arises whether one consider their own state or the foreign state with their data trust-able or not. These are all valid questions and one persons view will be different from another persons view on this subject.

Do you want to be blackmailed or maligned at any point in your life based on any of your past internet activity ?
Do you want your state to be able to blackmail or malign you based on your internet activity?
Do you want another country to be able to blackmail or malign you based on your internet activity?
Is your online data from your past online activity such that if it goes public it could create problems for you ? Or is it such that you are afraid that if it goes in any enemies hands they could blackmail you ?
Do you want your children to be unknowingly doing stuff online at a young age for which they could be maligned or blackmailed at a later date ?
These are also valid questions and one persons view on this may vary from another persons view.

From these questions arises the field of Privacy Engineering where the technical, societal and legal aspects of privacy are raised. These are indeed difficult but valid questions. If a person with ill intent gains access to your online data this could definitely affect your life.

The internet is a battleground for control and a battleground for civilizations competing as well. One persons acceptable content is another persons evil and unacceptable content. What is evil content and what is good content is not agreed upon between different groups of humans. Therefore they compete on the content on the internet.

The Internet also poses one of the most difficult situations existing on the planet for parents. Parents across the globe who are knowingly or unknowingly defenders of the human spirit in their children find it difficult to put their children on the internet. It is such a challenging exercise to preserve the human spirit of your child if their heart and mind gets exposed to evil inputs from the internet. Parents knowingly or unknowingly care for their childs human spirit and want to raise a good human being. This means that the eyes and ears of their children do not get exposed to evil content and evil inputs. The eyes and ears are direct paths to the mind and the heart and affect the hearts of the children and affect their spirit and affect the childs character. Access to the internet poses a grave risk and allows possible access to otherwise dangerous content.

This poses a difficult situation where an attractable instrument exists in everyones pocket which exposes their childs eyes, ears, heart and mind to content which they consider evil, inappropriate and unacceptable. At a young age if the child starts browsing content on the internet and their immature mind gets exposed to information or content which they are not yet fit to absorb they then begin to think about things which perhaps they cant handle at that age. This is one of the most difficult situations on the planet for parents.

This presents a field which could be named as Internet Morality Engineering. Parents I guess would be interested in Morality Engineering. (Morality: principles concerning the distinction between right and wrong or good and bad behavior)

This is again a mixed bag of items where one mans morality is another mans immorality. Again the Internet here becomes a battleground for a clash of civilizations and a clash of values and a clash of cultures.

It is indeed a sad situation at the moment in the worldwide community that the internet is affecting the children and the youth of the planet in negative ways. It comes as a mixed bag where one could use the internet to gain insight into any subject technical, scientific or social but it could negatively affect childrens thoughts, mindset and character as well.

The internet and technology is possibly or definitely a cause of child and youth mental health problems. Exposure at a young immature age to mature content will most definitely lead to mental health problems. On this issue again the states of the world are inactive and the children and youth are getting exposed to everything on the net.

At what age should a child have a smartphone with internet access ? This is one of the most difficult questions for parents. The Morality and Privacy concerns regarding the internets content put much stress on the parents of the world.

It is an unfortunate state of affairs regarding the internet industry today that there aren’t much good controls present which would control childrens and youths internet exposure. States and governments don’t appear to be doing much to preserve their childrens or even adults minds, hearts, eyes and ears from evil, unacceptable content on the internet.


There seems to be a total lack of understanding of ‘evil and unacceptable’ and its definition and there seems to be no state or government efforts underway to prevent exposure to evil things on the internet. This again becomes a case for Internet Morality Engineering which would debate on evil and evil content and goodness and good content. A contentious issue which would definitely be debated heavily but do you want to not debate it today and have the children of the world
and children of your country or even your own kids to grow up with mental health problems ?

The hearts, minds and the spirits of the children of the world are at stake and Internet Morality should be debated heavily. The Internets governing laws around the globe should address Goodness, Evil, Morality, Good Content, Bad Content, childrens exposure, child mental health and youth mental health problems. This should not just be left to private individuals and small groups. It must be addressed in government forums at a large scale. It is important for the children of the world and the parents of the world and the future of the internet and the future of humanity to have some control exercised on the internets content.

The ears and the eyes are means of reaching the heart and the mind. It should be debated what the ears and the eyes of children, youth and even adults get exposed to on the internet. What is acceptable exposure and what is unacceptable exposure ? It is a difficult debate but is important.

I would advocate that all the content on the internet must be ‘tagged’ like the content on the TV. Content on TV is tagged G (General), PG (Parental Guidance) and M (Mature) etc. and there are tags on content on streaming services, movies and games. I believe the same should be done for everything on the internet. There should be a G switch for the internet where it can be enabled on a smartphone and a browser and access becomes restricted to G rated Internet much like
streaming services parental guidance mechanisms. This will require at source i.e. at server side programmatic enablement.

In addition a home and the community is a controlled exposure by parents for their kids. I would also advocate that much like this there should be a service available on smartphones and all internet platforms where a parent controls exposure to only allowed online acquaintances for their kids.

I would say that as there is IETF – Internet Engineering Task Force there should be an IMETF Internet Morality Engineering Task Force.

This is very very important. Keep in mind that not every parent is fully educated or fully literate or fully tech savvy to control content at user end. There should be at source support for Internet Morality Decisions.

Habib

Shuffling large amounts of data around is the new thing. A cloud there, a cloud here, another third cloud there ; all connected and giving Multi Cloud. But will you ever move a large amount of Data off a cloud ? Or on a cloud ? Or between 2 clouds ?

What is the Data Shuffling Cost between clouds ? Is there Data Shuffle Lock-in ?

Imagine Serverless Lock-in and Data Shuffle Lock-in. Consider this with Data Gravity and Data Sovereignty.

Consider Data creation location, Data transport mediums, Data handover location, Data processing location and results publishing location.

Consider the Stream Processing nature of Data results. Systems handling data at large velocity.

The concept that Data resides on storage places and that Data transfer takes time brings about the concept of Data Gravity. Wherever a GB is created it must be stored and transferred as well. This requires storage space and network bandwidth.  If as in 5G there are mobile phones and other electronics creating a lot of Data then Edge processing should be the way for some applications due to the created Data’s gravity. If all of the created Data needs to be transferred to a Data Center far far away for processing then that will require network throughput which is slow and costly. This would mean that some calculations, inferences and data processing will probably be done at the edge.  In addition if Data Sovereignty laws are present in the country then Data will need to be stored within that countries boundaries at the country edge.

But it is interesting to note that Google search results are Anti Data Gravity where text is entered and a result appears very fast. The data of the search item is small and processing is done at backend data centers that are both near the edge POP and far far away. If search text is small and can be handled without much concern of Data Gravity then can other things be done similarly? At what data size and at what network speed are other similar processing works possible for other applications.

There must be a balance between Edge Processing for some applications and Central Processing of some Data then. 

These concepts don’t seem new. I think in Computer Science Intel’s Multicore CPUs and Oracle’s distributed databases will have dealt with these already. It is only that geographical size has changed and instead of a KB it became an MB and then now it is GB. It is only that instead of a single circuit board and a local area network the network has grown larger. Some of the basic concepts of Storing, Retrieving, Interlocking, Delay, Latency, Queuing, Caching, Paging and Number crunching etc would probably be the same even now. 

Habib

Infrastructure as Code has two main sections to it. The first is running the code itself and executing the change onto the cloud platform. The second is maintaning the version control of the code. i.e multiple changes by multiple people to the code.

For executing the change one could use Ansible or Terraform.

From a high level you simply run an Ansible playbook while having changed the variables files to make changes to the environemnt. One could do this without getting too much into the way Ansible is working. Inside Ansible there are roles and tasks which divide the execution of the playbooks in a structured format so that writing complex playbooks is easier.

The second part regarding version control of the code is required because there are multiple people in a team making multiple changes to the Infra as Code variables. So for example one person could be adding a firewall rule subnet to one firewall and another person could be adding a firewall rule to another firewall. So if you imagine that all the firewall rule subnets are actually present in one variables file for all the firewalls then you need version control to coordinate these two changes to the file. This version control is done by Git and Bitbucket mostly and these are the two famous tools to maintain software code versioning.

This is definitely similar to what any large software system build and maintenance would require where multiple software developers are writing code and changing code in all sorts of code files at the same time so you need a version control system to maintain consistency. These have push and pull mechanism where when you make a change locally and push it onto the main file and at the master file you pull the change. It also has peer review mechanisms where other team members can review your code differences before they allow your code to enter the main repository.

To conclude, imagine you have 30 Azure VNETs (Network VRFs) and 30 Azure Firewalls in your product deployment. As people ask you regularly to make firewall rule changes and add and delete subnets it requires either manually going to the Azure web portal and making changes their or you could use Infrastructure as Code and make the changes via Ansible playbooks and git/bitbucket variable files.

This post will cover multicloud networking integration between multiple public clouds and on prem network. Imagine four clouds three being AWS, Azure and GCP and the fourth being the on prem private cloud which is basically a Data Center network.

All these four clouds will be glued together somehow and that glueing will be the multicloud scenario. The basic requirements would be to have switching, routing, firewalling and load balancing equipment present within the glueing network between the four clouds.

Switching would be present to trunk layer 2 between IP endpoints. Routing and routing protocols like BGP would be there to exchange the IP endpoints reachability information to populate routing tables and get the Nexthops.

IP planning would be involved in the sense that the On Prem and the three public clouds dont have duplicate conflicting IP address spaces and there aren’t two endpoints in the network which are generating packets with the same source IP address.

In essence if there a single routing table present in your environement which has routes for all three Public cloud endpoint subnets and also the routes for the on prem DC network then you have multicloud established.

Wherever this routing table exists from that location there will be Layer 2 swithcing links and trunks into the three clouds and On-Prem until the trunks reach the other routing tables within the clouds, be it Azure VNET routing table, AWS/GCP VPC routing tables or On-Prem DC Routing Tables.

This multi-cloud environment is somewhat similar to large Service Provider public internet networks we are all familiar with where each large SP can be considered a cloud in itself with routes being exchange with the other large SP i.e. similar to cloud routes over BGP.

The SP environment are mostly used for traffic passing through whereas in the multi-cloud enterprise environemnt there are Data Sources and Data Sinks in either the On-Prem or in the Public Clouds. There is also the difference that the glueing network in the middle will have firewalling too.

Lets say there is a new connection required to a VPC subnet in a AWS region. Firstly the layer 2 would be provisioned over the AWS Direct Connect either directly with AWS or with partners like Megaport. For the majority of the cases the on-prem device which connects to the direct connect service will be provisioned with a new VLAN.

Once this is done this layer 2 will be trunked to the on prem device where IP endpoint is provisioned and the routing table exists. This could be a firewall or a router. This is where the packets will decide on the next hops.

On-Prem firewall filtering is in the path where the different DMZ regions, different IP Subnets and L4 Ports are allowed or disallowed to communicate with each other. If the On-Prem device with the routing table containing the multi cloud routes is a firewall things are simpler in the sense that the firewall filters are present on the same device and the different clouds are treated as different DMZ zones.

This multicloud networking scenario is a routing environment which has multiple routing domains as spokes linked via a hub site. This hub site is the on-prem glueing routing table. There would be the addition of firewalling capability within this environment so as to be able to govern and allow/disallow traffic between these environments. Another addition could be a load balancer within the glueing on-prem environment.

This load balancer would spray traffic onto either on-prem DC subnets IP endpoint servers or onto the public cloud subnets housing cloud servers. This would mean that there will be public facing IPs which receive the traffic which is natted onto Private IPs and then it is loadbalanced onto the multiple server endpoints be it in Public clouds or in On-Prem DC.

So the load balancer would have the load balanced front end IP to Server IP bindings going towards either a public cloud endpoint or an on-prem endpoint. This would mean that the load balancer connects to the glueing routing table entity as well to send/receive traffic to server IPs.

This mix of route, switch, firewall, load balancer is an example of a typical multicloud network connecting multiple public clouds.

Habib

As fresh Pakistani engineers start leaving their country on Washington Accord visas one wonders whether back home Digital Policies are being framed which could be sealing their jobless fates.

Let’s check the numbers. If half of Pakistanis generate only 5 MB of Data in one day on government run Digital Pakistan then it would amount to 500 million MB in a Day. This is half petabytes per day. This will only keep growing. All this data, it’s processing and it’s related networking will possibly be run on equipment which will only add to the import bill if Pakistan doesn’t manufacture it’s own servers. It would also traverse imported networking routers and switches which would add to the import bills if Pakistan doesn’t manufacture it’s own network equipment. All of these would also be put in Data centers which could be using Racks and Cabling possibly all imported.

How many jobs will imported servers, imported switches, imported routers, imported racks, imported DC HVAC and imported Data Center cabling produce ? And what will be the import bill of these Digital Pakistan backend items ?

Another aspect of these imported items is their lack of Cybersecurity from a National Security perspective. If it’s imported and all plug and configure only with unknown hardware and unknown software it will be considered a black box and totally insecure in terms of Cybersecurity.

A further aspect of these imported items is that each item comes with support contracts in case they fail and have a problem. These are very expensive support agreements with their manufacturers and will add to running cost and yearly import bills.

Now consider that a while back the aeronautical complex in Risalpur launched its own tablet, the PAC PAD Takhti 7. https://en.m.wikipedia.org/wiki/PAC-PAD_Takhti_7. How did that happen and why can’t we make our own Digital Pakistan equipment. How is it possible that Pakistan can make parts of JF-17 thunder and indigenously manufacture multiple types of missiles and also make a nuclear bomb but not make it’s own servers, routers, switches, DC HVAC and DC Cabling ?

Much of these IT equipments are now open sourced. Servers, Routers and switches under OCP and there is MIPSOpen and multiple open source Network Operating Systems. Positive results are really possible in case solid effort is made for local manufacturing.  At least Cybersecurity mandates that the Hardware assembly and Software assembly and their System Integration is carried out within Pakistan. This will create Jobs and reduce the import bills too.

Let’s hope for the best.

This post seeks to distinguish between the multiple aspects and phases of networking projects. Network Architecture and Network Design are the phases of a networking project carried out first. Then comes the Project Implementation phase along with configurations by Network Engineers.

Some experts have included an Analysis phase as part of, or before, the Network Architecture phase. The concepts being that first an analysis needs to be done on the flows expected from the new network.

Before Network Architecture the Analysis phase consists of gathering the User Requirements, Application Requirements, Application Types, Performance Requirements, Bandwidth Requirements, Delay Requirements etc. After gathering these requirements a Customer Requirements Document (CRD) can be made consisting of all the expectations and requirements from the network. This document will assist with project management throughout the network life cycle and for sufficiently large projects its a good exercise.

Once the requirements are gathered a Flow Analysis can be done to identify the flows required from the network. Data Source and Data Sinks, Critical Flows and per Application flows etc. are analyzed as part of Flow Analysis exercise.

Once the requirements are known and flows are known this can lead to decisions regarding the Network Architecture. The Network Architecture term is generally used with the Network Design term as one but according to one definition it is distinguished from Network Design such that the Architecture consists of the technological architecture while the design consists of specific networking devices selected and vendors selected for the architecture to be implemented on ground. This means, for example, that the Network Architecture will deal with whether to use OSPF or ISIS and how to use them and the Network Design will cover which specific vendor router to use. They are closely linked.

Once the flows are known it can be discussed what the architecture can be. This will consist of primarily deciding the protocols, the addressing and the routing architecture which can be used to facilitate the required flows. Once it is decided which network technologies to use for the flows (such as OSPF, ISIS, MPLS, L2VPN, L3VPN, IPSec, BGP, Public Internet, VXLAN, EVPN, Ethernet etc) a diagram can be made of the architecture. Multiple iterations and permutation of the various architectures will come forward from the discussions over what the architecture could be to facilitate all the flows and provide a resilient network. For each of the protocols listed above, and any other to be used, the clogs available in each can be discussed in detail. It can be discussed and decided regarding how the combinations of multiple protocols will be used to meet all the flows and meet the requirements from the network. If there are cloud connectivity requirements it will be discussed how (which protocol) and where to connect to the cloud. Once an architecture is decided and protocols are selected and the tools within the protocols which are to be used are listed then they can be summed up in a document and in diagrams.

After this phase comes the Design decisions phase. This is close to the architecture phase but this is where the vendor of that OSPF router is selected. This is where the specific router is selected from the multiple router offerings available from the selected vendor. Device vendor selection and specific device selection is a task of its own and is a separate effort in networking projects.

Also as part of the Design it will also be decided which Service Provider to use for Internet and WAN links. It will be decided which service offering will be used from the SP Vendor. If the application and system contain Public Cloud use (including Hybrid On-Prem) than it will be decided which specific connectivity mechanism and location the cloud will connect to. Will it be IPSec over Internet or over Direct Connect and where and how. Will it be the biggest MPLS VPN provider on the market or the smaller one. Will it be the biggest BGP Internet Transit provider or the smaller one.

Once the requirements are known; Once the flows are knows ; Once protocols and architecture is known ; Once the device vendors and device type and SP offerings are known and once all of these are selected than comes the implementation phase.

Engineering is a broad term which can encompass all of the above and more but as things stand here we can say that a Network Engineer as part of the engineering phase will configure and deploy the devices, configure and deploy the WAN links, configure and deploy the Internet links, configure and deploy the cloud connectivity VPNs and configure and deploy the interconnections in the network. This network engineering implementation effort is after the Requirements/Flows/Arch/Design phase as its an effort on ground and on site to implement the network and make things run. Up until this phase all the previous phases were on paper and this one is on ground practical work.

The previous Requirements/Flow/Protocols Architecture/Design and even initial aspects of the engineering phase can be done in office in meeting rooms. Initial aspects of engineering phase consisting of configurations and parameters to be used can be also decided before going out in the field. Once on ground and on site implementation starts than this is an effort of its own and can be considered as Project Deployment and Project Implementation. It entails device delivery, WAN link delivery, device power on, WAN link testing, Internet Link testing, Cloud VPN delivery, configurations and testing etc. This is a phase of its own and is an effort which is more akin to technical project management as well as it is more of an on ground project coordination and project management effort too. This is because of its physical, geographical and on site implementation aspects.

Depending on the type of project the implementation phase can consist of outage windows and maintenance windows and a lot of coordination to implement the new devices and new links.

Hence we can say that a networking project consists of separate requirements gathering, flows analysis, architecture, design and implementation phases. This means that a networking project can be divided into smaller multiple projects each consisting of these above phases. Each phase also requires a skill of its own. For example the Requirements, Flow Analysis, Architecture and Design phases are generally handled by Network Architects, Solution Architects and Network Design Engineers. The configuration and deployments aspect is handled more by Network Engineers and the Project implementation and coordination efforts are handled by Project Managers.

Multiple and simultaneously such large scale projects having all these phases going on at various levels would be run under a Program given the size of the organization is sufficiently large and that there are multiple streams of such projects being carried out.

I hope you enjoyed the good read.

Happy networking.

Habib

In Networking, Cybersecurity should now be Distributed Cybersecurity. Distributed Cybersecurity is where each organization and each home is independently capable of monitoring its traffic and inspecting it on demand.

This paradigm is where the Wifi AP at home is capable of sending its traffic in-out Data to somewhere within home and where the AP’s and Routers in an office setup are capable of sending their in-out information to somewhere within the local office.

This requires Netflow capable Wifi Access Points and Netflow capable CE Edge routers. It also requires a setup of a Netflow receiving station where the information sent by Wifi Access Points and Cisco routers in Netflow language will be received and displayed.

In Segment Routing the concept of source routing is present:


In computer networking, source routing, also called path addressing, allows a sender of a packet to partially or completely specify the route the packet takes through the network. In contrast, in conventional routing, routers in the network determine the path incrementally based on the packet’s destination

Wikipedia

In prevalent IP networks per-hop lookup is performed based on the single primary destination address in the packet. Consider a situation where a stack of IP addresses is present per packet and needs to be processed by the intermediate routers. There would be a requirement from the hardware in the line cards. In this hypothetical situation how deep of a stack of addresses can be processed by the router chipsets and hardware ?

Similarly the SID Depth or the Maximum SID Depth is a parameter in segment routing enabled network devices. To route from a ingress to an egress the path selected by the source should be entirely capable of handling the number of SIDs ( MPLS Labels in SR MPLS) that are pushed onto the packet. Because the path selected by the source is in effect translated into a stack of labels (in SR MPLS) therefore the number of labels that the each device in the path can handle is an important design consideration.

Also, in Segment Routing MPLS the SID i.e. the labels are distributed via the IGP. So an end to end path label stack is supposed to be either in a single IGP area or if multiple routing areas or domains are required then some tricks are required to push and handle a label that is not distributed by the IGP. Lets see now: An external entity will need to program the ingress source node to push a stack of labels which includes a label not distributed by the IGP. This being the source there will be a resultant intermediate destination where at some point on some hop a label will be popped and the next label will be not have been learnt via the IGP.

In some way the burden of end-to-end connectivity over multiple hops is being shifted from the distributed IP routing control plane into a central label stack distribution authority.

I wonder if where we had IP Planning and IP configurations we will have label planning and label configurations.

Shifting a portion of the intelligence present on distributed nodes to a central authority.


Information is present in computing platforms in two forms.

– Bits that are stored
– Bits that are traveling and transitioning

Securing bits that are stored and bits that are traveling and transitioning is a task.

These two forms present their own challenges but the bits that are traveling and transitioning i.e. changing forms within the computing platforms have acquired special attention. This is due to the prevalent pervasive communications using information technology computing platforms within society and businesses. When bits transition and travel they are also stored and retrieved from storage so securing both is important.

The only mystery surrounding the field of security is the presence of the all so many interaction surfaces between hardware layers and software layers through which transitions and traveling of bits occurs. From seeing text on the screen with ones eyes to thinking and considering it to thereafter editing it via hands there exists industries working within the human body which occur without us contemplating over them. There are interaction surfaces with the body as well. With muscular, neural, skeletol, etc working together to name a few.

Within computing platforms as the bits transition back and forth within one component i.e. one isolated CPU, RAM, HardDisk, Operating System and Application Software they present their own security challenge. When instead of isolation the bits travel between 2 such computing systems they present a different set of challenges. When there exists industrial scale, constant, consistent, ongoing back and forth travel and transitioning within milliseconds over large geographies between hundreds and thousands of components of various types it presents a completely different set of challenges.

Interaction surfaces are where bits change hands between subsystems. For example bits changing hands between the operating system and an application running on it or bits changing hands between one PC and another PC over a network. Interaction surface is when one subsystems surface interacts with another subsystems surface within the larger system and bits run. As the field of information technology and computing has evolved and progressed the number and types of subsystems, their surfaces and their interactions has increased a lot. So much so that securing them has become complicated. Wholesome security is therefore achieved when every time bits change hands i.e. transition and travel the interaction is secure. It is secure in the form that the storage at each end of change of hands is secure and the medium of exchange is secure.

Now it is simple to state in general english that when one subsystem interacts with another subsystem and bits change hands the storage points at each end and the medium used for the interaction and travel should be secure. Given timescale and geographical scale when it comes to reality the shear number and types of subsystems, the number and types of storage locations and the number and types of exchange mediums is so large that encompassing all of them becomes difficult.

Another incision into the security domain is cut deep into the system when the human computer interaction surface appears at various locations and in various forms. This increases the complexity of the whole security domain. Bit to Human interaction surface also needs to be kept secure at each interaction, at each geographical location and every time.

Furthermore another aspect is when one secure system under the ownership of one entity interacts with another system owned by another entity. This is therefore a time when bits are changing hands amongst different owners of them. The time and location of such an interaction surface presented between two separate ownerships also increases complexity. As your bits are stored under the ownership of another entity and accessed and retrieved by other people a whole system of management is required for such inter-ownership bit storage and bit travel interaction surfaces.

I guess a chart showing the whole variety of interaction surfaces within computing would demystify security. The reason for this is that each entry in the chart i.e. each interaction surface would be simply mapped to the precaution and action required for securing it. Each type of interaction surface would require a security precaution and actionable item within the security framework.

Be it an interaction surface where bits are:
– stored in hardware
– being processed by one set of software
– within one computer
– on a server
– in an application
– traveling over a network
– interacting with humans
– being exchanged between different humans
– being exchanged between different entities

This is a copy of a previous Linkedin Post Dated June 7 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/opnfv-brahmaputra-systems-integration-nfv-vnfs-lutfullah-kakakhel/

OPNFV Brahmaputra is a Lab ready release of OPNFV. One statement is that community driven Systems Integration really is a hard task to accomplish. This becomes especially true if the systems being integrated to form a larger system are actually multiple large open source projects themselves.

To start with OPNFV aims to integrate systems upon which VNFs can be run.

The caption above is heavy. On the one side there is the requirements generating standards bodies block of organizations which produce specifications and define how the system is to run. On the other side there are the code producing development projects which produce open source projects. OPNFV stands in the middle and intends to integrate these individual code projects according to the requirements laid out by the standard bodied and provide a system on top of which VNFs can be run and tested. The reason this task is being run under an umbrella membership based organization such as OPNFV is because it is a repetitive task which every organization will need to do over and over again as soon as new releases of codes are made available for the individual projects.

It might be difficult to picture this to start with but imagine you want to have a lab ready to run and test VNFs. What is the lab composed of? It will have Infrastructure on top of which VNFs can be run. What is this Infrastructure composed of? This Infrastructure will be composed of hardware and a virtualisation layer and hypervisors and networking projects such as OpenDaylight and Openstack and KVM and Ceph all running together to provide a block of Infrastructure virtual compute network and storage (An NFVI Point of Presence) on which VNFs can be run.

Every organization which wants to reach the level of testing VNFs will need such a lab. And then what happens when a new version of OpenDaylight is released or a new version of Openstack is released or KVM or Ceph? Everybody needs to update their labs. OPNFV is a Linux Foundation project which intends to be the focal point of these activities and perform them jointly instead of everybody doing them individually.

It also helps make the system work. A patch to OpenDaylight could work well within OpenDaylight but could break things at System layer when integrating with the rest of the components which make an NFV lab (to be used to runs VNFs). OPNFV aims to be the first systems layer at which point such patches can be spotlighted and returned to the project they came from informing them that at the system levels things get disjointed.

OPNFV according to its initial white paper aims to make this systems testing environment in line with the NFV Architecture References points of Vi-Ha, Vn-Nf, Nf-Vi, Vi-VNFM & Or-Vi.

After the above is clear the figure below can be understood to be a larger system composed of individual projects integrated together with the aim of running VNFs. In the figure below OpenDaylight is one piece (in network), KVM is another piece (in compute), Openvswitch is another piece and Openstack is also one piece. All these when put together provide the infrastructure to run VNFs. Also to be noted is that in the case of OPNFV there are community labs (Pharos Labs) which provide the hardware.

The presence of this combined effort also means that for Network Operators the differentiation in the market is in Service Orchestration. The Virtual Network Functions and the Network Services run on top of them.

 

References:

http://www.etsi.org/technologies-clusters/technologies/nfv

http://www.slideshare.net/CiscoDevNet/devnet-1162-opnfv-the-foundation-for-running-your-virtual-network-functions

https://www.opnfv.org/brahmaputra

http://www.slideshare.net/OPNFV/opnf-brahmaputra-an-early-look

https://www.opnfv.org/sites/opnfv/files/pages/files/opnfv_whitepaper_092914.pdf

https://www.youtube.com/watch?v=Dh55McgHGQ8

This is a copy of a previous Linkedin Post Dated June 7 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-mwc-2016-syed-habib-lutfullah-kakakhel/

ETSI showcased a practical implementation of NFV at the Mobile World Congress 2016. They showed the whole NFV Architecture being implemented and run to provide a SIP voice call. An end to end communication service of a SIP call was made based on a vIMS platform. This vIMS is an NFV VNF orchestrated by a NFV Orchestrator run on top of Infrastructure controlled by an Openstack based VIM. Let’s see the components and how they made the NFV based SIP voice call.

There are two NFVI PoPs (Points of Presence) or two VIMs. One is Openstack controlled and the other is controlled by openvim (part of OpenMano package). Both are controlled by the Open Mano NFVO for resource orchestration. The Service Orchestration is performed by Ubuntu’s Juju.  The launchpad of Rift.io is used as triggering mechanism for resource orchestration and service orchestration. 6wind provides the PEs showcasing corporate VPN interconnectivity. Telefonica provides the traffic generator to test the bandwidth capacity of the PE links and Metaswitch provides the VNF vIMS Clearwater for being run atop the infrastructure.

The figure below shows details:

A multi-site corporation’s network is shown to be running connected via 3 PEs. One site which is connecting to PE 3 has the VNF deployed in VIM2 which is another Data Center. One NFVI PoP labelled VIM 1 is hosting the 6wind PEs while the second NFVI labelled VIM 2 is hosting the VNF. There is interDC communications going on between the two NFVI PoPs. The figure below shows the SIP voice calls communication logical path. The IMS protocols SIP signaling is implemented in VIM 2 in the Metaswitch Clearwater vIMS.

More details can be seen here.

ETSI’s new initiative is delivering an open source NFV Management and Orchestration software stack which is set take away attention from the MANO and turn it into a given piece of software. This puts more focus on the VNFs. The message could be that Service Orchestration using VNFs are therefore to be the focus of attention for Telco organizations.

References:

https://osm.etsi.org/

http://www.etsi.org/technologies-clusters/technologies/nfv

https://networkbuilders.intel.com/docs/E2E-Service-Instantiation-with-Open-Source-MANO.pdf

https://www.youtube.com/watch?v=JJlxwJStkTk

This is a copy of a previous Linkedin Post Dated June 7 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-telco-vepc-solutions-syed-habib-lutfullah-kakakhel/

In telecom networks the option to place an LTE vEPC stands out as an exemplary demonstration of NFV’s application. The figure below gives the generic NFV Architecture. It be divided into 3 main sections:

  1. The Management and Orchestration
    1. Consists of NFV Orchestrator, VNF Manager and Virtualization Infrastructure Manager.
  2. The NFVI – NFV Infrastructure
    1. Consists of Hardware, Virtualization Layer and Compute, Storage, Network Virtualization Software
  3. The VNFs – i.e. the virtual network functions.

The type of function the VNF provides shows what this NFV network delivers. That is if the NFV network delivers as a Network Service an LTE Core end to end communication then there will be EPC functionality implemented and provided by the VNF part of the NFV network.

See the figure below from an ETSI Proof of Concept work.

It shows the vendor CYAN providing NFV Orchestrator (NFVO) and VNF Manager (VNFM). Redhat and Openstack provide the Virtualized Infrastructure Manager (VIM). The figure also shows the relevant Infrastructure hardware and hypervisor software solutions. Finally it shows the VNFs as being Connectem’s vEPC. If the VNF was implementing a different functionality say it was a vIMS then the rest of the components in the figure could be the same and the end to end Network Service being provided by the NFV network would be different. Therefore the function implemented inside the VNF defines what service the NFV network provides. Therefore the work done by the VNF decides whether your NFV network is Telco or Enterprise; LTE or WiMAX; LTE or 3G etc.

For a list of possible Telco VNF’s see the figure below.

Every chuck of functional blocks could be implemented together as a VNF. So what Connectem is doing is implementing the LTE MME, SGW, PGW, HSS, PCRF functionality and packaging it as a VNF which can be run atop virtualized infrastructure. In the ETSI Proof of Concept, Connectems solution therefore does this according the NFV specifications so that the VNF can be managed by a VNFM and its infrastructure is composed so as it can be managed by a VIM and all this can be controlled and coordinated by an NFVO i.e. NFV Orchestrator. Therefore you get an LTE EPC functionality inside the virtualized NFV environment.

The ETSI definition of a VNF is that a VNF is “a Network Function capable of running on NFV Infrastructure (NFVI) and being orchestrated by a NFV Orchestrator (NFVO) and VNF Manager”.

Coming back to our vEPC example the VNF has components. These VNF Components (VNFCs) can logically be pictured as below:

ETSI mandates that a vendor can choose to implement components as they wish inside the VNF environment as long as they speak to the other NFV architecture components as per their defined VNF interfaces. This means that the different components can utilize efficient compute storage and networking procedures instead of the standards body defined communication methodology. An example is that inside the vEPC software the MME will communicate with the SGW but will utilize efficient computational methodology instead of the 3GPP defined interfaces. If for some reason (say a lab environment) a vendor chooses to implement the 3GPP interfaces inside their vEPC it won’t be as fast and as efficient but it can be used to showcase 3GPP communications inside NFV.

Good VNFCs software design is what will distinguish different providers of vEPC software solutions.

References:

http://www.etsi.org/technologies-clusters/technologies/nfv

https://www.opennetworking.org/images/stories/sdn-solution-showcase/germany2015/CENGN%20-%20NFV-based%20LTE%20Core%20in%20the%20Cloud.pdf

http://nfvwiki.etsi.org/images/NFVPER%2814%29000010r2_NFV_ISG_PoC_Proposal-E2E_vEPC_Orchestration.pdf

This is a copy of a previous Linkedin Post Dated May 16 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-mano-management-orchestration-syed-habib-lutfullah-kakakhel/

MANO is the brain of the NFV Network. It is the part of the network through which control operations are performed on virtual network functions and virtual network functions infrastructure.

One set of v-eNB, vMME, vSGW, vPGW, vPCRF can be assumed to be a Network Service. Each of the above v’s provide distinct Network Functions which with the v’s are deployed as Virtual Network Functions on Virtual Network Functions Infrastructure. The Virtual Network Functions Infrastructure is hardware with the virtual abstraction layer providing virtualization. These are the acronyms.

Multiple virtual network functions are connected together, or chained together, to provide a network service. The physical links are in the infrastructure which is the compute/storage hardware equipment while the logical links are among the VNFs. The endpoint is the Network Service endpoint which is providing service to the end devices.  Between the physical links and the logical links sits the virtualization layer.

The NFVO i.e. the NFV Orchestrator is the part of the network which controls the deployment and operations of virtual network functions.

This is a copy of a previous Linkedin Post Dated May 24 2016 which was not present on this Blog.

https://www.linkedin.com/pulse/nfv-independance-from-hardware-lock-in-lutfullah-kakakhel/

NFV is simple. It’s most simplistic distinction is that it is the Telecom Operators name for hardware independence and software dependence. Hardware is locked in while software is more easily changed (a project manager would say: relative to hardware that is).

We can try to see what problem NFV seeks to solve.

Telecom operators faced a dilemma about hardware. “To launch a new network service often requires yet another variety (of hardware) and finding the space and power to accommodate these boxes is becoming increasingly difficult; compounded by the increasing costs of energy, capital investment challenges and the rarity of skills necessary to design, integrate and operate increasingly complex hardware-based appliances.”

The sentence starts with “to launch a new network service often requires yet another variety” (of hardware). Remember they want to compete with the Whatsapp’s and Viber’s of tomorrow and need agility of deployment.

NFV seeks to provide that ‘Agility of Deployment’ of new network services to Network Operators by taking away dependency on proprietary and vendor locked in hardware. That is the high level purpose.

The rest is architecture. Hardware can be any compute(r) node with associated storage (types) and an accompanied (inter)network of such devices.  Then it follows to make virtual services; Virtual Network Functions with Virtual Network Infrastructure.

To roll out software or new software for a new service is easier than to roll out hardware.

Another primary benefit is elasticity in energy consumption. Energy consumption according to demand. With more control of hardware, which is the energy consuming physical device, via dependence on software this is made possible.

Providing Layer 2 VPN and Layer 3 VPN services has been a requirement of enterprises from Service Providers. Similarly Data Center networks need to provide Layer 2/3 Overlay facility to applications being hosted.

EVPN is a new control plane protocol to achieve the above . This means it coordinates the distribution of IP and MAC addresses of endpoints over another network. This means it is has its own protocol messages to provide endpoint network addresses distribution mechanism. In the Data Plane traffic will be switched via MPLS Labels next hop lookups or IP next hop lookups.

To provide for a new control plane with new protocol messages providing new features BGP has been used. So it is BGP Update messages which are used as the carrier for EVPN messages. BGP connectivity is first established and messages are exchanged. The messages exchanged will be using BGP and in them EVPN specific information will be exchanged.

The Physical layer topology can be a leaf spine DC Clos fabric of a simple Distribution/Core setup. The links between the nodes will be Ethernet links.

One aspect of EVPN is that the terms Underlay and Overlay are now used. Underlay represent the underlying protocols on top of which EVPN runs. These are the IGP (OSPF,ISIS or BGP), and MPLS (LDP/SR).  The underlay also includes the Physical Clos or Core/Distribution topology which has high redundancy built into it using fabric links and LACP/LAGs. The Overlay is the BGP EVPN vitual topology itself which uses the underly network to build a virtual network on top. It is the part of the network which related to providing tenant or vpn endpoints reachability. i.e. MAC address or VPN IP distribution.

It’s a new protocol and if you look at the previous protocols there is little mechanism to provide all active multihoming capability. This refers to one CE being connected via two links to two PEs and both links being active and providing traffic path to far end via ECMP and Multipathing. 2 Chassis multichassis lag has been one option for but it is proprietary per vendor and causes particular virtual chassis link requirement limits. Ingress PE to multiple egress PE per flow based load balancing using BGP multipathing is also newly enabled by EVPN.

There is also little mechanism in previous generation protocols to provide efficient fabric bandwidth utilization for tenant/private networks over meshed-style links. Previous protocols provide single active and single paths and required LDP sessions and tunnels for full mesh over a fabric. MAC learning in BGP over underlay provides this in EVPN.

Similarly there is no mechanism to provide workload (VM) placement flexibility and mobility across a fabric. EVPN provides this via Distributed Anycast Gateway.

 

Two edge computing specifications are present.

Facebook OCP’s CG-OpenRack-19 and LinkedIn’s Open19.

They provide for Rack Layouts, Compute, Storage and Networking.

Networking for CG-OpenRack-19 is copied below. The servers sleds in the pictures appear to be single homed as per the colors. It would be interesting find out which protocol handles the Active Active state of the multi homed Compute and Storage Sleds if that is at all present.

OpenRack-19

Given here

Open19 gives 100G bandwidth capabilities and some details are on its website.

5G’s edge ultra low latency requirements would could require edge solutions and it would be interesting to see how things play out ahead.

This also brings to mind SD-WAN because these edge racks will be at least connected in a large WAN.

Google’s B4 is one of its software defined inter data denter WAN solution. Google’s Espresso is its peering edge solution. Espresso links into B4 domain via B2. This link has the details of Espresso as shared by the Google team.

 

Google-Espresso-B4.JPG

Google is not employing an army of networking engineers to run these because they are software defined and programmed bots will probably be doing operational tasks. To operate this network there are Site Reliability Engineers though.

Here is one public job advertisement that relates as to what an SRE is expected to be like:

We have reliable infrastructure and can spin up new environments in a couple of hours. Automate everything so there is more time for exploring and learning. Foster the DevOps mindset

What are our goals?
  Internationalisation
  Deploying multiple data centers
  Deploying every 5 minutes
Requirements
  Experience with Java or JavaScript in a Dockerised environment
  Linux Engineering/Administration
  Desire for improving processes
  Have a passion and most importantly, a sense of humour
Tech Stack (you DO NOT need experience in all of these)
  Kubernetes + Docker
  Terraform + Ansible
  Linux
  Kotlin + NodeJS
  ELK stack
  AWS

This is obviously an SRE for the servers side and the application enablement side of things. If there is a large software defined edge network like Espresso and a large Edge-to-DC network like B2 and a large software defined inter-DC network like B4 you will need a different SRE.

Here is Google’s version of a Site Reliability Engineer Job.

Job description
Minimum Qualifications

BS degree in Computer Science or related technical field involving coding (e.g. physics or mathematics), or equivalent practical experience.
3 years of experience working with algorithms, data structures, complexity analysis and software design.
Experience in one or more of the following: C, C++, Java, Python, Go, Perl or Ruby.

Preferred Qualifications

Systematic problem-solving approach, coupled with effective communication skills and a sense of ownership and drive.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Ability to debug and optimize code and automate routine tasks.

About The Job

Hope is not a strategy. Engineering solutions to design, build, and maintain efficient large-scale systems is a true strategy, and a good one.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google’s services—both our internally critical and our externally-visible systems—have reliability and uptime appropriate to users’ needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.

SRE is also a mindset and a set of engineering approaches to running better production systems—we build our own creative engineering solutions to operations problems. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.

We can see that Google’s SRE Job Ad is all software along with large scale distributed systems requirements.

Now if we note this extract from the Wikipedia SD-WAN article:

“With a global view of network status, a controller that manages SD-WAN can perform careful and adaptive traffic engineering by assigning new transfer requests according to current usage of resources (links). For example, this can be achieved by performing central calculation of transmission rates at the controller and rate-limiting at the senders (end-points) according to such rates”

and we also note this extract:

“As there is no standard algorithm for SD-WAN controllers, device manufacturers each use their own proprietary algorithm in the transmission of data. These algorithms determine which traffic to direct over which link and when to switch traffic from one link to another. Given the breadth of options available in relation to both software and hardware SD-WAN control solutions, it’s imperative they be tested and validated under real-world conditions within a lab setting prior to deployment.”

We see Algorithms.

Its clear that there are different algorithms running these Software Defined networks (Google’s software defined Espresso, B2, B4 and Jupiter). These algorithms automate, kick in and optimize. Google becomes a large scale distributed system with various algorithms here and there. While Software Architects and Software Engineers will have developed these algorithmic nodes and programmed them into network devices/servers an SRE is the human who will operate the system. A team of SREs.

One aspect of Networking protocols is that they are for a multi-vendor, multi-enterprise and multi-domain environments. They provide simple consensus to connect two or more different network devices.

To take a merchant silicon network device like OCP’s Wedge and OCP style servers and make one large network like Google out of it will require software engineering to remake the NOS (Network Operating Systems) part at least. There will be atleast a Meta-NOS, somewhat running on top of a typical NOS which would handle the SDN – software defined algorithms. In addition to the SDN controllers talking to this Meta-NOS. Multiple layers of SDN controllers will be talking to each other and you can call this a network protocol or an SDN algorithm but it will be part of distributed systems software architecture and it will be programmed in place by software engineers.

Large Scale Distributed System on Merchant Silicon Hardware – Software Defined Meta-NOS – SDN Controllers – Hierarchical SDN Controllers – Algorithms.

Sounds like a Program Management task instead of PMP scale Engineering Project Management task. You will need Mathematicians to sit with Network Architects, Distributed Systems Architects and Software Architects. The Mathematicians will do give the algorithms. They will be important too.

Fun times.

Terabit scale networking requires better Consensus.

Autonomous Networks and Autonomic Networking can be renamed as solving Consensus Dynamics.

Wikipedia States (Nov’ 2018):

“Consensus dynamics or agreement dynamics is an area of research lying at the intersection of systems theory and graph theory. A major topic of investigation is the agreement or consensus problem in multi-agent systems that concerns processes by which a collection of interacting agents achieve a common goal. ”

To note again it is an ‘intersection of systems theory and graph theory’.

Lets not forget that mathematically communications networks are Graphs. An OSPF/ISIS network is a weighted directed graph where the costs & metrics are the weights, the network devices are vertices and the ethernet L2 links are directed edges.

Furthermore, to note again that ‘ a collection of interacting agents achieve a common goal’. In networks the common goal can be to enable end to end, host to host connectivity over a vast network. TCP and UDP.

Interesting times ahead for Terabit scale networks. Keeps the fun alive in network engineering.

References:

https://en.wikipedia.org/wiki/Consensus_dynamics

Click to access EncyclopAI07.final.pdf