Archive

Site Reliability Engineering

Two edge computing specifications are present.

Facebook OCP’s CG-OpenRack-19 and LinkedIn’s Open19.

They provide for Rack Layouts, Compute, Storage and Networking.

Networking for CG-OpenRack-19 is copied below. The servers sleds in the pictures appear to be single homed as per the colors. It would be interesting find out which protocol handles the Active Active state of the multi homed Compute and Storage Sleds if that is at all present.

OpenRack-19

Given here

Open19 gives 100G bandwidth capabilities and some details are on its website.

5G’s edge ultra low latency requirements would could require edge solutions and it would be interesting to see how things play out ahead.

This also brings to mind SD-WAN because these edge racks will be at least connected in a large WAN.

Google’s B4 is one of its software defined inter data denter WAN solution. Google’s Espresso is its peering edge solution. Espresso links into B4 domain via B2. This link has the details of Espresso as shared by the Google team.

 

Google-Espresso-B4.JPG

Google is not employing an army of networking engineers to run these because they are software defined and programmed bots will probably be doing operational tasks. To operate this network there are Site Reliability Engineers though.

Here is one public job advertisement that relates as to what an SRE is expected to be like:

We have reliable infrastructure and can spin up new environments in a couple of hours. Automate everything so there is more time for exploring and learning. Foster the DevOps mindset

What are our goals?
  Internationalisation
  Deploying multiple data centers
  Deploying every 5 minutes
Requirements
  Experience with Java or JavaScript in a Dockerised environment
  Linux Engineering/Administration
  Desire for improving processes
  Have a passion and most importantly, a sense of humour
Tech Stack (you DO NOT need experience in all of these)
  Kubernetes + Docker
  Terraform + Ansible
  Linux
  Kotlin + NodeJS
  ELK stack
  AWS

This is obviously an SRE for the servers side and the application enablement side of things. If there is a large software defined edge network like Espresso and a large Edge-to-DC network like B2 and a large software defined inter-DC network like B4 you will need a different SRE.

Here is Google’s version of a Site Reliability Engineer Job.

Job description
Minimum Qualifications

BS degree in Computer Science or related technical field involving coding (e.g. physics or mathematics), or equivalent practical experience.
3 years of experience working with algorithms, data structures, complexity analysis and software design.
Experience in one or more of the following: C, C++, Java, Python, Go, Perl or Ruby.

Preferred Qualifications

Systematic problem-solving approach, coupled with effective communication skills and a sense of ownership and drive.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Ability to debug and optimize code and automate routine tasks.

About The Job

Hope is not a strategy. Engineering solutions to design, build, and maintain efficient large-scale systems is a true strategy, and a good one.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google’s services—both our internally critical and our externally-visible systems—have reliability and uptime appropriate to users’ needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.

SRE is also a mindset and a set of engineering approaches to running better production systems—we build our own creative engineering solutions to operations problems. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.

We can see that Google’s SRE Job Ad is all software along with large scale distributed systems requirements.

Now if we note this extract from the Wikipedia SD-WAN article:

“With a global view of network status, a controller that manages SD-WAN can perform careful and adaptive traffic engineering by assigning new transfer requests according to current usage of resources (links). For example, this can be achieved by performing central calculation of transmission rates at the controller and rate-limiting at the senders (end-points) according to such rates”

and we also note this extract:

“As there is no standard algorithm for SD-WAN controllers, device manufacturers each use their own proprietary algorithm in the transmission of data. These algorithms determine which traffic to direct over which link and when to switch traffic from one link to another. Given the breadth of options available in relation to both software and hardware SD-WAN control solutions, it’s imperative they be tested and validated under real-world conditions within a lab setting prior to deployment.”

We see Algorithms.

Its clear that there are different algorithms running these Software Defined networks (Google’s software defined Espresso, B2, B4 and Jupiter). These algorithms automate, kick in and optimize. Google becomes a large scale distributed system with various algorithms here and there. While Software Architects and Software Engineers will have developed these algorithmic nodes and programmed them into network devices/servers an SRE is the human who will operate the system. A team of SREs.

One aspect of Networking protocols is that they are for a multi-vendor, multi-enterprise and multi-domain environments. They provide simple consensus to connect two or more different network devices.

To take a merchant silicon network device like OCP’s Wedge and OCP style servers and make one large network like Google out of it will require software engineering to remake the NOS (Network Operating Systems) part at least. There will be atleast a Meta-NOS, somewhat running on top of a typical NOS which would handle the SDN – software defined algorithms. In addition to the SDN controllers talking to this Meta-NOS. Multiple layers of SDN controllers will be talking to each other and you can call this a network protocol or an SDN algorithm but it will be part of distributed systems software architecture and it will be programmed in place by software engineers.

Large Scale Distributed System on Merchant Silicon Hardware – Software Defined Meta-NOS – SDN Controllers – Hierarchical SDN Controllers – Algorithms.

Sounds like a Program Management task instead of PMP scale Engineering Project Management task. You will need Mathematicians to sit with Network Architects, Distributed Systems Architects and Software Architects. The Mathematicians will do give the algorithms. They will be important too.

Fun times.

Event Driven Network Automation is a term used to describe what large scale NetOps teams are doing to scale, deploy and manage networking infrastructure.

YAML data formatting and Jinja2 templating with Python glueing and executing.

Ansible/YAML and Netconf/API for configuration, execution operations.

Event Generation using SNMP/Telemetry/BGPMon.

BGPMon looks like it could be used to check up on changes in a Routed Core with BGP based Leaf-Spine Clos Fabric.

Zero Touch Provisioning – ZTP is best suited for quickly bringing up new devices.

An Orchestration-style GUI layer custom made for every domain in the network would definitely be required as well for various aspects of NetOps.

There can be human driven network automation but there can also be event driven network automation which can be termed as ‘closed loop’ with rule based actions defined by humans.

The events driven, closed loop, rule-based-actions execution layer would then be managed by humans. This layer would be evolving and to manage it there would be a requirement of necessary data structuring and scripting skills in addition to being mindful of what the impact is on the network layer (DC or WAN, both).

References:

https://mirceaulinic.net/2017-10-19-event-driven-network-automation/

Click to access 17-RIPE76_-Event-driven-network-automation-and-orchestration.pdf

Network Automation: Template Configurations with Jinja2 and YAML

https://packetpushers.net/back-journey-network-automation-introduction/

https://packetpushers.net/back-journey-network-automation-part-1-zero-touch-provisioning/

https://packetpushers.net/back-journey-network-automation-part-2-ansible/

https://www.ipspace.net/Building_Network_Automation_Solutions

What does a Site Reliability Engineer do?

 

Site Reliability Engineer is a term for the operations and administration of complex computer systems involving:

 

  • Networking,
  • Virtualized operating system environments including VM/Containers,
  • The orchestrations tools for Networking/Virtualized infrastructure,
  • Applications,
  • The interactions of the above in a multi-site/multi-pop environment,
  • And utilising the above to deliver a service/product and ensuring it is working well.

 

It basically appears to be an operations role but within a complex environment where multiple technology silos interact heavily to deliver the product to the end user. Google hires and assigns the role of Site Reliability Engineer and they are operating such a complex environment delivering Google.com/Gmail.com/Youtube.com etc. Facebook does the same.

 

Looking at a particular set of Site Reliability Engineer job advertisements they appear to have one thing in common for diverse roles within the SRE domain:

 

  • ‘You have an ‘infrastructure as code’ approach to managing infrastructure’

 

So from the Site Reliability Engineer title we reach to the term Infrastructure as a Code approach.

 

Infrastructure as a Code ‘tools’ sort of sit on top of Configurations Management tools like Puppet, Chef, Ansible and provide increased functionality. Terraform and AWS Cloudformation are two Infrastructure as a Code tools but what the Job Ads are asking for is it approach.

 

Coding is common:

  • One thing that is apparent is that when you take a look at an Ansible Playbook YAML .yml file or a Teraform Configuration .tf file or a Chef Recipe .rb file or a Puppet Manifest .pp file or a AWS Cloudformation Template they all look code-like. In fact, they are all code but at a plane where the code is not intended to utilize processor, memory and hard disk of a single machine in a setup.exe resulting file-form to deploy on a single computer. They are coded or code-like data expressions which translate into the deployment, configuration and orchestration of more complicated computing systems. They are code-like expression which for example deploy AWS products which are themselves ‘infrastructure as a service’ public cloud systems.

 

From this it appears that Infrastructure as a Code is a term signifying another layer of abstraction.

There are levels of abstraction. Where the levels can be :

  • solid state physics, silicon/CPU, memory, hard disk hardware.
  • then 0’s and 1’s & bits on top of these at the next level
  • then integers & strings utilizing the above layer
  • then arithmetic operations and string manipulations on top of above
  • then programs and software applications running on computers/devices,
  • then interconnected computers
  • then distributed systems composed of interconnected computers/devices
  • then Public Cloud Infrastructure as a Service
  • then an Application running on this silicon,cpu,memory,hardisk,0’s,1’s, bits, integers, strings, arithmetic operations, string manipulations, individual programs/software, interconnected infrastructure/computers/servers/routers/switches, public cloud.

By inference Infrastructure as a Code approach presents/preserves information that is relevant to our end application plane and environment and abstracts information that is not relevant in our Application’s environment.

The internet is a good examples of multiple systems on top of other systems.

It looks like even Google and Facebook still need a human operate their systems. They will not be flying through them like an aeroplane in clouds. They will know the layers and the systems in place. They will navigate from symptoms to root cause and then codify/rectify & adjust for continual optimal service.

Moving on, three job ads for Site Reliability Engineer are given below.

Infrastructure as Code is common. Lets see the rest.

Job Ad:

  1. Site Reliability Engineer | Data Stores | Redis & Kafka

 

Our Tech Stack across Site Reliability as a whole:

• Data Analytics software including Kafka and Redis
• Open Source technologies (We constantly look to innovate and adopt)
• Amazon Web Services – AWS, and a load of services
• Coding with React, NodeJS and Python
• Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
• Linux Operating systems, we look for passion
• Infrastructure as Code & Automate everything are a couple of our mottos

Job Ad:

  1. Site Reliability Engineers | Multiple Roles | Golang | AWS | ReactThe TechStack you will be getting your hands dirty with:

    • Open Source technologies (We constantly look to innovate and adopt)
    • Amazon Web Services – AWS, and a load of services
    • Coding with React, NodeJS and Python
    • Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
    • Linux Operating systems, we look for passion
    • Infrastructure as Code & Automate everything are a couple of our mottos

 

Job Ad:

  1. Site Reliability Engineer | Edge Computing | AWS | Networking

 

Our Tech Stack across Site Reliability as a whole:

• Networking – Load balancers, Proxies, Routing, DC, AWS
• Open Source technologies (We constantly look to innovate and adopt)
• Amazon Web Services – AWS, and a load of services
• Coding with React, NodeJS and Python
• Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
• Linux Operating systems, we look for passion
• Infrastructure as Code & Automate everything are a couple of our mottos

 

The three SRE roles are diverse and they are geared towards multiple parts of the stack which run the end application. SRE tilting towards Networking, SRE tilting towards Data/Stream Processing and SRE tilting towards Development (Front-End/Back-End).

 

The below are common to all three:

 

  • React, NodeJS and Python
  • Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
  • Linux/AWS

 

The below varies amongst them:

 

  • SRE Networking tilted role – Edge Computing, Load balancers, Proxies, Routing, DC, AWS
  • SRE Data/Stream Processing tilted role – Kafka, Redis
  • SRE Dev tilted role – Golang

 

And so what is this system achieving and what is it composed of? How do they interact and what do the multiple SREs do?

 

In terms of programming languages, we have Golang, React, NodeJS and Python.

In terms hardware we have AWS and Edge Computing PoPs/nodes/devices

In terms of data store and streaming we have Kafka and Redis

In terms of containers management there is Kubernetes

In terms of Data retrievals / search and possibly analytics there is ElasticSearch

In terms database there is Couchbase

 

An SRE is not an end-application software developer. So the above listed tools are part of the system to be run. This will be done with infrastructure as a code approach to programify for optimal operations.

 

So lets now try to put the clues in the Job Description together.

 

  • React & NodeJS are Javascript frameworks with React being the User Interface/FrontEnd (used by Facebook UI) and NodeJS being the Server/BackendEnd for Scalable Data I/O. Python can be used as for programming services at various locations. Golang is also used in the the Backend Serverside providing for its concurrency feature for applications/services.
  • Redis can be used to store application state information. In-memory fast, scalable and distributed. It is a key value store provider for application state cache-like.
  • Kafka is a distributed data streaming platform and can be stated to be in the middle. Producers producing data and consumers using data and stream processors processing it are connected to Kafka clusters. It can be used for event streaming/aggregation.
  • With no other stream processing engine present in the Job description Kafka with Kafka Streams can be stated to provide for stream processing as well.
  • ElasticSearch can be used for indexing and search. Data can be copied in via Kafka connector APIs and then indexed. Kibana is not listed but it might have skipped mentioning and can be used for the visualization and dashboarding.
  • Couchbase can be used as a NoSQL JSON-style distributed database as an external store for storage of logs/events (documents). It can take in data and deliver it via its Kafka connector.
  • Kubernetes manages the containers furbishing the application environment.

 

It looks to be a full Cloud Native environment which needs to be kept up and running optimally with continued service.

 

Part of this environment is the networking aspect.  This includes the listed edge computing component which means this high performance cloud native application also has near-user-location edge devices within its architecture.

 

Geolocation Routing and CDNs are the tools used to decrease application latency times. AWS Availability Zones can be considered as multi-site replicated PoPs. Edge networking nodes will also branch off as required and can be mini PoPs. Depending on the size of the user base being serviced by the Edge PoP node it might scale into being a small DC.

 

Branching within the networking domain is the use of Proxies. Forward + Reverse + Side-car if required.

 

A scenario of an increase in application demand resulting in container scaling which can result in requirement of on demand load balancing and proxying. One such tool is the F5 Application Services Proxy which from a networking perspective is a proxy but it integrates with the Kubernetes and can be used for an infrastructure as a coded deployment. F5’s Application Services Proxy is itself a Node.js application but is middleware here.