What does a Site Reliability Engineer do?

Site Reliability Engineering is a term for the operation and administration of complex computer systems involving:

Networking,
Virtualized operating system environments including VMs/containers,
The orchestration tools for networking/virtualized infrastructure,
Applications,
The interactions of the above in a multi-site/multi-PoP environment,
And utilising the above to deliver a service/product and ensuring that it is working well.

It appears to be an operations role, but within a complex environment where multiple technology silos interact heavily to deliver the product to the end user. Google hires Site Reliability Engineers to operate exactly such an environment, delivering Google.com, Gmail.com, Youtube.com and so on. Facebook does the same.

Looking at a particular set of Site Reliability Engineer job advertisements, the diverse roles within the SRE domain appear to have one thing in common:

‘You have an ‘infrastructure as code’ approach to managing infrastructure’

So from the Site Reliability Engineer title we arrive at the term Infrastructure as Code.

Infrastructure as Code ‘tools’ sit alongside, and sometimes on top of, configuration management tools like Puppet, Chef and Ansible, and provide increased functionality. Terraform and AWS CloudFormation are two Infrastructure as Code tools, but what the job ads are asking for is the approach rather than any particular tool.

Coding is common:

One thing that is apparent is that an Ansible playbook (.yml), a Terraform configuration (.tf), a Chef recipe (.rb), a Puppet manifest (.pp) and an AWS CloudFormation template all look code-like. In fact, they are all code, but on a plane where the code is not built into something like a setup.exe that uses the processor, memory and hard disk of a single machine. They are code, or code-like data expressions, which translate into the deployment, configuration and orchestration of more complicated computing systems. For example, they can deploy AWS products, which are themselves ‘infrastructure as a service’ public cloud systems.

From this it appears that Infrastructure as Code is a term signifying another layer of abstraction.
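The idea of code-like data that translates into infrastructure changes can be sketched in a few lines of Python. This is only an illustration of the declarative approach, not how Terraform or CloudFormation are implemented; the `Server` type, the server names and the `plan` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """A hypothetical piece of infrastructure described as data."""
    name: str
    size: str


def plan(desired, actual):
    """Compare desired state with actual state and list the actions needed."""
    desired_by_name = {s.name: s for s in desired}
    actual_by_name = {s.name: s for s in actual}
    actions = []
    for name, server in sorted(desired_by_name.items()):
        if name not in actual_by_name:
            actions.append(("create", name))
        elif actual_by_name[name] != server:
            actions.append(("update", name))
    for name in sorted(actual_by_name):
        if name not in desired_by_name:
            actions.append(("delete", name))
    return actions


# The "code" an operator would keep in version control (assumed values).
desired = [Server("web-1", "small"), Server("web-2", "small"), Server("cache-1", "large")]
# What is currently running (assumed values).
actual = [Server("web-1", "small"), Server("cache-1", "small"), Server("old-db", "large")]

actions = plan(desired, actual)
for verb, name in actions:
    print(verb, name)
```

The operator only edits the desired list; working out what to create, change or remove is left to the tool, which is the extra layer of abstraction.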

There are levels of abstraction. The levels can be:

solid state physics, silicon/CPU, memory, hard disk hardware,
then 0’s and 1’s (bits) on top of these at the next level,
then integers and strings utilizing the above layer,
then arithmetic operations and string manipulations on top of the above,
then programs and software applications running on computers/devices,
then interconnected computers,
then distributed systems composed of interconnected computers/devices,
then public cloud Infrastructure as a Service,
then an application running on top of all of the above: silicon, CPU, memory, hard disk, bits, integers, strings, arithmetic operations, string manipulations, individual programs/software, interconnected computers/servers/routers/switches, and the public cloud.

By inference, the Infrastructure as Code approach presents/preserves the information that is relevant to our end application plane and environment, and abstracts away the information that is not.

The internet is a good example of multiple systems built on top of other systems.

It looks like even Google and Facebook still need humans to operate their systems. They will not be flying through them like an aeroplane through clouds. They will know the layers and the systems in place. They will navigate from symptoms to root cause and then codify/rectify and adjust for continual optimal service.

Moving on, three job ads for Site Reliability Engineer are given below.

Infrastructure as Code is common. Let’s see the rest.

Job Ad:

Site Reliability Engineer | Data Stores | Redis & Kafka

Our Tech Stack across Site Reliability as a whole:

• Data Analytics software including Kafka and Redis
• Open Source technologies (We constantly look to innovate and adopt)
• Amazon Web Services – AWS, and a load of services
• Coding with React, NodeJS and Python
• Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
• Linux Operating systems, we look for passion
• Infrastructure as Code & Automate everything are a couple of our mottos

Job Ad:

Site Reliability Engineers | Multiple Roles | Golang | AWS | React

The Tech Stack you will be getting your hands dirty with:

• Open Source technologies (We constantly look to innovate and adopt)
• Amazon Web Services – AWS, and a load of services
• Coding with React, NodeJS and Python
• Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
• Linux Operating systems, we look for passion
• Infrastructure as Code & Automate everything are a couple of our mottos

Job Ad:

Site Reliability Engineer | Edge Computing | AWS | Networking

Our Tech Stack across Site Reliability as a whole:

• Networking – Load balancers, Proxies, Routing, DC, AWS
• Open Source technologies (We constantly look to innovate and adopt)
• Amazon Web Services – AWS, and a load of services
• Coding with React, NodeJS and Python
• Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
• Linux Operating systems, we look for passion
• Infrastructure as Code & Automate everything are a couple of our mottos

The three SRE roles are diverse and geared towards different parts of the stack that runs the end application: one tilts towards networking, one towards data/stream processing and one towards development (front-end/back-end).

The following are common to all three:

React, NodeJS and Python
Couchbase, Kubernetes, ElasticSearch & Microservices Infrastructure
Linux/AWS

The following vary amongst them:

SRE Networking tilted role – Edge Computing, Load balancers, Proxies, Routing, DC, AWS
SRE Data/Stream Processing tilted role – Kafka, Redis
SRE Dev tilted role – Golang

So what is this system achieving and what is it composed of? How do the parts interact and what do the multiple SREs do?

In terms of programming languages, we have Golang, React, NodeJS and Python.

In terms of hardware, we have AWS and edge computing PoPs/nodes/devices.

In terms of data store and streaming, we have Kafka and Redis.

In terms of container management, there is Kubernetes.

In terms of data retrieval/search and possibly analytics, there is ElasticSearch.

In terms of database, there is Couchbase.

An SRE is not an end-application software developer, so the tools listed above are part of the system to be run. This will be done with an Infrastructure as Code approach, putting operations into code so that the system runs optimally.

So let’s now try to put the clues in the job descriptions together.

React and NodeJS both belong to the JavaScript ecosystem: React is a library for building the user interface/front end (it is used by the Facebook UI), and NodeJS is a server-side runtime for the back end, suited to scalable data I/O. Python can be used for programming services at various locations. Golang is also used on the back end/server side, where its concurrency features suit applications/services.
Redis can be used to store application state information. It is an in-memory key-value store that is fast, scalable and distributed, and it can act as a cache for application state.
Kafka is a distributed data streaming platform and can be said to sit in the middle: producers that generate data, consumers that use it and stream processors that process it all connect to Kafka clusters. It can be used for event streaming/aggregation.
With no other stream processing engine present in the job descriptions, Kafka with Kafka Streams can be assumed to provide the stream processing as well.
ElasticSearch can be used for indexing and search. Data can be copied in via Kafka connectors and then indexed. Kibana is not listed, but it may simply have been left out, and it can be used for visualization and dashboarding.
Couchbase can be used as a distributed NoSQL JSON document database, serving as an external store for logs/events (documents). It can take in and deliver data via its Kafka connector.
Kubernetes manages the containers that furnish the application environment.
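To make the producer/consumer picture for Kafka a little more concrete, here is a minimal in-memory sketch of an append-only log with per-group read offsets. It is a toy stand-in and not Kafka's client API; the `ToyTopic` class, the group names and the events are all hypothetical.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single-partition topic (not Kafka's API)."""

    def __init__(self):
        self._log = []                    # append-only list of events
        self._offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, event):
        """Append an event and return its offset in the log."""
        self._log.append(event)
        return len(self._log) - 1

    def consume(self, group, max_events=10):
        """Return the next unread events for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
topic.produce("user-login")
topic.produce("page-view")

# Two independent consumer groups read the same log at their own pace.
indexer_batch = topic.consume("search-indexer")
dashboard_batch = topic.consume("dashboard", max_events=1)
```

Because each group keeps its own offset, an indexer (feeding ElasticSearch, say) and a dashboard can read the same events without interfering with each other.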

It looks to be a fully cloud native environment which needs to be kept up and running optimally, with continuous service.

Part of this environment is the networking aspect. This includes the listed edge computing component, which means this high-performance cloud native application also has edge devices near the users’ locations within its architecture.

Geolocation routing and CDNs are tools used to decrease application latency. AWS Availability Zones can be considered as multi-site replicated PoPs. Edge networking nodes will also branch off as required and can be mini PoPs. Depending on the size of the user base serviced by an edge PoP node, it might scale into being a small DC.

Branching off within the networking domain is the use of proxies: forward, reverse and side-car, as required.

Consider a scenario where an increase in application demand results in container scaling, which in turn requires on-demand load balancing and proxying. One tool for this is the F5 Application Services Proxy, which from a networking perspective is a proxy but integrates with Kubernetes and can be deployed with an Infrastructure as Code approach. F5’s Application Services Proxy is itself a Node.js application but acts as middleware here.
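As a rough illustration of geolocation routing, the sketch below picks the PoP closest to a user by great-circle distance. The PoP names and coordinates are assumptions made up for the example, and real geolocation routing typically works from DNS and IP geolocation data or measured latency rather than exact coordinates.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical PoP names with approximate (latitude, longitude) values.
POPS = {
    "pop-london": (51.5, -0.1),
    "pop-virginia": (38.9, -77.0),
    "pop-singapore": (1.35, 103.8),
}


def great_circle_km(a, b):
    """Haversine distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))


def nearest_pop(user_location, pops=POPS):
    """Return the name of the PoP with the smallest distance to the user."""
    return min(pops, key=lambda name: great_circle_km(user_location, pops[name]))


# A user near Paris is sent to the London PoP in this toy setup.
print(nearest_pop((48.85, 2.35)))
```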

HTTP Compression

March 21, 2014

Layer 7

HTTP compression is a feature used to compress web pages prior to transmission over the network. The compression and decompression are done at the web server and browser respectively. RFC 2616’s section on Content Codings provides detail on the supported compression methods, with gzip being widely used.

Within the Firefox browser, if we type “about:config” in the address field and search for encoding, it takes us to the accepted encodings setting, whose value is gzip, deflate.

We can modify the value here in the browser configuration and remove gzip. Doing so will make Firefox not send gzip as a supported compression format in its HTTP GET request. Web servers which use gzip will then send uncompressed data.

I decided to test how website load times vary with compression enabled and disabled. I toggled the browser configuration above (removing/adding gzip) a couple of times and reloaded the same page, clearing the browser history each time. Simultaneously I took Wireshark traces capturing the HTTP GET requests and OK responses, to compare the time and byte counts and see the changed packets.
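The server side of this negotiation can be sketched roughly as follows. This is a simplified illustration and not how any particular web server implements it: the `respond` function is hypothetical, and it ignores quality values such as `gzip;q=0` that a real server would honour.

```python
import gzip


def respond(body: bytes, accept_encoding: str):
    """Return (headers, payload), compressing only if the client lists gzip."""
    # Split "gzip, deflate" into tokens and drop any ";q=..." parameters.
    offered = {token.split(";")[0].strip().lower() for token in accept_encoding.split(",")}
    if "gzip" in offered:
        payload = gzip.compress(body)
        return {"Content-Encoding": "gzip", "Content-Length": str(len(payload))}, payload
    return {"Content-Length": str(len(body))}, body


page = b"<p>Hello, compression!</p>" * 200  # repetitive sample body

headers_gzip, payload_gzip = respond(page, "gzip, deflate")  # default browser setting
headers_plain, payload_plain = respond(page, "deflate")      # gzip removed in about:config

print(headers_gzip)
print(headers_plain)
```

With gzip removed from the request header, the same function falls back to sending the body as it is, which is the behaviour being tested here.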

The difference was obvious. Initially the HTTP GET request sent by Firefox contains an Accept-Encoding value of both gzip and deflate. This is shown below.

The corresponding HTTP OK response contains 8 TCP Segments totaling 9821 bytes and is received within a second. This is shown below.

After making the changes to the about:config configurations and removing gzip, the values changed. The HTTP GET request no longer contains gzip under Accept-Encoding. This is shown below.

The difference seen in the HTTP OK message was big. There were 21 reassembled TCP segments totaling 26751 bytes, and the page took longer to arrive. This can be seen below.

If we compare the Content-Length values seen by HTTP at Layer 7, they are 9534 bytes for the gzipped page and 26487 bytes for the non-gzipped page. A significant difference.

This shows that when compression is carried out between the web server and the browser, there is an improvement on the network in terms of saved bandwidth and load time, and therefore in user experience.
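The saving can be worked out directly from the byte counts quoted above. The short calculation below uses only those four numbers; the variable and function names are just for illustration.

```python
# Byte counts taken from the Wireshark traces described above.
gzip_content_length = 9534     # Content-Length with gzip
plain_content_length = 26487   # Content-Length without gzip
gzip_response_bytes = 9821     # reassembled HTTP OK response with gzip
plain_response_bytes = 26751   # reassembled HTTP OK response without gzip


def saving_percent(compressed, uncompressed):
    """Percentage of bytes saved by sending the compressed version."""
    return 100 * (1 - compressed / uncompressed)


body_saving = round(saving_percent(gzip_content_length, plain_content_length), 1)
response_saving = round(saving_percent(gzip_response_bytes, plain_response_bytes), 1)

print(f"Body saving: {body_saving}%")          # about 64%
print(f"Response saving: {response_saving}%")  # about 63%
```

In other words, for this page gzip cut the transferred bytes by a little under two thirds.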

Brocade ADX ServerIron Load Balancer

July 27, 2013

Layer 7, Load Balancing

I recently took out some time to study the Brocade ADX ServerIron Load Balancer in greater detail. Load balancing is a common practice used to distribute workload across multiple resources. The recent trend of applications using scale-out methods instead of scale-up methods (for growth) has increased the importance of load balancing platforms, both from the perspective of balancing traffic and of high availability of application servers.

Brocade’s ADX offers all the regular features of any good load balancing platform, including L4 load distribution via multiple methods (called ‘Predictors’ in ADX, including round robin, weighted round robin etc.), L2-L7 health checks, SSL offload, Global Server Load Balancing (GSLB) via DNS, persistence or ‘stickiness’ of connections, L7 content switching, TCP SYN attack protection, high availability/redundancy etc. The ADX can also be used as a NAT64 gateway for IPv6 transition.

All this is well and good. What’s more, the ADX also offers device-level virtualization (up to 32 on the ADX10k) and hence secure, compliant multitenancy support, VXLAN VTEP termination, Application Resource Broker (ARB) SDN-related services for on-demand resource provisioning, an ARB vSphere plugin and the OpenScript engine (Perl based), offering advanced content manipulation features.
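For readers unfamiliar with the weighted round robin predictor mentioned earlier, the general idea can be sketched as below. This is a naive illustration of the technique and not the ADX’s implementation; the server names and weights are hypothetical.

```python
from itertools import cycle


def weighted_round_robin(weights):
    """Cycle through server names in proportion to their integer weights."""
    # Repeat each server name as many times as its weight, then loop forever.
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)


# app-1 is assumed to be three times as capable as app-2.
picker = weighted_round_robin({"app-1": 3, "app-2": 1})
first_eight = [next(picker) for _ in range(8)]
print(first_eight)
```

This simple version sends requests to the heavier server in bursts; production load balancers usually interleave the choices more smoothly, but the proportions come out the same.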

Anyone interested in reading up can consult the following resources:

Looks like a great product with multiple use cases, now and in the near future.

Mind Map: Setup of Linux based L4 Load Balancing LVS + keepalived HA

March 7, 2013

Layer 7, Load Balancing, Mind Maps, SysAdmin

Disclaimer: Some steps may be missing or different as per varying requirements. I have deployed this in production using the above steps.

—networkfuel
