Monthly Archives: January 2021

In IT there are different types of work. Not everyone realises this, and not everyone knows which field of work they are in and which they are not. For example, an operations engineer who has spent a considerable part of his career in operations might find it difficult to adjust to a deployment role or an architecture position. If you are in one line, the hiring party generally knows your line and tries to select a person from within that line. There is a difference between running an already built system and setting up a system. There is a difference between setting up a system that is already designed and designing a new system. The design of a system depends on the requirements placed on that system and on the tools and protocols you have at hand. The requirements of the application dictate the design. This work of requirement analysis and design is different from installing the system. Furthermore, design and installation are both different from running a system continuously. These are all different skills; some people know all three and settle in one, while others know only one and work within it. Generally these are titled Operations, which is running a system, Deployment, which is installing a system, and Architecture, which is designing a system.

Installation and Deployment fall under project execution work, and Architecture and Design fall under project planning. Running a system, that is operations, is generally considered non-project work. It is good for an engineer to know the line of business he is in and to choose intelligently: either acquire further skills within his line, or acquire the skills of another line and move into that one. An Operations engineer might work in Ops for a few years, learn design skills and try to move into Deployment project work. He may then move on to Design and Architecture work. Operations work normally runs 24/7, all week, 365 days of the year, and this requires weekend shifts, night shifts and on-call work. Project and Deployment work is generally daytime office-hours work, but site installation work is sometimes done after hours in a planned maintenance window. Architecture and Design is mostly 9-5 business-hours work. Some operations roles are now done internationally in a follow-the-sun manner across countries and timezones. This means that when it is daytime in one country, its engineers are on call and running the system, and after sunset engineers in another country and timezone take over operations in their own daytime. In organisations running like this, ops work is also done in daytime only. Normally these are large organisations spanning the globe with a presence in multiple countries.
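As a rough illustration of follow-the-sun, the sketch below picks which site is on call for a given moment. The three sites and their hand-over hours are my own assumptions, chosen only so that each shift falls roughly in that site's daytime; a real organisation would have its own schedule and would account for daylight saving.

```python
from datetime import datetime, timezone

# Hypothetical hand-over schedule in UTC: (start_hour, end_hour, site).
# Each window falls roughly within that site's local business day.
SHIFTS = [
    (0, 8, "Sydney"),
    (8, 16, "London"),
    (16, 24, "San Francisco"),
]


def on_call_site(moment: datetime) -> str:
    """Return the site whose daytime shift covers the given moment."""
    # Normalise to UTC so the caller can pass any timezone-aware datetime.
    hour = moment.astimezone(timezone.utc).hour
    for start, end, site in SHIFTS:
        if start <= hour < end:
            return site
    # Only reachable if SHIFTS does not cover all 24 hours.
    raise ValueError(f"schedule has a gap at hour {hour} UTC")


if __name__ == "__main__":
    print(on_call_site(datetime.now(timezone.utc)))
```

The point of the design is that nobody works a night shift: the schedule is expressed once in UTC and every hour of the day maps to a site where it is daytime.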


I came across a job ad titled Systems Reliability Engineer, which turns out to call for a sort of hybrid engineer skillset. Its details are copied below. Bear with me while I break things down.

The hybrid part in this is that it requires a combination of:

   – Linux/Unix/Virtualisation which falls generally under SysAdmin roles.

   – Networking which falls under Network Engineer roles.

   – Storage and Server which generally falls under Storage/Backup Engineer roles.

   – Kubernetes which is a container orchestrator and will provide a platform for a distributed application. This is a new field but I think it's safe to say that DevOps Engineer or Platform Engineer role titles handle this responsibility.

   – AWS/Azure/GCP Cloud which are public cloud IaaS, PaaS or FaaS services. This falls under Cloud Engineer or DevOps Engineer roles.

A combination of all of the above knowledge is required to function as a Systems Reliability Engineer here.

And so we can say that a Systems Reliability Engineer is composed of a SysAdmin, Network Engineer, Storage Engineer, DevOps Engineer, Platform Engineer and Cloud Engineer.

Can we break this down a bit more?

Starting with the application workload, it is fair to assume that the heavyweight applications this engineer will support are cloud native, microservices based applications that require a networked distributed system to run. The application needs CPU cores, RAM, storage, IOPS and bandwidth at that scale.

Digging in further, it can be observed that the individual components require an OS and virtualisation (Linux/Unix/KVM etc – SysAdmin). Networking these individual components requires L2/L3 networks (routers, switches – NetEng). Beyond that, what can be called a Distributed System OS is required, which presents to the application not individual components but the combination of servers, OS, routers, switches, vswitches and storage. Kubernetes can be said to be this Distributed System OS, providing orchestration and management of namespaces and containers. A distributed file system and storage servers will also be present. Certain parts of the application may interact with public clouds (AWS/Azure/GCP) to run certain workloads there instead of on the local infrastructure.

Oh dear, what a combination of knowledge and skills this person needs!

In Computer Science we work on the principle of Abstraction Layers, where each layer has its own science and phenomena within itself and provides a function or service to another layer. And so the whole system is composed of multiple Abstraction Layers interacting with each other. In this case the Systems Reliability Engineer requires knowledge spanning multiple Abstraction Layers. Traditional engineers have been functioning within their own Abstraction Layer. Their specific jobs have been complicated enough to require the tips and tricks of that one abstraction layer to make things work. An engineer working in the networking abstraction layer knows how to troubleshoot links, routing, SFPs etc, and an engineer working in the SysAdmin layer knows what to do with the Linux OS, KVM etc. Similarly, an engineer working on the public cloud may actually know the tips and tricks of 1 or 2 public clouds and not all 3. Kubernetes and container management is itself now an Abstraction Layer.

This job advertisement not only lists multiple abstraction layers but even within them it lists multiple tools. For example, within Virtualisation it lists KVM, ESXi and Hyper-V, all three well-known hypervisors, and within Public Cloud it lists GCP, Azure and AWS, all three. So not only does it span abstraction layers, but even within an abstraction layer it asks for familiarity with multiple competing products.

In IT Operations, knowing the right command or the right place to click sometimes matters a lot. Things don't proceed if you don't know the command, where to click or what parameter to enter. Spanning Abstraction Layers, and multiple tools within each Abstraction Layer, is a tricky job for IT Operations. I am guessing they will have a team and will manage the skillset of the team rather than of individual engineers: multiple engineers, each with basic knowledge of the whole system and specific knowledge of 1 or 2 Abstraction Layers and 2 or 3 tools. Team-level skillset management would be an important aspect here.
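To make the idea of team-level skillset management concrete, here is a minimal sketch that checks whether every abstraction layer is covered by enough engineers. The team members, layer names and the minimum of two engineers per layer are all hypothetical values I made up for illustration; nothing here comes from the job ad.

```python
# Hypothetical team: each engineer knows only a few abstraction layers.
TEAM = {
    "engineer_a": {"networking", "public_cloud"},
    "engineer_b": {"linux_virtualisation", "storage"},
    "engineer_c": {"kubernetes", "public_cloud", "linux_virtualisation"},
}

# The layers from the breakdown above, under hypothetical names.
REQUIRED_LAYERS = [
    "linux_virtualisation",
    "networking",
    "storage",
    "kubernetes",
    "public_cloud",
]


def coverage(team, layers):
    """Map each layer to the sorted names of engineers who know it."""
    return {
        layer: sorted(name for name, skills in team.items() if layer in skills)
        for layer in layers
    }


def gaps(team, layers, minimum=2):
    """Return the layers known by fewer than `minimum` engineers."""
    return [
        layer
        for layer, people in coverage(team, layers).items()
        if len(people) < minimum
    ]


if __name__ == "__main__":
    for layer, people in coverage(TEAM, REQUIRED_LAYERS).items():
        print(f"{layer}: {', '.join(people) or 'nobody'}")
    print("needs more cover:", gaps(TEAM, REQUIRED_LAYERS))
```

With this made-up team every layer has at least one person, but networking, storage and kubernetes each depend on a single engineer, which is exactly the kind of gap a shift roster would expose.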

The rest of the job description suggests this is an operations job, as they require full work week availability and troubleshooting skills as well. So this new hybrid engineer will be tasked with on-shift troubleshooting work, supporting customers, speaking to vendors etc. It is important to note that this is not a Project Deployment or Professional Services job where you are reviewing designs, testing solutions, submitting BOMs, reviewing equipment lists, counting items, and installing and configuring systems from scratch. This is an Ops troubleshooting, break-fix role. As such it requires a troubleshooting mindset and sufficient knowledge of the system's functions and its individual components to identify which part of the system is causing a bug or service impact. Once you identify which part is broken (eg networking or virtualisation), you might need to dig a bit deeper and review some logs within that component, to a certain level. Thereafter you make an intelligent decision on either the actions to fix the component or whom to contact next to fix the problem. Each individual component will have its own level 3 support structure and vendor. The Systems Reliability Engineer will identify whether networking, virtualisation, storage or something else is broken, attempt a certain level of fix, and if that does not work, consult the right team or vendor.
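The break-fix flow described above (identify the broken component, attempt a first-level fix, otherwise escalate) can be sketched roughly as follows. The component names, the escalation contacts and the `first_level_fix` callback are hypothetical placeholders; in real life the health checks and fixes would be log reviews and runbook steps, not a dictionary of booleans.

```python
from typing import Callable, Dict

# Hypothetical level 3 escalation contact for each component.
ESCALATION = {
    "networking": "network vendor L3 support",
    "virtualisation": "hypervisor vendor L3 support",
    "storage": "storage vendor L3 support",
}


def triage(health: Dict[str, bool], first_level_fix: Callable[[str], bool]) -> str:
    """Handle the first unhealthy component and report what was decided.

    `health` maps a component name to True (healthy) or False (broken).
    `first_level_fix` attempts a fix and returns True if it worked.
    """
    for component, healthy in health.items():
        if healthy:
            continue
        # Step 1: the on-shift engineer tries a first-level fix.
        if first_level_fix(component):
            return f"fixed {component} at first level"
        # Step 2: otherwise hand over to whoever owns that component.
        owner = ESCALATION.get(component, "internal platform team")
        return f"escalate {component} to {owner}"
    return "all components healthy"


if __name__ == "__main__":
    status = {"networking": True, "virtualisation": True, "storage": False}
    # Pretend the first-level fix did not work, so we expect an escalation.
    print(triage(status, lambda component: False))
```

The engineer does not need level 3 depth in every component here, only enough to tell which component is broken and who owns it.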

As such, when we look at the multiple skill sets required, it looks very complicated for one person to know all this. From my 13 years of experience in IT, still ongoing, we are still in a siloed world where, say, a network engineer with a CCNP is progressing towards senior network engineer and CCIE, or maybe only diversifying with an AWS or Azure skill. A comprehensive, non-siloed, cross abstraction layer engineer with Kubernetes, storage, public cloud, networking, virtualisation and Linux knowledge will probably be difficult to find, because from what I see a lot of people are comfortable within their abstraction layer, and such diversity is not necessary for them and is a big headache. Within networking, which is my field, I feel that network engineers are probably pursuing deeper design knowledge, AWS/Azure diversification or Python network automation as a career path. The same might be true for engineers within the Virtualisation / SysAdmin layer, who might be developing inside that abstraction layer. Trickier still, you need this cross abstraction layer engineer to have an ops and troubleshooting mindset and to be willing to do shifts on weekends. There will be few such people out there. Perhaps some incentives might be required to find the right diverse engineer willing to work weekends, such as permanent work from home, or accepting candidates from any nearby country who can work the right timezone.

These are the new Hybrid Engineers.

Update: I later came to know that they have mentioned that they require 2 or 3 out of the skill set. So it appears they are dividing the skillset at a team level.