Archive

Distributed Systems

Given below is a prompt I gave to ChatGPT (GPT-4o):

“In Azure networking what is a routing domain and what are on its boundaries. What are the inter-routing-domain routing types and what are any translations on the networking routing domain boundaries.”

and here is the answer I got from ChatGPT:

In Azure networking, a routing domain is a segment of the network where all the resources follow the same routing logic and policies. Routing domains can be used to segment traffic, apply specific routing policies, and ensure network isolation. Each routing domain typically has its own set of routing tables that determine how traffic should be forwarded within and between domains.

Boundaries of a Routing Domain

The boundaries of a routing domain in Azure are typically defined by the following components:

  1. Virtual Networks (VNets): VNets are fundamental building blocks in Azure networking, and each VNet can be considered a routing domain.
  2. Subnets: Subnets within a VNet can also be routing domains, depending on the routing policies applied.
  3. Network Security Groups (NSGs): NSGs can define boundaries by controlling inbound and outbound traffic at the subnet or network interface level.
  4. Route Tables: Route tables attached to subnets define the specific routing within that subnet.

Inter-Routing-Domain Routing Types

Inter-routing-domain routing types in Azure include:

  1. Peering:
  • VNet Peering: Allows you to connect VNets within the same Azure region or across regions. VNets can communicate as if they were on the same network, but they maintain their own routing tables and boundaries.
  • Global VNet Peering: Extends VNet peering to allow communication across different Azure regions.
  2. VPN Gateways:
  • Site-to-Site VPN: Connects an on-premises network to an Azure VNet over a secure VPN tunnel.
  • VNet-to-VNet VPN: Connects two or more VNets using VPN gateways.
  3. ExpressRoute:
  • ExpressRoute Circuits: Provides a private connection between Azure datacenters and on-premises infrastructure, bypassing the public internet.
  • ExpressRoute Direct: Offers dedicated, private connections for high-throughput and latency-sensitive applications.
  4. Azure Virtual WAN:
  • Hub-and-Spoke Architecture: Uses Virtual WAN hubs to connect VNets and on-premises networks in a centralized manner.
  • Any-to-Any Connectivity: Facilitates any-to-any connectivity through Virtual WAN.

Translations on the Networking Routing Domain Boundaries

When traffic crosses the boundaries of routing domains, certain translations and adjustments might occur to ensure proper routing and security. These include:

  1. NAT (Network Address Translation):
  • Public IP Addresses: When traffic exits an Azure VNet to the internet, it typically uses a public IP address, requiring NAT.
  • Azure Firewall and NAT Gateway: These services provide NAT capabilities for outbound traffic.
  2. Route Propagation:
  • BGP (Border Gateway Protocol): Used in ExpressRoute and VPN gateways to propagate routes between Azure and on-premises networks.
  3. Access Control:
  • NSGs and Application Security Groups (ASGs): Apply security rules to control traffic entering and exiting a routing domain.
  • Azure Firewall: Acts as a centralized security policy enforcement point, applying rules to traffic crossing routing domain boundaries.
  4. Uptime and Performance Management:
  • Traffic Manager and Front Door: These services help manage traffic routing for high availability and performance, ensuring that traffic is directed to the most appropriate endpoints.

Summary

In Azure networking, routing domains help segment and control network traffic. Their boundaries are defined by components like VNets, subnets, NSGs, and route tables. Inter-routing-domain routing can be achieved through VNet peering, VPN gateways, ExpressRoute, and Virtual WAN. Translations and adjustments such as NAT, route propagation, and access control are necessary to manage traffic crossing these boundaries.

———————— End of ChatGPT Response ————————

It’s obvious that if you know the fundamentals of packet forwarding then grasping how Azure or AWS is doing it won’t be difficult. Here is how I would put it:

An IP is an endpoint. A block of IPs is a subnet, i.e. a group of endpoints. A routing domain houses groups of IP endpoints: it can be considered a boundary behind which one or more subnets, i.e. groups of IP endpoints, exist.

The rest is in the answer to the prompt. A routing domain has things inside it (endpoints and groups of endpoints) and it has a boundary. There are inter-routing-domain constructs, and there are translation and/or security constructs on the boundary as well.

In Azure, a VNet is a routing domain. As always, the right prompt gives a lot of good information.
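To make that mental model concrete, here is a small illustrative Python sketch (a toy, not any Azure API; all names are made up). A routing domain holds subnets of IP endpoints behind a boundary; traffic to anything inside stays local, traffic to a peered domain is handed to an inter-domain construct on the boundary, and everything else hits a default/NAT route.

```python
import ipaddress

# Toy model of the mental picture above (not an Azure API):
# a routing domain is a boundary housing one or more subnets,
# i.e. groups of IP endpoints, plus inter-domain constructs on the boundary.
class RoutingDomain:
    def __init__(self, name, address_space):
        self.name = name
        self.address_space = ipaddress.ip_network(address_space)
        self.subnets = []          # groups of endpoints inside the boundary
        self.boundary_routes = {}  # inter-domain constructs: peerings, gateways

    def add_subnet(self, cidr):
        subnet = ipaddress.ip_network(cidr)
        assert subnet.subnet_of(self.address_space)
        self.subnets.append(subnet)

    def peer_with(self, other):
        # Crossing the boundary: traffic for the other domain's prefix
        # is handed to a peering construct, on both sides.
        self.boundary_routes[other.address_space] = f"peering->{other.name}"
        other.boundary_routes[self.address_space] = f"peering->{self.name}"

    def next_hop(self, dst_ip):
        ip = ipaddress.ip_address(dst_ip)
        if any(ip in subnet for subnet in self.subnets):
            return "local (inside the routing domain)"
        for prefix, construct in self.boundary_routes.items():
            if ip in prefix:
                return construct
        return "default route / NAT at the boundary"

vnet_a = RoutingDomain("vnet-a", "10.0.0.0/16")
vnet_a.add_subnet("10.0.1.0/24")
vnet_b = RoutingDomain("vnet-b", "10.1.0.0/16")
vnet_a.peer_with(vnet_b)

print(vnet_a.next_hop("10.0.1.5"))   # local (inside the routing domain)
print(vnet_a.next_hop("10.1.2.7"))   # peering->vnet-b
print(vnet_a.next_hop("8.8.8.8"))    # default route / NAT at the boundary
```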

In networking, Cybersecurity should now be Distributed Cybersecurity: each organization and each home independently capable of monitoring its own traffic and inspecting it on demand.

In this paradigm the WiFi AP at home is capable of sending its in/out traffic data to somewhere within the home, and the APs and routers in an office setup are capable of sending their in/out information to somewhere within the local office.

This requires NetFlow-capable WiFi access points and NetFlow-capable CE edge routers. It also requires setting up a NetFlow receiving station where the information sent by the WiFi access points and Cisco routers in NetFlow format will be received and displayed.
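As a rough sketch of what the ingestion side of such a receiving station could look like, here is a minimal Python UDP listener that parses NetFlow v5 packets and prints the flows. It assumes the exporters send NetFlow v5 to UDP port 2055; a real collector would also store and display the flows rather than just print them.

```python
import socket
import struct

# Minimal sketch of a NetFlow receiving station (assumes the WiFi APs and
# CE routers export NetFlow v5 to UDP port 2055).
NETFLOW_PORT = 2055
HEADER_FMT = "!HHIIIIBBH"                 # v5 header: version, count, uptime,
HEADER_LEN = struct.calcsize(HEADER_FMT)  # unix secs/nsecs, sequence, engine, sampling
RECORD_LEN = 48                           # each v5 flow record is 48 bytes

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", NETFLOW_PORT))

while True:
    data, (exporter, _) = sock.recvfrom(8192)
    if len(data) < HEADER_LEN:
        continue
    version, count = struct.unpack("!HH", data[:4])
    if version != 5:
        continue                          # only the fixed v5 format is handled here
    for i in range(count):
        rec = data[HEADER_LEN + i * RECORD_LEN : HEADER_LEN + (i + 1) * RECORD_LEN]
        if len(rec) < RECORD_LEN:
            break
        src, dst = struct.unpack("!II", rec[:8])          # srcaddr, dstaddr
        octets = struct.unpack("!I", rec[20:24])[0]       # bytes in the flow
        sport, dport = struct.unpack("!HH", rec[32:36])   # src/dst ports
        print(f"{exporter}: {socket.inet_ntoa(struct.pack('!I', src))}:{sport} -> "
              f"{socket.inet_ntoa(struct.pack('!I', dst))}:{dport} ({octets} bytes)")
```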

There are two edge computing specifications of note.

Facebook OCP’s CG-OpenRack-19 and LinkedIn’s Open19.

They provide for Rack Layouts, Compute, Storage and Networking.

Networking for CG-OpenRack-19 is copied below. The server sleds in the pictures appear to be single-homed, going by the colors. It would be interesting to find out which protocol handles the active-active state of multi-homed compute and storage sleds, if that is present at all.

[Figure: CG-OpenRack-19 networking layout, given here]

Open19 gives 100G bandwidth capabilities and some details are on its website.

5G’s ultra-low-latency edge requirements could require edge solutions such as these, and it would be interesting to see how things play out ahead.

This also brings to mind SD-WAN, because these edge racks will at the very least be connected over a large WAN.

Google’s B4 is one of its software-defined inter-data-center WAN solutions. Google’s Espresso is its peering edge solution. Espresso links into the B4 domain via B2. This link has the details of Espresso as shared by the Google team.

 

[Figure: Google Espresso and B4]

Google is not employing an army of networking engineers to run these, because they are software defined and programmed bots will probably be doing the operational tasks. To operate this network there are Site Reliability Engineers, though.

Here is one public job advertisement that shows what an SRE is expected to be like:

We have reliable infrastructure and can spin up new environments in a couple of hours. Automate everything so there is more time for exploring and learning. Foster the DevOps mindset

What are our goals?
  Internationalisation
  Deploying multiple data centers
  Deploying every 5 minutes
Requirements
  Experience with Java or JavaScript in a Dockerised environment
  Linux Engineering/Administration
  Desire for improving processes
  Have a passion and most importantly, a sense of humour
Tech Stack (you DO NOT need experience in all of these)
  Kubernetes + Docker
  Terraform + Ansible
  Linux
  Kotlin + NodeJS
  ELK stack
  AWS

This is obviously an SRE for the server side and the application enablement side of things. If there is a large software-defined edge network like Espresso, a large edge-to-DC network like B2 and a large software-defined inter-DC network like B4, you will need a different kind of SRE.

Here is Google’s version of a Site Reliability Engineer Job.

Job description
Minimum Qualifications

BS degree in Computer Science or related technical field involving coding (e.g. physics or mathematics), or equivalent practical experience.
3 years of experience working with algorithms, data structures, complexity analysis and software design.
Experience in one or more of the following: C, C++, Java, Python, Go, Perl or Ruby.

Preferred Qualifications

Systematic problem-solving approach, coupled with effective communication skills and a sense of ownership and drive.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Ability to debug and optimize code and automate routine tasks.

About The Job

Hope is not a strategy. Engineering solutions to design, build, and maintain efficient large-scale systems is a true strategy, and a good one.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google’s services—both our internally critical and our externally-visible systems—have reliability and uptime appropriate to users’ needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.

SRE is also a mindset and a set of engineering approaches to running better production systems—we build our own creative engineering solutions to operations problems. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.

We can see that Google’s SRE job ad is all about software, along with large-scale distributed systems requirements.

Now if we note this extract from the Wikipedia SD-WAN article:

“With a global view of network status, a controller that manages SD-WAN can perform careful and adaptive traffic engineering by assigning new transfer requests according to current usage of resources (links). For example, this can be achieved by performing central calculation of transmission rates at the controller and rate-limiting at the senders (end-points) according to such rates”

and we also note this extract:

“As there is no standard algorithm for SD-WAN controllers, device manufacturers each use their own proprietary algorithm in the transmission of data. These algorithms determine which traffic to direct over which link and when to switch traffic from one link to another. Given the breadth of options available in relation to both software and hardware SD-WAN control solutions, it’s imperative they be tested and validated under real-world conditions within a lab setting prior to deployment.”

We see Algorithms.

It’s clear that there are different algorithms running these software-defined networks (Google’s software-defined Espresso, B2, B4 and Jupiter). These algorithms automate, kick in and optimize. Google becomes a large-scale distributed system with various algorithms here and there. While Software Architects and Software Engineers will have developed these algorithmic nodes and programmed them into network devices/servers, an SRE is the human who will operate the system. A team of SREs.
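As a toy illustration of the “central calculation of transmission rates” idea from the SD-WAN extract above, here is a max-min fair allocation in Python: the sort of calculation a controller with a global view could run before pushing rate limits to the senders. It is deliberately simplified and is not any vendor’s actual algorithm.

```python
# Toy central rate calculation: split a link's capacity across senders'
# demands, max-min fairly. A controller would push the results to the
# senders as rate limits.
def max_min_fair(link_capacity, demands):
    allocation = {sender: 0.0 for sender in demands}
    remaining = dict(demands)
    capacity = float(link_capacity)
    while remaining and capacity > 0:
        fair_share = capacity / len(remaining)
        satisfied = [s for s, d in remaining.items() if d <= fair_share]
        if not satisfied:
            # nobody's demand fits in a fair share: everyone gets an equal cut
            for s in remaining:
                allocation[s] += fair_share
            break
        for s in satisfied:
            allocation[s] += remaining[s]
            capacity -= remaining[s]
            del remaining[s]
    return allocation

# Controller view: one 10 Gb/s link, three sites asking for bandwidth.
print(max_min_fair(10, {"site-a": 2, "site-b": 7, "site-c": 9}))
# site-a gets its 2; the leftover 8 is split 4/4 between site-b and site-c
```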

One aspect of networking protocols is that they are made for multi-vendor, multi-enterprise and multi-domain environments. They provide a simple consensus for connecting two or more different network devices.

To take a merchant-silicon network device like OCP’s Wedge and OCP-style servers and make one large network like Google’s out of them will require software engineering to remake at least the NOS (Network Operating System) part. There will be at least a Meta-NOS, running somewhat on top of a typical NOS, which would handle the SDN, i.e. the software-defined algorithms, in addition to the SDN controllers talking to this Meta-NOS. Multiple layers of SDN controllers will be talking to each other; you can call this a network protocol or an SDN algorithm, but it will be part of a distributed systems software architecture and it will be programmed in place by software engineers.

Large Scale Distributed System on Merchant Silicon Hardware – Software Defined Meta-NOS – SDN Controllers – Hierarchical SDN Controllers – Algorithms.
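Purely as a sketch of the layering just described (made-up class names, not a real NOS or controller API): a global controller delegates to regional controllers, which program a Meta-NOS agent sitting above each device’s stock NOS.

```python
# Toy hierarchy: GlobalController -> RegionalController -> MetaNosAgent.
# Illustrative only; real systems would use RPCs, intent models and
# vendor/ASIC programming interfaces underneath.
class MetaNosAgent:
    """Runs on the switch, above the merchant-silicon vendor NOS."""
    def __init__(self, device):
        self.device = device
        self.fib = {}                     # prefix -> next hop

    def program_route(self, prefix, next_hop):
        self.fib[prefix] = next_hop       # would translate into NOS/ASIC calls
        print(f"{self.device}: {prefix} via {next_hop}")

class RegionalController:
    """Middle layer: programs the Meta-NOS agents in its region."""
    def __init__(self, name, agents):
        self.name = name
        self.agents = agents

    def push(self, prefix, next_hop):
        for agent in self.agents:
            agent.program_route(prefix, next_hop)

class GlobalController:
    """Top of the hierarchy: decides intent, delegates to regions."""
    def __init__(self, regions):
        self.regions = regions

    def announce(self, prefix, next_hop):
        for region in self.regions:
            region.push(prefix, next_hop)

agents = [MetaNosAgent("wedge-1"), MetaNosAgent("wedge-2")]
global_ctrl = GlobalController([RegionalController("region-east", agents)])
global_ctrl.announce("10.20.0.0/16", "spine-1")
```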

Sounds like a Program Management task instead of a PMP-scale Engineering Project Management task. You will need Mathematicians to sit with Network Architects, Distributed Systems Architects and Software Architects. The Mathematicians will give the algorithms. They will be important too.

Fun times.