Last post, I looked at the concept of cloud and its implications on how we actually build systems these days. As much as I'd like to now actually dive in and use Amazon Web Services, there are a few AWS concepts that we have to cover first so we can do something as simple as even making our virtual machine accessible to the public internet. This post will cover many of the fundamental networking and infrastructure building blocks within AWS.
I just want to throw out a blanket paragraph or two on just networking in general. Networking is an area of study and research in and of itself... and there is no way I'm doing any justice to the field, but I feel like I need to preface the following sections with a quick overview of vocab and basic concepts if nothing else. I don't claim to be an expert in computer networking, but I think I know enough to stand up a basic infrastructure on AWS.
First, we kind of have to talk about what the internet is. The internet, in general, is a bunch of servers, routers, and switches joined by physical links all across the world so that end devices (servers, laptops, phones, IoT devices... etc) can communicate with each other. At the physical layer, there are physical channels of communication all around us and all around the world... not too different from two cans connected by a string, to be honest!
Lol... all these great stock images... Anyways, back to networking. The string between the cans basically carries analog vibrations from one can to another and acts as a channel of communication. In a similar way, a copper wire conducts electricity that is quantized into a digital signal between two "devices", but that copper wire is playing the same role as the string between the cans - it's a medium that a signal travels through. The copper wire was traditionally the backbone of digital communications - old telephone switchboards connected two phones with copper, ATM networks provided low-speed internet in its early heyday over copper, and DSL technology provided high-speed (well, in ITS heyday as well haha) internet through copper. Nowadays, creativity and research have provided us many different ways to communicate. We have fibre optic channels that use photons to represent the signal (fibre can carry far more data over much longer distances than copper), and we have wireless radios that transmit electromagnetic waves through the air, which now acts as the medium. Great for us, we don't even have to build or configure AIR!
Having all these mediums of communication around us, we can start to use them as a base for smarter communication. Through the internet, we can reach any destination that is connected to the broader internet network. There are literally fibre optic cables running through oceans to connect two continents. Luckily for us, photons travel much faster than we can swim across an ocean! The internet, then, is just the entire collection of all these mediums that are somehow connected to each other. Internet Service Providers (ISPs) like AT&T and Verizon (or Bell and TELUS here in Canada) own much of the physical infrastructure that covers their respective countries, but also the peering capabilities to connect to other ISPs' networks in other countries. The broader aggregation of these massive networks that cover entire countries essentially makes up the backbone of the internet. If I need to reach Netflix from my home here in Edmonton, Alberta, Canada, the signal from my laptop would go from inside my home to my internet service provider's network (let's say TELUS), to the network in the states that is peered with TELUS (let's say AT&T), to Netflix's servers in what I would assume to be Seattle because that's likely their datacenter that's closest to my house / western Canada.
The image below shows a rough (maybe too detailed) sketch of what a network infrastructure could look like from your home to your ISP to another peering ISP:
The "customer premise" is our home, the "local carrier end office" is our ISP's switching office, and the "AT&T ISP pop" is the point of presence (POP) where our ISP connects with AT&T. AT&T will then take on the traffic and route it to the final destination.
But wait, how do we define a destination? Well, a destination on the internet is defined by an IP address. IP is the Internet Protocol, a digital signal protocol which allows us to route from one machine on this broader network to another. Every machine on the internet has an IP address, and in fact, mine seems to be 172.20.10.2 at the time of writing this:
Here, "en1" indicates my network adapter on my computer, and my "inet" address refers to my "internet address". Every single node on the internet will be defined by an IP, and AWS EC2 instances are no different.
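As a quick aside, Python's standard `ipaddress` module can tell us a bit more about that address. This is just a small sketch, but it shows that 172.20.10.2 actually falls inside one of the reserved private ranges (172.16.0.0/12), so it is an address on my local network rather than one the public internet can reach directly. The 8.8.8.8 address is only there as a well-known public address for comparison.

```python
import ipaddress

# The address reported for the "en1" adapter above.
addr = ipaddress.ip_address("172.20.10.2")

print(addr.version)     # 4
print(addr.is_private)  # True

# 172.16.0.0/12 is one of the ranges reserved for private networks.
private_block = ipaddress.ip_network("172.16.0.0/12")
print(addr in private_block)  # True

# A well-known public address, for comparison.
print(ipaddress.ip_address("8.8.8.8").is_private)  # False
```

This matters later on: a private address like mine only becomes reachable from the outside world once something translates it to a public one.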
In the last post, we discussed the idea of an analog electricity signal being converted into bits in a copper wire. Again, the signal "10101010" might be represented with pure voltage across a wire like:
You could imagine how we could represent this with photons or electromagnetic waves in the air as well, but how does this actually represent an IP address, or even any message that we want to send? Well, these analog signals are converted into digital bits, which are then organized as bytes (8 bits, which can be used to represent alphanumeric and symbol characters), which are then organized as packets (groups of bytes). There exist tons of protocols to send messages within so many different contexts, but for the most part, most internet traffic transmits packets on top of two protocols: Transmission Control Protocol (TCP) and Internet Protocol (IP).
TCP is a protocol that adds additional safety mechanisms to make sure a message was received in its entirety. For example, if the transmitting machine loses internet connectivity during the transmission of a message and comes back online right after, how does the receiving machine know whether it received the full message? And if it realizes that it didn't get the full message, how does it ask for the missing part again? I won't go into the details here, but this is what TCP is all about.
IP is the protocol which adds a header to each packet, tagging it with a source IP address and a destination IP address. Switches and routers read this IP header and then, through a series of routing protocols, figure out the best route across the internet to the destination. For example, there is a routing protocol called... Routing Information Protocol (RIP) which chooses the path with the fewest "hops", but there is another protocol called Open Shortest Path First (OSPF) which chooses the path with the lowest total link cost, where cost is typically derived from bandwidth so that faster links are preferred.
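To make the bits → bytes → packets idea a bit more concrete, here is a rough sketch in Python. The message text, the 4-byte packet size, and the `packetize` helper are all made up for illustration (real packets are far larger and carry headers), but the layering is the same.

```python
# The 8-bit signal from the voltage example, read as one byte.
signal = "10101010"
value = int(signal, 2)
print(value)  # 170

# Text becomes bytes: one byte per ASCII character.
message = "Hello, AWS!".encode("ascii")
print(" ".join(f"{b:08b}" for b in message[:2]))  # 01001000 01100101


def packetize(data: bytes, size: int) -> list[bytes]:
    """Split a byte string into fixed-size chunks (hypothetical helper)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


packets = packetize(message, 4)
print(packets)  # [b'Hell', b'o, A', b'WS!']
```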
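Here is a toy sketch of the "did I get everything, and if not, how do I ask again?" idea. To be clear, this is not how TCP is actually implemented (real TCP uses acknowledgements, timers, windows, and a lot more); the function names and the 5-byte segment size are my own assumptions. It only shows why numbering each piece of a message lets the receiver spot a gap.

```python
def split_into_segments(data: bytes, size: int) -> dict[int, bytes]:
    """Label each chunk with its byte offset, loosely like a sequence number."""
    return {off: data[off:off + size] for off in range(0, len(data), size)}


def missing_offsets(received: dict[int, bytes], total: int, size: int) -> list[int]:
    """Offsets the receiver expected but never saw."""
    return [off for off in range(0, total, size) if off not in received]


message = b"the quick brown fox jumps"
sent = split_into_segments(message, 5)

# Simulate the connection dropping while the segment at offset 10 was in flight.
received = {off: seg for off, seg in sent.items() if off != 10}

gaps = missing_offsets(received, len(message), 5)
print(gaps)  # [10]

# The receiver asks again for only what is missing, then reassembles in order.
for off in gaps:
    received[off] = sent[off]
reassembled = b"".join(received[off] for off in sorted(received))
print(reassembled == message)  # True
```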
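The difference between "fewest hops" and "lowest cost" is easier to see on a tiny example. The topology, the bandwidth numbers, and the `1000 / bandwidth` cost formula below are all assumptions for illustration. I'm also using Dijkstra's algorithm for both metrics just to compare them; OSPF does use a shortest-path-first calculation like this, while RIP gets its answer through a different (distance-vector) mechanism.

```python
import heapq

# Hypothetical topology: link bandwidth in Mbps.
LINKS = {
    ("A", "B"): 10,
    ("B", "D"): 10,
    ("A", "C"): 1000,
    ("C", "E"): 1000,
    ("E", "D"): 1000,
}


def build_graph(cost_of):
    """Build an undirected graph, weighting each link with cost_of(bandwidth)."""
    graph = {}
    for (u, v), bandwidth in LINKS.items():
        graph.setdefault(u, {})[v] = cost_of(bandwidth)
        graph.setdefault(v, {})[u] = cost_of(bandwidth)
    return graph


def shortest_path(graph, src, dst):
    """Dijkstra's algorithm: returns (total_cost, path) or None."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in seen:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None


hop_graph = build_graph(lambda bandwidth: 1)                  # every link counts as 1 hop
cost_graph = build_graph(lambda bandwidth: 1000 / bandwidth)  # faster link, lower cost

print(shortest_path(hop_graph, "A", "D"))   # (2, ['A', 'B', 'D'])
print(shortest_path(cost_graph, "A", "D"))  # (3.0, ['A', 'C', 'E', 'D'])
```

Counting hops picks the two slow links through B, while the bandwidth-based cost happily takes an extra hop to stay on the fast links.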
Anyways, this is getting a bit out of hand, but I feel that it will be necessary to delve into the other offerings of AWS.
Ok. Let's just start with AWS' blurb for VPCs:
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the Amazon Web Services (AWS) cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. You can use both IPv4 and IPv6 in your VPC for secure and easy access to resources and applications.
You can easily customize the network configuration for your Amazon Virtual Private Cloud. For example, you can create a public-facing subnet for your webservers that has access to the Internet, and place your backend systems such as databases or application servers in a private-facing subnet with no Internet access. You can leverage multiple layers of security, including security groups and network access control lists, to help control access to Amazon EC2 instances in each subnet.
One thing that I never went through in the networking overview was subnets, and that's essentially what a VPC is in Amazon's world. A subnet is a logical separation of networks, logical meaning software-defined, and networks meaning a range of hosts / IPs.
Amazon's example in the second paragraph is pretty useful as well. If we had an app where the webservers would be open to the public, and the databases wouldn't be, we would want some type of logic to say "open up public internet access to our webservers, but restrict the databases so they can only be accessed from an IP or a group of IPs". Are we going to write some type of code, mapping every single address to its permissions? Subnets provide a logical way of grouping hosts / IPs so we can implement rules on that entire subnet. Traditionally, IP routers were built for the purpose of being able to understand the network that they live in. A router not only has the context of who is connected to it, but it indirectly has context of the entire network and how far certain subnets are from it (either by hop count, bandwidth, or another more complex metric). To me, a virtual private cloud is not necessarily a subnet (a VPC can actually have subnets within it), but it is a logical separation of hosts, like a subnet.
I'm still trying to feel my way around all the different AWS components, but I think a VPC is the highest level of networking context your application would ever need. A VPC should encapsulate all your components, which would then be further split by security groups, which I will talk about later. Any small application seems like it would be encapsulated in a single VPC. If an application has complex network requirements with different access control list (ACL) configurations, we can break it up into multiple VPCs, but for this project, I don't think we'll need more than one.
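Here's a small sketch of that grouping idea using Python's `ipaddress` module. The 10.0.0.0/16 range, the two /24 subnets, and the host addresses are all hypothetical values I picked for illustration, not anything AWS assigns by default. The point is just that a rule can be attached to a whole range instead of to every individual address.

```python
import ipaddress

# Hypothetical VPC address range, carved into /24 subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))
public_subnet, private_subnet = subnets[0], subnets[1]

print(len(subnets))                   # 256
print(public_subnet, private_subnet)  # 10.0.0.0/24 10.0.1.0/24


def subnet_name(host: str) -> str:
    """Classify a host by which subnet its address falls in."""
    address = ipaddress.ip_address(host)
    if address in public_subnet:
        return "public"
    if address in private_subnet:
        return "private"
    return "outside"


print(subnet_name("10.0.0.15"))    # public  (say, a webserver)
print(subnet_name("10.0.1.20"))    # private (say, a database)
print(subnet_name("192.168.1.5"))  # outside
```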
Again, let's start with the AWS blurb for an Internet Gateway:
An Internet gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between instances in your VPC and the Internet. It therefore imposes no availability risks or bandwidth constraints on your network traffic.
An Internet gateway serves two purposes: to provide a target in your VPC route tables for Internet-routable traffic, and to perform network address translation (NAT) for instances that have been assigned public IPv4 addresses.
and for a Route Table:
A route table contains a set of rules, called routes, that are used to determine where network traffic is directed.
Each subnet in your VPC must be associated with a route table; the table controls the routing for the subnet. A subnet can only be associated with one route table at a time, but you can associate multiple subnets with the same route table.
For all intents and purposes, we'll need to configure these components so our VPC can actually have access to the internet. When I worked at an internet service provider, we essentially were the internet, or at least we were for Canada. There wasn't really a need for an "internet gateway" per se, but there were gateway routers that connected to other subnets, or other networks altogether (e.g. a gateway that would lead to AT&T's network, which would lead to the "rest of the internet" or parts of the internet that were in the USA).
When the ISP comes to our homes to install the necessary equipment for us to have internet access, they install a small router, something that usually looks like this:
I'm with TELUS, so mine looks exactly like that. This router serves as a "gateway" to the rest of the internet for me, and the router has an internal route table which has a default route. The router has context of every device that is connected to it... my laptop, my phone, my girlfriend's laptop, her phone... etc, but when my laptop wants to reach Netflix, it will ask TELUS' DNS server for Netflix's IP address, and that address is something my home's router won't know how to reach directly (it would be a crazy task to update every router in the world with every single address that exists). Whenever a router sees an address that it doesn't know how to route to directly, it will send it to its default route, and in the case of my house, the default route is the next router in the closest TELUS central office in my neighbourhood (where multiple houses / apartments physically aggregate to). The traffic then traverses through the network to higher level routers until it reaches the backbone network where traffic is now being routed across the country or across borders altogether.
This is an extreme simplification of a network, but you can see the "edge router" labelled there act as a gateway, and its default route is to send traffic to the backbone router that it's connected to.
In our AWS architecture, the internet gateway will basically be that piece that connects our VPC to the rest of the internet. When traffic is routed to our VPC from the public internet, those packets will be received by the internet gateway and the gateway will route directly to the correct EC2, database, or whatever component is the destination. When a component in our VPC wants to talk to the public internet, the traffic first gets routed to our internet gateway and our internet gateway will route it to some magic router somewhere that will carry our traffic where it needs to go. The whole point of our internet gateway is that we don't have to worry about what happens any further upstream past it to Amazon's core network. It'll hit some Amazon routers, some ISP's routers, and then our destination's routers.
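The route table plus default route idea can be sketched in a few lines. Everything named here is hypothetical (the 10.0.0.0/16 VPC range, the `igw-example` target, the `next_hop` helper), but the lookup rule is the standard one: among the routes that match a destination, the most specific one wins, and 0.0.0.0/0 matches everything so it acts as the default route.

```python
import ipaddress

# A hypothetical route table: traffic for the VPC's own range stays local,
# everything else is sent to the internet gateway.
ROUTE_TABLE = [
    (ipaddress.ip_network("10.0.0.0/16"), "local"),
    (ipaddress.ip_network("0.0.0.0/0"), "igw-example"),
]


def next_hop(destination: str) -> str:
    """Pick the matching route with the longest prefix (most specific match)."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, target) for net, target in ROUTE_TABLE if dest in net]
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target


print(next_hop("10.0.1.20"))     # local
print(next_hop("203.0.113.10"))  # igw-example
```

My home router works the same way in spirit: it knows the devices plugged into it, and anything else goes out the default route.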
Subnets, as I talked about a bit in the VPC section, are basically just a logical collection of IPs that we can manipulate as a logical group. There's not really anything else I want to say here other than exploring subnet notation. First, I want to talk a bit about IPv4 vs IPv6. When the internet was "created", I suppose haha, IPv4 addresses were created. IPv4 addresses take the form of XXX.XXX.XXX.XXX. Remember, my IP address was 172.20.10.2. Each section of the IP address is represented by 8 bits, so each section has 256 possible values and IPs range from 0.0.0.0 to 255.255.255.255. This is good for
$$256\times 256\times 256\times 256=2^{32}=4,294,967,296$$ or ~4.3B IP addresses. When they first designed IP addresses, the thought of billions of IP addresses was probably overwhelming. These were the days when PCs weren't even a household item. With 6B people on earth, not all of them owning a device, not all of them even having internet access, 4B IP addresses seemed like more than enough... they were probably thinking they wouldn't even need more than a few million!
Nowadays, it's not uncommon for a North American citizen to have 2 - 4 devices (laptop, PC, phone, tablet), and we're entering the world where devices don't even belong to people anymore. The whole internet of things movement says that I can have thermostats, humidity sensors, door sensors, smartwatches, and a whole wack of other sensors that report back to some centralized hub through the internet, and all of these devices need to be assigned an IP address! Thus far, in home internet setups like DSL, there exists a mechanism called Network Address Translation (NAT) where your home router can just assign your own devices IP addresses that only the router knows about (therefore it can re-use addresses that may also be used elsewhere in the world), and the router will do the translating itself between a device within its own Local Area Network (LAN) and the Wide Area Network (WAN), which is the network outside of your home LAN. The growing number of devices in the world is pushing IPv4 to start running out of addresses, even with mechanisms like NAT, because there are more and more households and products that connect to the internet these days.
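A toy version of what the home router is doing might look like the sketch below. This is a simplification under my own assumptions: the `ToyNat` class, the starting port of 40000, and all of the addresses are made up, and a real router tracks protocols, timeouts, and much more. The idea it shows is that several private devices can share one public address because the router remembers which public port belongs to which internal device.

```python
import itertools


class ToyNat:
    """Maps (private IP, private port) pairs to ports on one shared public IP."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def outbound(self, private_ip: str, private_port: int):
        """Translate an outgoing connection to its public (ip, port)."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            public_port = next(self._ports)
            self._outbound[key] = public_port
            self._inbound[public_port] = key
        return self.public_ip, self._outbound[key]

    def inbound(self, public_port: int):
        """Find which internal device a reply belongs to (None if unknown)."""
        return self._inbound.get(public_port)


nat = ToyNat("203.0.113.7")
print(nat.outbound("192.168.1.10", 51000))  # ('203.0.113.7', 40000)
print(nat.outbound("192.168.1.11", 51000))  # ('203.0.113.7', 40001)
print(nat.inbound(40001))                   # ('192.168.1.11', 51000)
print(nat.inbound(49999))                   # None
```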
In comes IPv6 addresses. IPv6 takes on the format of 2001:0db8:85a3:0000:0000:8a2e:0370:7334. Much longer: each address is 128 bits instead of 32, and ultimately, we get $2^{128}\approx 3.4\times 10^{38}$ addresses. I have no idea how sustainable this is because clearly the lesson learned from before was that we have no clue how fast progress will move going forward.
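The sizes of the two address spaces are easy to sanity check, since Python integers don't overflow. An IPv4 address is 32 bits (4 sections of 8 bits, so 256 possible values per section) and an IPv6 address is 128 bits:

```python
ipv4_total = 2 ** 32
ipv6_total = 2 ** 128

print(f"{ipv4_total:,}")        # 4,294,967,296
print(256 ** 4 == ipv4_total)   # True
print(f"{ipv6_total:.2e}")      # 3.40e+38

# How many IPv6 addresses exist for every single IPv4 address.
print(ipv6_total // ipv4_total == 2 ** 96)  # True
```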
IPv4 is still supported by the large majority of devices, and a subnet is defined as a range of IPs, so subnets use subnet masks to define this range of IPs.
Let's say we just wanted to define a single IP address. My IP of 172.20.10.2 can be represented by the IP 172.20.10.2 and the subnet mask 255.255.255.255 which can be summarized as a /32. The mask of 255.255.255.255 is saying that every single bit matters in this range.
Let's say we wanted to assign a subnet to the IP range 172.20.10.XXX, where XXX doesn't matter. Only the first 3 sections of the IP address matter. This would allow us to assign the last section as anything from 0 to 255 and they would all be in the same subnet (for a subnet of 256 addresses). This would be represented as the IP 172.20.10.0 and subnet mask 255.255.255.0, which can be summarized as a /24. The /24 comes from the fact that the mask can be written out as groups of bits. 255.255.255.0 becomes 11111111.11111111.11111111.00000000 expanded out, and here, clearly only 24 of the bits are being used in the mask.
As a last example, a subnet of only 2 addresses would have a mask of 11111111.11111111.11111111.11111110 (a /31), where the last bit is not being masked and therefore can be either a 0 or a 1, allowing for 2 addresses.
The table below summarizes how we can assign addresses from a single IP to a range of 4.3B IPs:
We'll see this notation a lot when assigning our IPs to our EC2s.
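If you'd rather not expand masks by hand, Python's `ipaddress` module will do it for you. The `describe` helper below is just something I wrote for illustration; it reports the dotted mask, the same mask in binary, and how many addresses the range covers for the examples from this section.

```python
import ipaddress


def describe(cidr: str) -> tuple[str, str, int]:
    """Return (dotted mask, binary mask, number of addresses) for a CIDR range."""
    net = ipaddress.ip_network(cidr)
    mask_bits = ".".join(f"{octet:08b}" for octet in net.netmask.packed)
    return str(net.netmask), mask_bits, net.num_addresses


for cidr in ["172.20.10.2/32", "172.20.10.2/31", "172.20.10.0/24", "0.0.0.0/0"]:
    mask, bits, count = describe(cidr)
    print(f"{cidr:<16} {mask:<16} {bits} {count:>13,}")

print(describe("172.20.10.0/24"))
# ('255.255.255.0', '11111111.11111111.11111111.00000000', 256)
```

The number after the slash is simply the count of 1 bits in the mask, so every bit you take away from it doubles the size of the range.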
AWS blurb:
A network access control list (ACL) is an optional layer of security for your VPC that acts as a firewall for controlling traffic in and out of one or more subnets. You might set up network ACLs with rules similar to your security groups in order to add an additional layer of security to your VPC.
We're getting a bit into security here, and we have to familiarize ourselves with ports really quickly. Ports are defined by the transport protocols that run on top of IP (TCP and UDP) as an additional detail alongside the destination IP address. Ports live in the transport layer and identify an application on a machine, whereas the IP address itself resides in the network layer and identifies the machine.
For example, on our laptop, maybe we're browsing the web via Chrome while pulling information from a MySQL database. Chrome is generating packets to send to reddit while MySQL is connected to a remote database somewhere. How does the computer know whether the next packet that comes in should be given to Chrome or our MySQL client? It does this via ports. Web servers generally listen on port 80 (or 443 for HTTPS) while MySQL servers listen on port 3306. Traffic is always addressed to an IP plus a port, so if my machine at 172.20.10.2 were hosting both, web traffic would arrive at 172.20.10.2:80 and MySQL traffic at 172.20.10.2:3306. On the client side, Chrome and the MySQL client each get their own temporary port so replies find their way back to the right program. In fact, my jupyter notebook here runs on port 8888.
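As a rough mental model, you can picture the operating system holding a table of which program is listening on which port, and handing each incoming packet to whoever owns the destination port. The sketch below is purely illustrative: the `LISTENERS` table and both helper functions are made up, and a real OS does this inside its networking stack rather than with a dictionary.

```python
# Hypothetical table of which program is listening on which port.
LISTENERS = {
    80: "web server",
    3306: "MySQL server",
    8888: "Jupyter notebook",
}


def split_endpoint(endpoint: str) -> tuple[str, int]:
    """Split 'ip:port' into its IP and port parts."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def deliver(endpoint: str) -> str:
    """Decide which program should receive traffic sent to this endpoint."""
    _, port = split_endpoint(endpoint)
    return LISTENERS.get(port, "nobody listening")


print(split_endpoint("172.20.10.2:3306"))  # ('172.20.10.2', 3306)
print(deliver("172.20.10.2:80"))           # web server
print(deliver("172.20.10.2:8888"))         # Jupyter notebook
print(deliver("172.20.10.2:22"))           # nobody listening
```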
An Access Control List (ACL), then, defines firewall-level rules which grant or deny permission from different subnets / IPs / ports to other subnets / IPs / ports. We could be writing ACL rules to govern
We can think of a variety of situations where we'd want to get down to the granular details. Maybe one group of hosts are admin users, and they should have access to all subnets within our VPC. Maybe one group of hosts are database administrators and they should have access to the subnet containing all the database clusters, but only on port 3306. It really depends on our use case, and it would absolutely come in handy when we need to manage users or restrict access to private data.
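The admin and database administrator examples above could be sketched as a tiny rule evaluator. The subnets, rule numbers, and the `Rule` / `evaluate` names are hypothetical; what the sketch borrows from AWS network ACLs is the evaluation style, where rules are checked in order of their rule number, the first match decides, and anything that matches nothing is denied.

```python
import ipaddress
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    number: int          # lower numbers are evaluated first
    source: str          # CIDR range the rule applies to
    port: Optional[int]  # None means "any port"
    action: str          # "allow" or "deny"


# Hypothetical subnets: 10.0.9.0/24 for admins, 10.0.8.0/24 for DBAs.
RULES = [
    Rule(100, "10.0.9.0/24", None, "allow"),  # admins: everything
    Rule(200, "10.0.8.0/24", 3306, "allow"),  # DBAs: MySQL only
]


def evaluate(rules, source_ip: str, port: int) -> str:
    """Return the action of the first matching rule, or 'deny' if none match."""
    src = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_range = src in ipaddress.ip_network(rule.source)
        port_ok = rule.port is None or rule.port == port
        if in_range and port_ok:
            return rule.action
    return "deny"


print(evaluate(RULES, "10.0.9.5", 22))    # allow (admin, any port)
print(evaluate(RULES, "10.0.8.5", 3306))  # allow (DBA on the MySQL port)
print(evaluate(RULES, "10.0.8.5", 22))    # deny  (DBA on another port)
print(evaluate(RULES, "10.0.7.5", 3306))  # deny  (unknown subnet)
```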
I don't anticipate we'd need too many (if any at all) ACL rules for our current projects, but perhaps as I do more and more projects, I'd want to isolate their architectures.
Security Groups are a concept that's abstracted in AWS. Security groups don't traditionally exist in a network architecture, however AWS has introduced not only the ability to spin up virtual machines and traditional network components, but they also provide many managed, out-of-the-box services as well. One example is their Relational Database Service (RDS), which provides you a managed relational database (e.g. MySQL or PostgreSQL). How is this different from spinning up an EC2 and installing MySQL on there yourself? Well, from an end-user perspective, perhaps nothing! You're still able to connect to whatever database is spun up and query it in the same way. From an admin perspective, it could be a world of difference. You spinning up an EC2 and installing MySQL means, well first of all, that you actually have to go through the process of installing it. You also then have to worry about the operations of the database from the ground up. Assigning storage space, backing up your database, clearing transactional logs... all the annoying stuff that you as an end user may not want to deal with necessarily. Using Amazon's RDS service, we have a console where we can control how frequently our database backs up, decide on whether we want to auto-scale our compute and storage resources to dynamically upgrade / downgrade on demand, migrate databases, and a whole wack-load of other features you'll have to explore yourself. AWS even goes as far as spinning up an entire Hadoop cluster, consisting of multiple EC2s all working together, through a simple and easy to use web interface.
So why am I going through all these tangents? Because Security Groups extend the ACL concept to cover not only hosts and subnets, but these managed services as well! AWS' Security Groups blurb:
A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. When you launch an instance in a VPC, you can assign up to five security groups to the instance. Security groups act at the instance level, not the subnet level. Therefore, each instance in a subnet in your VPC could be assigned to a different set of security groups. If you don't specify a particular group at launch time, the instance is automatically assigned to the default security group for the VPC.
As AWS states, the Security Group operates at an instance level, and while EC2 instances can be summarized by an IP address, an entire Hadoop cluster or database cluster may not be able to. In this way, Security Groups are a bit more flexible to the abstracted services that AWS provides in addition to EC2s.
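One way to picture that flexibility: a security group rule can name another security group as its source instead of an IP range, so "the database accepts MySQL traffic from anything in the webserver group" works no matter which addresses the webservers end up with. The sketch below is my own simplified model with hypothetical instance names, group names, and addresses. It only covers inbound allow rules (security groups have no deny rules; anything not allowed is simply dropped).

```python
import ipaddress

# Hypothetical instances and the security groups attached to each.
INSTANCES = {
    "web-1": {"ip": "10.0.0.15", "groups": {"sg-web"}},
    "db-1": {"ip": "10.0.1.20", "groups": {"sg-db"}},
}

# Inbound allow rules per group. A source is either a CIDR range
# or the name of another security group.
INBOUND = {
    "sg-web": [{"port": 80, "source": "0.0.0.0/0"}],
    "sg-db": [{"port": 3306, "source": "sg-web"}],
}


def allowed(source_ip: str, source_groups: set, target: str, port: int) -> bool:
    """True if any rule on any of the target's groups allows this traffic."""
    for group in INSTANCES[target]["groups"]:
        for rule in INBOUND.get(group, []):
            if rule["port"] != port:
                continue
            source = rule["source"]
            if source.startswith("sg-"):
                if source in source_groups:
                    return True
            elif ipaddress.ip_address(source_ip) in ipaddress.ip_network(source):
                return True
    return False


# Anyone on the internet can reach the webserver on port 80...
print(allowed("198.51.100.4", set(), "web-1", 80))       # True
# ...but not the database.
print(allowed("198.51.100.4", set(), "db-1", 3306))      # False
# The webserver can reach the database because it belongs to sg-web.
print(allowed("10.0.0.15", {"sg-web"}, "db-1", 3306))    # True
```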
I've already mentioned EC2s a couple of times, but I realized I never actually went through what an EC2 actually is haha. It's a virtual machine. Linux or Windows. Various combinations of CPU / RAM / additional components (e.g. GPU). Easy as pie.
Those are some of AWS' major components. That's enough to at least get an EC2 up and running, which is basically all I want to do for this exercise.