I have to start with this again lol... Like wtf is even going on there. Just an amazing graphic that everyone should be using in professional presentations. Cloud... man... I don't even know where to start... How can I speak to cloud in enough detail to apply to what I'm doing without leaving anything important out? It's such a big topic, I mean, to be completely thorough, we'd realistically have to go from the transistor level up to the clock level, up to the physical machine level, up to the OS level, up to the virtualization level... and I've absolutely left out a million steps along the way... And that's JUST for computing power... I'm not even touching RAM or storage at this point...
I think I will keep this cloud computing discussion at a processing level at this point because that's really the only scope that I need to worry about when running a NN of this size... RAM and storage would surely become a concern if I were to use much larger data sets, but this MNIST data set (and my Chi / Lars face detection data set) fits easily on even my MacBook.
I'll have to preface this entire section on cloud with the fact that I'm not formally educated in the area of cloud. I do not come from a CS or Comp Eng background, and I never came from a background of building my own computers as a child... etc. Maybe I'm not suited to be making this post, but I will need to sort out some details in my head for my own good anyways, so let's just see where this goes.
Let's just start by looking at my own computer...
My MacBook Pro has a 2.53 GHz Intel Core 2 Duo. Okay, what the hell does this mean lol. Well, each processor has the concept of a clock. The clock speed is the frequency at which the processor can safely process one instruction before being ready to process the next. This is probably the first time I'm ever referencing anything I learned in 1st or 2nd year University, but at the end of the day, a computer is thinking in 0's and 1's. At a physical level, these 0's and 1's are generally represented by 0V and +5V respectively.
Let's say we wanted to send the following message "10101010" to another party. The "signal" would look something like this:
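Here's a quick Python sketch of that idealized waveform (matplotlib is just my choice here, and the 0V / +5V levels are the same toy values as above):

```python
import matplotlib.pyplot as plt

# The message "10101010", with 0 -> 0V and 1 -> +5V, one bit per clock period
bits = [int(b) for b in "10101010"]
voltages = [5 if b else 0 for b in bits]

plt.step(range(len(voltages)), voltages, where="post")
plt.ylim(-1, 6)
plt.xlabel("clock period")
plt.ylabel("voltage (V)")
plt.title("Idealized square wave for '10101010'")
plt.show()
```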
The pins in the processor would be fluctuating between 0 and 1. The frequency at which the processor can actually switch between 0 and 1 is the clock speed. My MacBook can basically perform 2.53 billion switches of this type per second! What is the limiting factor here? Why that number?
Well, when we switch from 0 to 1, or from 1 to 0, it's not an instantaneous switch. To go from 0V to +5V in a physical copper wire, you basically need to inject a current into the wire. A current is nothing but electrons running through the copper, so it literally takes time for the electrons to conduct through the wire. If we were to look more in depth at what happens when we switch from 0 to 1, it's something like this:
The green indicates what our desired signal waveform looks like, and the red indicates how long it actually takes to reach that desired voltage level. This is called propagation delay, and it's one of the primary reasons there is a limit on computational power. You could imagine that if you increased the clock speed, you'd be taking voltage measurements before the voltage had quite reached 5V (maybe at the time you take the measurement, the pin is only showing 2V or something like that), so the computer can't quite recognize that the input you wanted at that point was a "1". If the pin starts at 0 and we transmit with a super high clock speed such that the pin never gets past 2V, then every single bit we're trying to send would be read as 0V or 2V, both of which get interpreted as 0, leaving us with a message string purely of 0's! Obviously, 10101010 is not the same message as 00000000.
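Just to make that concrete for myself, here's a toy model in Python (totally my own simplification, pretending the pin charges like a simple exponential rise toward +5V with a made-up 0.5 ns time constant) showing what voltage you'd actually sample at different clock speeds:

```python
import numpy as np

V_DD = 5.0      # the "1" voltage level we're aiming for
TAU = 0.5e-9    # made-up rise time constant of 0.5 ns

def sampled_voltage(clock_hz):
    """Voltage the pin reaches by the end of one clock period, starting from 0V."""
    period = 1.0 / clock_hz
    return V_DD * (1 - np.exp(-period / TAU))

for clock in [1e9, 2.53e9, 10e9]:
    print(f"{clock / 1e9:5.2f} GHz clock -> pin reads {sampled_voltage(clock):.2f} V")
```

The faster the clock, the lower the voltage you catch mid-rise, which is exactly the "2V instead of 5V" problem above.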
Okay, so that was quite the tangent, but I think a necessary dark place in my life I had to revisit lol. With a single processor, however, there exists (today) an absolute limit on how fast the clock speed can possibly be. How do you go faster than that? The answer lies in parallel processing. Processors nowadays have multiple cores, which are individual processors that work together / separately to achieve a certain task. If we're trying to add together a trillion numbers, perhaps one core can take the first half trillion numbers and the second core can take the second half trillion, then when they both have their aggregated results, one of them can add those two aggregates together to get the final sum. We would have taken roughly half the time to compute that because the two cores were working in parallel. This is absolutely analogous to, let's say, two construction workers working on different parts of a building. Two construction workers would get the job done in half the time, and 10 would get the job done in 1/10th of the time. Simple math. There is additional overhead that makes the exact math a bit more complex, but for all intents and purposes, we as people who are worrying about the end-goal of face detection don't really have to get into that much detail because someone else smarter than us has figured that out already! Lucky us!
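A toy version of that "two workers splitting the sum" idea in Python (the numbers and the multiprocessing approach are purely my own illustration):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Each 'construction worker' (core) sums its own slice of the numbers."""
    return sum(chunk)

if __name__ == "__main__":
    numbers = list(range(10000000))                  # stand-in for the trillion numbers
    halves = [numbers[:5000000], numbers[5000000:]]  # split the work in two

    with Pool(processes=2) as pool:                  # two cores working in parallel
        partials = pool.map(partial_sum, halves)

    print(sum(partials))                             # one core combines the two aggregates
```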
This can be abstracted to an even higher level, where instead of multiple cores in a processor, we can literally use multiple computers altogether. This is where we'd bring in the concept of computer networking, where one computer (a set of CPU, RAM, and HDD) can interface with another computer because, again, a single computer can only have so much CPU / RAM / HDD before physical limitations come into play. Working at a telecommunications company definitely helped me understand more of the inner workings here, but for the scope of this post, we don't quite have to worry about WAN networking right now (we might have to when we talk about AWS).
A product like Hadoop is a suite of computational and networking configurations that brings a cluster of computers / servers together in this way to store and process common data. Hadoop has its own filesystem, HDFS (Hadoop Distributed File System), which stores files in a very particular way. Let's say you have a cluster of 5 servers that you want to store and process data on: one of them will act as the master node and the rest will act as slave nodes. Through LAN networking, the master node will coordinate activities with all the slave nodes. The master node is somewhat of a construction site manager, if we want to think about it that way, giving commands to the slave nodes, the construction workers who are doing the actual work themselves. The master node decides who stores which sets of data, and who processes which sets of data.
On the storage side, let's say we have 100TB of storage across the 4 slave nodes and we want to store a 1GB file on HDFS. That 1GB file would actually be distributed across the 4 slave nodes' HDDs. HDFS stores files in a way that replicates parts of the file on multiple nodes; the first 100MB of the file may be stored on 3 of the 4 nodes, for example. In this way, HDFS is able to build in redundancy (if part of a file gets lost on one node, it can be recovered by referencing the other nodes), but more importantly, we are nicely set up so that multiple nodes can start processing various parts of the file at once!
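Just to make my mental model concrete, here's a toy sketch in Python of "split the file into blocks and replicate each block on 3 of the 4 slave nodes" (this is absolutely not how HDFS is implemented internally, just the gist of it):

```python
import random

BLOCK_SIZE_MB = 100                                   # toy block size from the example above
REPLICATION = 3                                       # each block lives on 3 nodes
NODES = ["slave1", "slave2", "slave3", "slave4"]

def place_blocks(file_size_mb):
    """Assign each block of the file to REPLICATION of the slave nodes."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)    # ceiling division
    return {block_id: random.sample(NODES, REPLICATION) for block_id in range(num_blocks)}

# A 1GB (~1000MB) file ends up as 10 blocks, each stored on 3 of the 4 nodes
for block_id, nodes in place_blocks(1000).items():
    print(f"block {block_id}: {nodes}")
```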
I don't think we'll be using Hadoop here (maybe in future projects), but how Hadoop works is a great illustration of parallel processing in general! Let's take a look at virtualization next.
Virtualization is another concept that we have to cover before we can talk about cloud. Virtualization allows us to share a single server's (or a cluster of servers') resources between multiple operating systems. VMWare and Oracle VirtualBox are the classic personal and commercial virtualization platforms, but virtualization has now gone way further with cloud providers and Docker. Anyways, let's back up first.
The easiest way to think about virtualization (at least for me, someone who is not an expert on virtualization haha) is to think about a real life application. Growing up, I had a PC with Windows (in University, I had a lime green Sony Vaio with Windows XP on it... baller). When I had that laptop, I needed to access linux on it here and there... I don't even remember why anymore because I never really did any software back then haha... but what I would have to do was re-partition my hard drive to have a linux partition and a Windows partition. Doing that was often a process in itself: if you already had your entire hard drive partitioned for Windows (which is how it comes when you buy the laptop), you had to shrink the Windows partition, and if anything went wrong you basically had to reinstall both OSes from scratch.
Then came VMWare, where you could "virtualize" an OS. VMWare was an application that ran within your Windows environment, but was able to tap into the computer's hardware resources and dedicate a set amount of them to the OS it was virtualizing. You then had Windows regulating the environment that linux was running in. The Windows application was saying "this linux OS can only use X% of the computer's CPU, Y% of the computer's RAM, and we'll dedicate Z% of the HDD to that OS". No need to repartition, no need to reinstall OSes, we could just load up an image in VMWare and be off within a few seconds! Running a more extensive application? No problem, just dedicate more CPU and RAM to that virtualized instance! Need more resources back for your Windows instance? No problem, just save the image and spin up another virtualized environment with fewer resources dedicated to it! It made switching between OSes much easier, and even allowed the Windows and linux environments to talk with each other by transferring files, sharing ports... etc.
In the previous example, I had one computer / server. Just my own laptop. When I worked with an internet service provider, we had datacenters full of servers which we were able to cluster into a large pool of resources. At that point, we were able to spin up "virtual" operating systems which could garner more CPU, RAM, and HDD than a single affordable server could provide. Not only this, but the pool of resources could be used in a much more efficient way! If 3 users used up 95% of the cluster's resources, that last 5% could be given to someone who didn't need much compute power. In reality, a dedicated machine for that 5% use case likely would never have existed, because it wouldn't be worth it to actually manufacture dedicated hardware for such a weak use case. This setup generally required an admin of some sort who controlled the VMWare console where the pool of resources could be distributed. It also required a local instance of the VMWare environment installed on-premise in your datacenter, and your team would have to be trained in the VMWare interface. At the time, this was pretty crazy because resources had never been used this efficiently before. Not to mention, if you messed up an image, you could just turn down your environment and spin up a new one from your last save point. There were also way fewer hardware concerns, because you could have a dedicated hardware team in the datacenter that dealt with faulty hardware. We haven't even touched what it takes to operate a datacenter... you need real estate managers, power planners, vendor managers, network engineers... not to mention janitorial staff, security, administrative staff, entire management hierarchies to operate at this scale... The end-users would never have to worry about, or even touch, any of this other than the thin client they were using to log into the virtual machines.
Now... let's abstract this one level higher. Cloud providers like AWS have taken this to another level. Now, instead of enterprises needing their own VMWare environment with admin teams who were specialists in VMWare, AWS provides a web splash screen where you can just... well... request a virtual machine! You choose which region you want the machine in, give them your credit card, and you're off.
To spin up a virtual machine, we pick an instance type and... well... yeah I already went over the credit card thing didn't I haha...
On the AWS instance type page, we see the "accelerated computing" section. This will give us enough processing power to run our NN within a reasonable time frame (well... hopefully). The P-series (general purpose GPU compute) will allow us to leverage GPU computing to work even faster!
I don't know that much about CPU vs GPU computing, but this is what I do know. CPUs and GPUs perform different functions. CPUs are more general purpose, built for a variety of compute tasks (playing a song, making a powerpoint, browsing the web), whereas GPUs perform one specific function and are designed to do that one function very, very well... render graphics! It turns out that the mathematical calculations a GPU has to run can be done largely in parallel, so a GPU consists of many, many cores that don't have particularly high clock speeds, whereas a CPU in a personal laptop generally has 2 or 4 cores but much higher clock speeds.
It turns out that a NN can leverage GPU parallel processing to perform all the calculations that it needs to perform! All the linear algebra and matrix manipulations lend themselves well to GPUs.
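For example, with a framework like TensorFlow (this is the 1.x-era API, and I'm assuming a GPU-enabled install on the EC2 instance; none of this is specific to my NN code), pushing a big matrix multiplication onto the GPU is basically just a device placement away:

```python
import tensorflow as tf

# Two big random matrices; a matrix multiply is exactly the kind of
# massively parallel arithmetic a GPU is built for
with tf.device("/gpu:0"):                 # swap to "/cpu:0" if no GPU is available
    a = tf.random_normal([4096, 4096])
    b = tf.random_normal([4096, 4096])
    c = tf.matmul(a, b)

with tf.Session() as sess:
    result = sess.run(c)
    print(result.shape)                   # (4096, 4096)
```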
AWS' P-series provides multiple options for how many GPUs we want in our machine:
What does this all mean in terms of speed gains? I really have no clue right now! I think I'll just have to dive into it and learn by trial by fire. Since the p2.xlarge will be the cheapest, I'll probably start with that and just see how it goes. I can always ramp up in resources if need be.
By using AWS, I'm now crossing into the realm of actually paying money to run some of these models...
The age old cloud question... Am I really going to go off and build the p2.xlarge system that AWS provides above, myself? I could... but I feel like I would really deviate from my main objective, which is to build a frickin face detection model. This is the beauty of cloud. With one click of a button, I can have something that is managed for me, at a lower up-front cost than if I had built the machine myself. Obviously, you can get into an entire discussion around how much you're going to use it and what you're going to use it for, and at some point, if you're beating the crap out of the virtual machine, it may have been more economical to build it yourself. But the other factor is just the lead time in getting something like this off the ground. If my end goal is to build, train, and test a model, then prioritizing the path of least resistance to that end-goal will lead to basically giving AWS my credit card.
At roughly a dollar an hour, it really is difficult to justify going out and potentially spending thousands on building a machine. What if you go out and build this machine and realize that you ordered the wrong part? More money and time. What if you use the machine for a year and a part gets outdated? More money and time. On the AWS side, the solution to both of these problems (well, you can't really order a wrong part, but I guess you can spin up the wrong EC2 instance type) is to click a button to turn down your environment, and click another button to bring up a new one. Both money and time are drastically reduced. As long as... yup... Amazon has your credit card.
Ok, sorry, let me get down to some math here. AWS charges for EC2 instances by the hour. The P-series is broken down like this:
So the p2.xlarge I'm looking into will cost me roughly a dollar an hour. BUT... AWS isn't finished blowing our minds yet... it's got something called "spot pricing". Spot pricing works essentially like a stock market. It's based on supply and demand. Remember how we talked about clustering servers to become an abstracted "pool" of resources, and how that guy at my old job could take up the last 5% of the pooled resources? Spot pricing works in a similar fashion. Because AWS' datacenters can be seen as simply a pool of resources, you could imagine that there is a fraction of those resources that goes unused at any time. At my old job, our network capacity was always planned to never be utilized above 80%. Whenever 80% was reached, we knew we had to increase the capacity because we had to plan ahead for growth. Let's say only 50% of AWS' resources are being used at any point... what does AWS do? It basically auctions off that last 50% (or, if they mandate that they never go above 80% of their capacity, the last 30%) to the highest bidder. The kicker with spot pricing is that if somebody outbids you, you lose that EC2 instance almost immediately, with just a couple minutes' notice at best. So you have to pick your poison: do you want a dedicated instance at a set price? Or do you want to roll the dice and get an instance at a fraction of the price but risk losing it at any time?
It's a fun decision, isn't it? Especially when AWS claims that the savings can be up to 90%. At the time of writing this (July 2017), a p2.xlarge seems to be going for ~$0.17 per hour on the spot market, which is indeed more than an 80% discount off the on-demand price.
If I prepare my jupyter notebook beforehand with pre-populated code, I can just bring up the EC2 instance to train my model and be on my way, getting away with spending only $0.17 for an hour of training... or, even if it takes me, let's say, 5 hours, just shy of a dollar! Not too shabby at all!
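Quick sanity check on that math (the rates are just the rough July 2017 numbers from above, so treat them as placeholders):

```python
ON_DEMAND_RATE = 0.90   # ~$/hour for a p2.xlarge on-demand (rough number)
SPOT_RATE = 0.17        # ~$/hour spot price observed at time of writing

for hours in [1, 5]:
    print(f"{hours} hour(s) of training: on-demand ~${ON_DEMAND_RATE * hours:.2f}, "
          f"spot ~${SPOT_RATE * hours:.2f}")
```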
That's probably all I've got for cloud right now. I'm going to continue exploring AWS in the next post, and hopefully within the next 100 or so posts, I can actually start training my model ;).