Introduction to Course Project

Introduction

  • The goal of this project is to walk through the building blocks of modern distributed computing infrastructure

  • The best way of learning about distributed and cluster computing systems is to build one yourself

  • We will use Clemson University's research computing infrastructure as our inspiration, including its HPC component (Palmetto), its big data analytics component (Cypress), and other peripheral storage components.

Overview

The project will be divided into five interdependent assignments:

  • Assignment 3: Design and deploy a cluster of computers (S1)
  • Assignment 4: Augment the S1 cluster with shared file systems (S2)
  • Assignment 5: Design and deploy a Hadoop cluster using the Hortonworks Data Platform (HDP) (S3)
  • Assignment 6: Integrate S2 and S3 to enable mixed workflows (S4)
  • Assignment 7: Augment S4 with a scheduling system (the scheduler only needs to manage component S1)

General Expectations

While each assignment will have its own requirements, they all share the following expectations:

  • All assignments will be deployed on CloudLab inside an OpenStack cloud computing environment
  • The deployment process should be completely automated via GitHub repositories and Python scripting
  • The operating system for all components is Oracle Linux 7.4
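
As a rough illustration of what "automated via GitHub repositories and Python scripting" could look like, the sketch below builds an ordered list of provisioning commands for a node and prints them in dry-run mode. Everything specific here is an assumption made for illustration: the repository URL, the role names, the script paths, and the working directory are hypothetical placeholders, not names prescribed by the assignments.

```python
import shlex
import subprocess

# Hypothetical placeholders: replace with your team's repository and scripts.
REPO_URL = "https://github.com/example-team/cluster-project.git"
ROLE_SCRIPTS = {
    "head": "setup/head.sh",        # assumed script for the head node
    "compute": "setup/compute.sh",  # assumed script for compute nodes
}


def build_plan(role, repo_url=REPO_URL, workdir="/tmp/cluster-project"):
    """Return the ordered commands that would provision a node of this role."""
    if role not in ROLE_SCRIPTS:
        raise ValueError(f"unknown role: {role}")
    return [
        # Fetch the team's repository onto the node.
        ["git", "clone", "--depth", "1", repo_url, workdir],
        # Run the role-specific setup script from the cloned repository.
        ["bash", f"{workdir}/{ROLE_SCRIPTS[role]}"],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: shows the commands without touching the system.
    run_plan(build_plan("head"))
```

Separating plan construction from execution keeps the deployment steps easy to review before they run on a real node.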

Team Formation

  • All remaining assignments are team assignments (mandatory)
  • Each team can have a maximum of three members
  • Each team needs to maintain a public GitHub repository containing their CloudLab profile for the assignments. This repository will be reused and updated as we progress through the assignments

Resources

Additional Work for Graduate Students

Select one of the following topics and write an 8-10 page literature survey on it (first come, first served):

  • Libraries and frameworks for distributed and parallel file systems
  • Libraries and frameworks for schedulers in computer clusters
  • Libraries and frameworks for distributed stream processing systems
  • Libraries and frameworks for message-oriented middleware systems