Data science is one of the most in-demand skills in the world today. Harvard Business Review recently published an article titled "Data Scientist: The Sexiest Job of the 21st Century". This class will prepare you well to become a data scientist.
Data science involves the application of scientific methodologies to extract understanding from, and make predictions based on, data sets from a broad range of sources. It draws on knowledge and skills from three areas: 1) computer science, 2) math/statistics and 3) domain-specific expertise.
The objective of this course is to teach the analytical mindset & programming skills relevant to data science. Students will learn the Python programming language, along with a set of tools for data science in Python, including the Jupyter (IPython) Notebook, NumPy, Pandas, matplotlib, PySpark, and Tableau data visualization. Students will learn skills that cover the various phases of exploratory data analysis: importing data (web, JSON, CSV, relational data), cleaning and transforming data, algorithmic thinking, grouping and aggregation, visualization, time series, statistical modeling/prediction and communication of results. The course will utilize data from a wide range of sources and will culminate in a final project and presentation.
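To give a feel for the first few phases listed above (importing CSV data, cleaning and transforming it, then grouping and aggregating), here is a minimal sketch. It uses only Python's standard library so that it runs without any extra installs; in the course these steps are covered with tools such as Pandas. The data, the column names and the helper functions `load_rows`, `clean_rows` and `mean_by_city` are all made up for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw data standing in for a CSV file. Note the missing
# temperature and the inconsistent capitalization, both of which need cleaning.
RAW_CSV = """city,temperature
Boston,71.5
boston,68.0
Chicago,
Chicago,75.0
"""


def load_rows(text):
    """Import: parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Clean and transform: drop rows with a missing temperature,
    normalize city names, and convert temperatures to floats."""
    cleaned = []
    for row in rows:
        value = row["temperature"].strip()
        if not value:
            continue  # skip rows with no measurement
        cleaned.append(
            {"city": row["city"].strip().title(), "temperature": float(value)}
        )
    return cleaned


def mean_by_city(rows):
    """Group and aggregate: average temperature per city."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row["temperature"])
    return {city: mean(values) for city, values in groups.items()}


summary = mean_by_city(clean_rows(load_rows(RAW_CSV)))
for city, avg in sorted(summary.items()):
    print(f"{city}: {avg:.2f}")
```

Keeping each phase in its own small function makes it easy to inspect the intermediate result after every step, which is much of what exploratory data analysis consists of.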
As one of the fastest-growing fields today, data science has new tools arriving faster than publishers can create books. To supplement the traditional readings on data science, we will be sharing state-of-the-art tools and environments used by the best practitioners today, including matplotlib and Tableau for data visualization, and Spark and PySpark for big data analytics. We aim to equip you with the latest industry thinking & toolsets to become a productive data scientist.
In this course you will learn:
We will be using a number of resources throughout the class.
The following resources are optional, but highly recommended:
Your grade for this course will be determined as follows:
All work submitted for grading must be the original product of the signatory individual or team members. Unless explicitly stated otherwise, sharing of work between teams is not allowed. You should not make use of materials connected to previous offerings of this course or related courses at other institutions. When referencing the words or ideas of others is appropriate, you should make proper citation, but that will not excuse the lack of individual contribution. When in doubt, check with me.
Confirmed violations will result in a failing grade for the entire course, a notation in the student's academic record, and possibly suspension or even expulsion from the program. The penalty will be non-negotiable, and will certainly not allow any recourse actions (e.g., a "do-over" or makeup assignment). For group work, all members will share the consequence of any tainted submissions.
Specifics of the official LSB Graduate Student Honor Code and the protocol for handling academic integrity violations are published at:
This course will involve extensive programming and computer work, both in and out of class. You are required to bring a laptop to each class and have the following software installed by the first class meeting:
Enthought Canopy Python Distribution: a scientific-oriented Python distribution from Enthought. This includes EPDFree, a free base scientific distribution (with NumPy, SciPy, matplotlib, Chaco, and Jupyter) and EPD Full, a comprehensive suite of more than 100 scientific packages across many domains. EPD Full is free for academic use but has an annual subscription for non-academic users.
A text editor for your platform:
In addition to the core Python language, we will cover some or all of the following open source packages for data science in Python.
Core:
Visualization:
Modeling and prediction:
Development:
Web related technologies:
We will be meeting every Saturday from July 16th through August 27th, 2016. Please note that we are NOT meeting in person on July 09 or September 03.
Here is our latest guide to the topics in the class -- subject to changes based on feedback throughout the class: