Python for Data Science

University: UCSanDiegoX
Platform: edX.org
Instructors: Ilkay Altintas and Leo Porter
Student: Jayme Anchante
(Start: 13:00 2017-07-13)

Introduction and Course Information

Course Staff:

  • Instructors:

Leo Porter: Assistant Teaching Professor in the Computer Science and Engineering Department at UC San Diego

Ilkay Altintas: Chief Data Science Officer at the San Diego Supercomputer Center (SDSC)

  • Teaching Assistants:

Alok Singh, Computational Data Science Research Specialist, San Diego Supercomputer Center
Andrea Zonca, HPC Applications Specialist, San Diego Supercomputer Center

To receive a MicroMasters Certificate in Data Science you must receive verified certificates in the following courses:

Python for Data Science
Statistics and Probability in Data Science using Python
Machine Learning for Data Science
Big Data Analytics using Spark


Course Outline

The course is broken into 10 weeks. The beginning of the course is heavily focused on learning the basic tools of data science, but we firmly believe that you learn the most about data science by doing data science. So the latter half of the course is a combination of working on large projects and introductions to advanced data analysis techniques.

Week 1 - Introduction:  Welcome and overview of the course.  Introduction to the data science process and the value of learning data science.
Week 2 - Background:  In this optional week, we provide a brief background in Python or Unix to get you up and running.  If you are already familiar with Python and/or Unix, feel free to skip this content.
Week 3 - Jupyter and Numpy:  Jupyter notebooks are one of the most commonly used tools in data science as they allow you to combine your research notes with the code for the analysis.  After getting started in Jupyter, we'll learn how to use numpy for data analysis.  numpy offers many useful functions for processing data as well as data structures which are time and space efficient.
Week 4 - Pandas:  Pandas, built on top of numpy,  adds data frames which offer critical data analysis functionality and features.
Week 5 - Visualization:  When working with large datasets, you often need to visualize your data to gain a better understanding of it. Also, when you reach conclusions about the data, you'll often wish to use visualizations to present your results.
Week 6 - Mini Project:  With the tools of Jupyter notebooks, numpy, pandas, and Visualization, you're ready to do sophisticated analysis on your own.  You'll pick a dataset we've worked with already and perform an analysis for this first project.
Week 7 - Machine Learning:  To take your data analysis skills one step further, we'll introduce you to the basics of machine learning and how to use scikit-learn - a powerful library for machine learning.
Week 8 - Working with Text and Databases:  You'll find yourself often working with text data or data from databases.  This week will give you the skills to access that data.  For text data, we'll also give you a preview of how to analyze text data using ideas from the field of Natural Language Processing and how to apply those ideas using the Natural Language Processing Toolkit (NLTK) library.
Week 9 and 10 - Final Project:  These weeks let you showcase all your new skills in an end-to-end data analysis project.  You'll pick the dataset, do the data munging, ask the research questions, visualize the data, draw conclusions, and present your results.

(16:00 2017-07-13)

WEEK 1: Getting started with Data Science

1.1. Data Science: Generating Value from Data

1.1.1. Data Science: Getting Value out of Data

(10:50 2017-07-14)

From Big Data we extract insights, and from insights we derive actions! We make predictions from data to direct actions. The recent surge of attention to Data Science comes from two main facts: i) a massive increase in data availability (internet, smartphones, GPS etc.); ii) an increase in computer processing capacity.

How much data is big data? In one minute on Facebook, 200k photos are uploaded and 1.8M likes are given; on YouTube, there are 2.78M video views and 72 hours of video are uploaded. According to EMC, in 2009 the 'digital world' held 0.8 ZB, and it is projected to grow to 35.2 ZB by 2020. "We are drowning in information and starving for knowledge" - John Naisbitt.

Modern Data Science Skills:

  • Programming in Python
  • Statistics
  • Machine Learning
  • Scalable Big Data Analysis

1.1.2. Why Python for Data Science

Data Science lies at the intersection of mathematics expertise, technology (hacking skills) and business (acumen). Are data scientists unicorns? Data science is a team sport. Data scientists: have a passion for data, relate problems to analytics, care about engineering solutions, exhibit curiosity and communicate with teammates.

Why Python for Data Science? It is easy to read and learn, has a vibrant community, offers a growing and evolving set of libraries, is applicable to each step in the data science process, and powers Jupyter notebooks (reproducible, repeatable, transparent).

1.1.3. Case Study: Soccer Data Analysis

A Kaggle database of 25k European soccer matches and 10k players from 2008 to 2016. Our goals are: i) form meaningful player groups; ii) discover other players that are similar to your favorite athlete; and iii) form strong teams by using analytics.

Data Collection: Databases (relational and non-relational), Text files (.csv, .txt, .xls), Live feeds (sensors, twitter, weather).
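
A minimal sketch of loading data from these kinds of sources with pandas (the file names, connection and table name below are hypothetical):

import pandas as pd
import sqlite3

df_csv = pd.read_csv('players.csv')                 # text file
df_xls = pd.read_excel('matches.xls')               # spreadsheet
conn = sqlite3.connect('database.sqlite')           # relational database
df_sql = pd.read_sql('SELECT * FROM Player', conn)  # run a query against it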

df.describe().transpose()  # summary statistics for each numeric column (one row per column)

Data cleaning: why (missing values, garbage values, NULLs); how do we clean (remove entries, or impute those entries with a substitute value).

df.isnull().any().any(), df.shape  # any missing values at all? how many rows and columns?
df = df.dropna()                   # drop rows that contain missing values
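
Dropping rows throws away information; as an alternative, missing values can be imputed. A sketch, assuming a hypothetical numeric column 'weight':

# impute gaps with the column median instead of dropping the rows
df['weight'] = df['weight'].fillna(df['weight'].median())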

Data visualization
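
A quick sketch with matplotlib, assuming hypothetical numeric columns 'overall_rating' and 'penalties' in the players DataFrame:

import matplotlib.pyplot as plt

df['overall_rating'].plot.hist(bins=30)             # distribution of one attribute
plt.show()
df.plot.scatter(x='overall_rating', y='penalties')  # relationship between two attributes
plt.show()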

Analysis and modeling: supervised, unsupervised and semi-supervised learning. Feature selection.
Using KMeans:

from sklearn.cluster import KMeans
...
# random_state is assumed to be defined in the elided code above
y = KMeans(n_clusters=3, random_state=random_state).fit_predict(X)  # assign each row of X to one of 3 clusters
...
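
After fitting, the grouping can be inspected; a sketch, assuming X is the cleaned feature matrix from the elided steps above:

import numpy as np

km = KMeans(n_clusters=3, random_state=random_state).fit(X)
print(km.cluster_centers_)      # the average attribute profile of each group
print(np.bincount(km.labels_))  # how many players landed in each group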

Present your findings and suggest actions to coaches.

1.2. The Data Science Process

1.2.1. How Does Data Science Happen

Key dimensions of Data Science: it is a multidisciplinary craft that combines an interdisciplinary team with an application purpose. It starts with a team of people with a question (and of course some data to explore), and we come up with a data-driven process to answer it (first a conceptual one that defines the core set of steps to solve the question). There are mainly two processes: Data Engineering (acquire and prepare) and Computational Data Science (analyze, report and act).

1.2.2. Asking the Right Question

Define the problem -> Assess the situation -> Define goals

1.2.3. Steps in Data Science

Acquire -> Prepare -> Analyze -> Report -> Act

Acquire: identify datasets, retrieve data, query data
Prepare: explore, pre-process (clean, integrate, package)
Analyze: select analytical techniques, build models
Report: interpret, summarize, visualize and post-process
Act: determining actions

It is an iterative process!

1.2.4. Step 1: Acquiring Data

Traditional databases: SQL and query browsers
Text files and spreadsheets
Scripting languages: Python, Ruby, R, Octave, MATLAB, Perl, PHP, JavaScript
Webpages: WWW and W3C. Common formats are XML and JSON. Many websites host services that provide programmatic access to their data; there are several types of web services, the most popular being REST (Representational State Transfer), an approach designed for performance, scalability and maintainability. WebSocket services are also becoming popular. NoSQL storage systems are increasingly used to manage a variety of data types; they hold data not represented in table format (with columns and rows). Some examples are Cassandra, MongoDB and HBase.
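
A minimal sketch of retrieving data from a REST service and parsing its JSON response (the URL and query parameters here are hypothetical):

import requests

resp = requests.get('https://api.example.com/matches', params={'season': '2015/2016'})
resp.raise_for_status()  # fail loudly on HTTP errors
data = resp.json()       # parse the JSON body into Python lists and dicts
print(len(data), 'records retrieved')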

1.2.5. Step 2A: Exploring Data

You might be tempted to build a model right away, but it is better to explore the data first, looking at correlations, general trends and outliers. Summary statistics: mean, median, mode, range etc. Visualize it: heat maps, histograms, boxplots, line graphs, scatter plots etc.
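
These explorations are one-liners in pandas; a sketch on the soccer DataFrame from earlier ('overall_rating' is a hypothetical column name):

df.describe()                        # mean, quartiles, min/max (range) per numeric column
df.median(numeric_only=True)         # medians, less sensitive to outliers than the mean
df.corr(numeric_only=True)           # pairwise correlations between numeric columns
df.boxplot(column='overall_rating')  # spot outliers in one attribute visually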

1.2.6. Step 2B: Pre-processing Data

Transform the data to make it ready for analysis. First we clean the data, then we transform it. Real-world data is messy: inconsistent values, duplicate records, missing values, invalid data and outliers. Pre-processing includes: scaling (re-scale variables to range from 0 to 1, for example), transformation (aggregation, e.g. from daily to monthly), feature selection (remove redundant features, combine or add features), dimensionality reduction (one technique is Principal Component Analysis - PCA) and data manipulation (grouping observations, for example).
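
A sketch of two of these steps with scikit-learn, assuming X is a numeric feature matrix: re-scale every feature to the [0, 1] range, then project onto two principal components.

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(X)               # each feature now ranges from 0 to 1
X_reduced = PCA(n_components=2).fit_transform(X_scaled)  # keep the 2 directions of largest variance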

1.2.7. Step 3: Analyze Data

Categories of Analysis Techniques: Classification (predict the category), Regression (predict a numeric value), Clustering (organize similar items into groups), Association Analysis (find rules to capture associations between items) and Graph Analysis (use graph structures to find connections between entities).

Select technique -> Build model -> Validate model
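
The select -> build -> validate loop in a minimal scikit-learn sketch (a classification example; the feature matrix X and labels y are assumed to exist):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # hold out data for validation
model = DecisionTreeClassifier().fit(X_train, y_train)                     # build the model
print(model.score(X_test, y_test))                                         # validate: accuracy on unseen data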

1.2.8. Reporting Insights

Better to communicate a clear yet inconclusive result than a clear but incorrect story. Visualization is an important part of the report.

1.2.9. Turning Results into Actions

Monitor and measure the actions taken, and evaluate them.

1.2.10. Conclusion

Over the next 9 weeks we will do data science with Python!

1.3. Week 1: Assessment