Introduction to Data Science

What is a Data Scientist?

Data Scientists are people with some mix of coding and statistical skills who work on making data useful in various ways.

Type A Data Scientist: The A is for Analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren't taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.

Type B Data Scientist: The B is for Building. Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data "in production." They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).

We'll mainly be focusing on Type A in this course.

What is a Data Scientist (continued)?

Source: Tweet | Josh Wills - Data Scientist and Apache Crunch committer

Josh is also known for pithy data science quotes, such as: “I turn data into awesome”.

What is a Data Scientist (continued)?

Hadley Wickham's advice on becoming a Data Scientist

Statistical knowledge

I think you need some knowledge of specific statistical/machine learning techniques, but a deep theoretical understanding is not that important. You need to understand the strengths and weaknesses of each technique, but you don't need a deep theoretical understanding. The vast majority of data science problems can be solved by a creative assembly of off-the-shelf techniques, and don't require new theory.

Programming Skills

You need to be fluent with either R or Python. There are other options, but none of them have the community that R and Python have, which means you'll need to spend a lot of time reinventing tools that already exist elsewhere. Obviously, I prefer R, and unlike what some people claim, it is a well-founded programming language that is well tailored for its domain.

Domain knowledge

This obviously depends on the domain, but as a data scientist you should be able to contribute meaningfully to any project, even if you're not intimately familiar with the specifics. I think this means you should be generally well read (e.g. at the level of New Scientist for the sciences) and an able communicator. A good data scientist will help the real domain experts refine and frame their questions in a helpful way. Unfortunately I don't know of any good resources for learning how to ask questions.

Data Science Workflow

Source: General Assembly's Data Science 2.0 Curriculum