The scientific process:
"Turning data into conclusions" broadly refers to the bold items.
This means that:
We will focus on how to think about data analysis as much as how to execute it.
It's perfectly fine to use this class to advance your current research project. (Or not.)
We have some guidelines for keeping interactions positive and productive. Read them over here. In short:
All of our course materials are hosted on the public web by GitHub.
https://github.com/KIPAC/StatisticalMethods
Let's have a quick look there.
Note that the website displays a file listing (including subfolders) and a README file for the current folder.
Note the helpful "Information" links in the top-level README.
The easiest way to find lecture notes, tutorials, etc. is through the schedule document.
Clicking the links here will display various notebooks in a non-interactive setting in your browser.
To edit and run code in the notebooks, which we will do in the tutorials, you need to (optionally) fork the repository and (definitely) clone it to your own machine. See the Getting Started doc.
Homework and final projects will be distributed here, but your solutions will be submitted to a separate, private repository visible only to those enrolled in the class. See Information for Stanford Students.
For those not familiar with Git and GitHub, our first tutorial (next class) will step through this process.
We will make announcements through the issues of the private repo, so it's critical that you get added as a collaborator and Watch the repo.
We also have a Slack team, which is a potentially useful way of communicating with your classmates and/or us.
If you haven't already, get yourself invited to these things by submitting this form (Stanford authentication).
There isn't a textbook for this class, although we try to list useful resources for extra reading in each lesson.
The course GitHub docs list some useful references, a couple of which are free downloads.
If you're interested in having one physical reference on your shelf, we recommend Bayesian Data Analysis by Gelman et al. (though note that it is not aimed at physicists).
Questions on anything so far?
With your neighbor, take a few minutes to discuss your previous experience with data analysis and statistics.
We'll re-convene in a few minutes to get an idea of the class's collective previous experience.
Hopefully, we determined that astrophysical data come in a variety of forms, and that we're interested in several broad classes of hypotheses.
| | Types of data | | Things to learn about |
|---|---|---|---|
| 1 | Imaging | a | Individual objects |
| 2 | Spectroscopy | b | Populations |
| 3 | Time series | c | The Universe |
| 4 | Catalogs | d | Fundamental physics |
| ... | ... | ... | ... |
As for statistics, it would be a cliché to now instruct you to forget everything you know... or think you know. Do so anyway (for the moment).
We'll be starting from first principles.
i) Data are subject to randomness (noise, incomplete measurements, etc.)
ii) Any conclusions we draw must therefore incorporate uncertainty
This means we should describe both the data and conclusions in the language of mathematical probability.
Our conclusion will take the form: the probability that something is true in light of (given) the data we collected.
$p(\mathrm{thing}|\mathrm{data})$
By the basic laws of probability, this can be written
$p(\mathrm{thing}|\mathrm{data}) = \frac{p(\mathrm{data}|\mathrm{thing}) p(\mathrm{thing})}{p(\mathrm{data})}$
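As a toy illustration (not from the course materials; all numbers below are invented), here is what this formula looks like for a simple yes/no "thing", e.g. whether a flagged detection corresponds to a real source:

```python
# Toy Bayes' theorem calculation with made-up numbers.
# "thing" = the source is real; "data" = our pipeline flagged a detection.
p_thing = 0.01                 # prior: p(thing)
p_data_given_thing = 0.95      # likelihood: p(data|thing)
p_data_given_not_thing = 0.05  # false-positive rate: p(data|not thing)

# Evidence: p(data), summing over both possibilities for "thing"
p_data = p_data_given_thing * p_thing + p_data_given_not_thing * (1.0 - p_thing)

# Posterior: p(thing|data)
p_thing_given_data = p_data_given_thing * p_thing / p_data
print(f"p(thing|data) = {p_thing_given_data:.3f}")  # ~0.161
```

Note how the (invented) prior pulls the posterior well below the 95% one might naively quote from the likelihood alone.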
We'll unpack this much more later, but importantly it means that
iii) There is a correct answer
Just like in physics, the theory tells us the solution. The challenge is in evaluating it.
Within this framework,
iv) Data are constants
Even though they are generated randomly by the Universe, data that we have already collected are fixed numbers.
Much of our job boils down to building a model that predicts (probabilistically) what data we might have gotten.
v) Things we don't know with perfect precision can be mathematically described as "random"
That is, we use probabilities to model things that are uncertain, even if they are not truly random.
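To make the last two points concrete, here is a minimal sketch (an invented example, not from the course materials) of a model that predicts data probabilistically: the already-collected datum is a fixed number, but the model assigns it a probability, and the same model can generate mock data we might have gotten instead.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(42)

# Hypothetical model: a source with flux `flux` (a made-up parameter) observed
# for `exposure` seconds predicts an expected number of photon counts.
flux = 3.2        # counts per second
exposure = 10.0   # seconds
expected_counts = flux * exposure

# The datum we actually collected is a fixed number...
observed_counts = 29  # invented for illustration

# ...but the model assigns it a probability (here, Poisson):
print(poisson.pmf(observed_counts, expected_counts))

# The same model can generate mock data we *might* have gotten instead:
print(rng.poisson(expected_counts, size=5))
```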
Again,
In the coming classes, we will cover