Statistical Methods in Astrophysics


Physics 366


Instructors: Adam Mantz & Phil Marshall

About this course

Goals:

  • Explain what the course is and how it works
  • Provide a brief overview of what's coming

Who we are

  • Adam Mantz, Staff Scientist, Stanford/KIPAC
  • Phil Marshall, Staff Scientist, SLAC/KIPAC

What the course is

In a nutshell, this class is about how data are turned into conclusions.

The examples and problems are taken from astrophysics, but otherwise the content is extremely general.

The scientific process:

  • Propose observations
  • Collect and "reduce" data
  • Explore and summarize the data
  • Hypothesize and test
  • Interpret, conclude, speculate
  • Report

"Turning data into conclusions" broadly refers to the bold items.

Our goals (for you)

  • Develop familiarity with various types of (astronomical) data
  • Understand the key role of modeling in data analysis
  • Be able to critically evaluate and apply commonly used statistical inference methodologies and advanced statistical reasoning to problems you are likely to encounter in your research.

This means that:

  • We will focus on how to think about data analysis as much as how to execute it.

  • It's perfectly fine to use this class to advance your current research project. (Or not.)

What we'll do in class

  • lecture
  • active learning exercises
  • tutorials

Exercises primarily involve thinking and scribbling.

Tutorials are longer problems involving computer work.

Both will be done in pairs/groups.

What you'll do outside of class

  • Homework assignments
  • Final project

Homeworks will be set weekly for the first 7-8 weeks.

Projects will involve a short presentation in the final week of class and a report due at the end of finals week.

Both can be done in pairs/groups.

Grading

  1. Class participation: 33%
  2. Homework assignments: 33%
  3. Final project: 33%
  4. Random coin flip: 1%

Exhortations

  • Be in class
  • Do the work

You will not learn or retain nearly as much otherwise.

(Exceptions can be made for necessary travel etc. Let us know ahead of time.)

Classroom climate

We have some guidelines for keeping interactions positive and productive. Read them over here. In short:

  • Please don't use electronic devices during lecture.
  • Questions are good.
  • Be nice.

How it works

All of our course materials are hosted on the public web by GitHub.

https://github.com/KIPAC/StatisticalMethods

Let's have a quick look there.

Note that the website displays a file listing (including subfolders) and a README file for the current folder.

Note the helpful "Information" links in the top-level README.

The easiest way to find lecture notes, tutorials, etc. is through the schedule document.

Clicking the links here will display various notebooks in a non-interactive setting in your browser.

To edit and run code in the notebooks, which we will do in the tutorials, you need to (optionally) fork the repository and (definitely) clone it to your own machine. See the Getting Started doc.

Homework and final projects will be distributed here, but your solutions will be submitted to a separate, private repository visible only to those enrolled in the class. See Information for Stanford Students.

For those not familiar with Git and GitHub, our first tutorial (next class) will step through this process.

Communication

We will make announcements through the issues of the private repo, so it's critical that you get added as a collaborator and Watch the repo.

We also have a Slack team, which is a potentially useful way of communicating with your classmates and/or us.

If you haven't already, get yourself invited to these things by submitting this form (Stanford authentication).

Auditing

Auditors are welcome.

Expectations in class are the same as for students. If you're not traveling or otherwise unavoidably engaged, you will be here and you will participate. No freeloaders.

Books

  • There isn't a textbook for this class, although we try to list useful things for extra reading in each lesson.

  • The course GitHub docs list some useful references, a couple of which are free downloads.

  • If you're interested in having one physical reference on your shelf, we recommend Bayesian Data Analysis by Gelman et al. (though note that it is not aimed at physicists).

Questions on anything so far?

Exercise: What do you already know?

With your neighbor,

  1. Briefly introduce yourselves and say what your research project/interests are.
  2. Jot down some notes encapsulating "what you already know" about statistical data analysis (purpose, methodology, ...). This is intentionally a vague question; most exercises later on will be more concrete.

We'll re-convene in a few minutes to get an idea of the class's collective previous experience.

Hopefully, we determined that astrophysical data come in a variety of forms, and that we're interested in several broad classes of hypotheses.

       Types of data           Things to learn about
  1    Imaging            a    Individual objects
  2    Spectroscopy       b    Populations
  3    Time series        c    The Universe
  4    Catalog            d    Fundamental physics
  ...  ...                ...  ...

As for statistics, it would be cliche to now instruct you to forget everything you know... or think you know. Do so anyway (for the moment).

We'll be starting from first principles.

Some key ideas

i) All data we collect include some degree of randomness

ii) Any conclusions we draw must therefore incorporate uncertainty

This means we should describe both the data and conclusions in the language of mathematical probability.

Our conclusion will take the form: the probability that something is true in light of (given) the data we collected.

$p(\mathrm{thing}|\mathrm{data})$

By the basic laws of probability, this can be written

$p(\mathrm{thing}|\mathrm{data}) = \frac{p(\mathrm{data}|\mathrm{thing}) p(\mathrm{thing})}{p(\mathrm{data})}$

We'll unpack this much more later, but importantly it means that

iii) There is a correct answer

Just like in physics, the theory tells us the solution. The challenge is in evaluating it.
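As a minimal sketch of what "evaluating it" can look like, suppose (purely for illustration) that there are only two competing "things", each with a prior probability and a probability of producing the observed data. The labels and numbers below are invented for this example and do not come from the course materials.

```python
# Hypothetical illustration of Bayes' theorem for two mutually exclusive
# hypotheses. All labels and numbers are invented for this example.

def posterior(prior, likelihood):
    """Return p(thing|data) for each thing, given p(thing) and p(data|thing)."""
    # p(data) is the sum over things of p(data|thing) * p(thing)
    evidence = sum(likelihood[t] * prior[t] for t in prior)
    return {t: likelihood[t] * prior[t] / evidence for t in prior}

prior = {"star": 0.9, "galaxy": 0.1}        # p(thing), assumed
likelihood = {"star": 0.2, "galaxy": 0.8}   # p(data|thing), assumed

post = posterior(prior, likelihood)
for thing, p in post.items():
    print(thing, p)
```

With only two options the sum in the denominator is trivial. In realistic problems, computing p(data) is often the hard part, which is one sense in which the challenge lies in the evaluation.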

Within this framework,

iv) Data are constants

Even though they are generated randomly by the Universe, data that we have already collected are fixed numbers.

Much of our job boils down to building a model that predicts (probabilistically) what data we might have gotten.
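One way to picture this is a toy model in which each measurement is a true flux plus Gaussian noise of known size. The sketch below assumes that model; the function names, the noise level, and the "collected" values are all hypothetical. It shows the same model used in two ways: to generate data we might have gotten, and to score fixed, already-collected data against different values of the unknown flux.

```python
import math
import random

def simulate_data(true_flux, sigma, n, rng):
    """Draw n mock measurements: potential data predicted by the model."""
    return [rng.gauss(true_flux, sigma) for _ in range(n)]

def log_likelihood(data, flux, sigma):
    """ln p(data|flux) for independent Gaussian measurement errors."""
    return sum(
        -0.5 * ((x - flux) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
        for x in data
    )

rng = random.Random(366)  # fixed seed so the example is repeatable
sigma = 1.0               # assumed known measurement uncertainty

# Before observing: the model describes data we might get (random).
mock = simulate_data(true_flux=10.0, sigma=sigma, n=5, rng=rng)
print("mock data:", mock)

# After observing: the collected data are fixed numbers (invented here).
data = [9.1, 10.4, 10.9, 9.7, 10.2]
for flux in (8.0, 10.0, 12.0):
    print(flux, log_likelihood(data, flux, sigma))
```

Note that `data` never changes inside the loop; only the candidate value of the unknown flux does.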

v) Things we don't know with perfect precision can be mathematically described as "random"

That is, we use probabilities to model things that are uncertain, even if they are not truly random.
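For instance, the fraction of objects in a population that belong to some class is a single fixed number, yet we can still assign it a probability distribution that encodes what we know. The sketch below is a hypothetical illustration: it assumes a uniform prior, a binomial likelihood, and invented counts, and evaluates the posterior on a simple grid.

```python
# Hypothetical example: f is a fixed but unknown fraction. Our uncertainty
# about it is described by a probability distribution on a grid of values.

n, k = 20, 6  # invented data: k of n observed objects belong to the class

grid = [(i + 0.5) / 1000 for i in range(1000)]        # candidate values of f
prior = [1.0 for _ in grid]                           # assumed uniform prior
like = [f**k * (1.0 - f) ** (n - k) for f in grid]    # binomial likelihood, up to a constant

unnorm = [l * p for l, p in zip(like, prior)]
norm = sum(unnorm)
post = [u / norm for u in unnorm]                     # posterior probability per grid cell

mean_f = sum(f * p for f, p in zip(grid, post))
print("posterior mean of f:", mean_f)
```

For a uniform prior the posterior mean is known analytically to be (k+1)/(n+2), which gives a quick check on the grid calculation.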

Again,

  • There is a correct answer
  • Unknowns, including potential data and our conclusions, are (mathematically) random
  • Collected data are constants

In the coming classes, we will cover

  1. Fundamentals of probability theory (a brief review)
  2. Bayesian analysis, and how it relates to other approaches
  3. Algorithmic tools used for Bayesian inference
  4. Advanced modeling strategies
  5. Ways to evaluate and compare models
  6. Other topics, as time permits