Lecture 1: Introduction and probability review

Stat 200: Introduction to Statistical Inference, Autumn 2018/19
Lecture 1: Introduction and probability review
Lecturer: Art B. Owen, September 25

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They are meant as a memory aid for students who took Stat 200 at Stanford University. They may be distributed outside this class only with the permission of the instructor. Also, Stanford University holds the copyright.

Abstract: These notes are mnemonics about what was covered in class. They don't replace being present or reading the book. Reading ahead in the book is very effective.

1.1 Introduction

Welcome to Stat 200, Introduction to Statistical Inference. The prerequisite is Stat 116 (probability). After that you can follow up with any of a great many MS-level statistics courses. Our text is Rice (2007). There is a course syllabus on the web, along with lots of other details.

Statistics began as the measure of the state. Rulers needed to know what they ruled. One of the oldest data sets is a Babylonian tablet of agricultural productivity. There's an image in Stigler (2016). (That book is a fun read, but you are not expected to read it for this course.) Over time statistics has oscillated between the problems of too much data (gotta summarize) and too little data (gotta squeeze out all the info we can). At present both take place.

Statistics is for learning about the world from data. We might get data like the hypothetical example below.

               Cured   Not cured
    drug         100          40
    placebo       80          60

If those 280 people were randomly assigned to either the drug or the placebo, then it looks like it was better to get the drug. However, we probably care more about the millions of other people who might get that drug (or not) in the future. So, does the result here generalize further? We have two problems. The data we saw are a sample, not the whole population we care about. Also, the data we saw could have been different. We handle both of those issues via probability.
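As a quick illustration (not part of the lecture), the sketch below computes the cure rate in each arm of the hypothetical table and a rough normal-approximation standard error for their difference. The variable names are ad hoc, and the standard error formula is the usual two-proportion approximation rather than anything from the notes.

from math import sqrt

# Hypothetical table from the notes: counts of cured / not cured in each arm.
cured = {"drug": 100, "placebo": 80}
not_cured = {"drug": 40, "placebo": 60}

n_drug = cured["drug"] + not_cured["drug"]            # 140 people on the drug
n_placebo = cured["placebo"] + not_cured["placebo"]   # 140 people on placebo

p_drug = cured["drug"] / n_drug                       # about 0.714
p_placebo = cured["placebo"] / n_placebo              # about 0.571

# Rough normal-approximation standard error for the difference in proportions.
se = sqrt(p_drug * (1 - p_drug) / n_drug + p_placebo * (1 - p_placebo) / n_placebo)
print(f"difference = {p_drug - p_placebo:.3f}, approximate SE = {se:.3f}")

The difference is roughly 2.5 of those standard errors above zero, which is why the drug arm "looks" better here; making that kind of statement precise is what the course builds up to.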

Probability is a branch of mathematics built upon other branches of mathematics, such as the theory of functions. Applied statistics involves the use of statistical methods to solve real problems in biology, engineering, education, economics, commerce and so on. Theoretical statistics sits in between probability and applied statistics, studying ideas that apply broadly. I believe that statistics is philosophy that we can back up with mathematics. Sometimes we must use computers to do that mathematics. The key issues we face are those of deciding what to do under uncertainty, deciding what exactly we mean by uncertainty in a given problem, choosing models, and judging to what extent we know what we think we know. Sometimes the mathematical challenges are about lengthy calculations, and very rarely do they involve finding that one great trick. Almost always it is about which way of formulating the problem makes the most sense.

I mentioned the Knudson hypothesis, that cancer is the result of damage to DNA. In a one-hit model the damage arrives at a time T ≥ 0 following a distribution with probability density function (PDF)

    f_1(t) = θ e^{-θt},    t ≥ 0.

This is an exponential distribution. We will see more of it later. It describes the time to some unpredictable bolt from the blue.

In a two-hit model the damage arrives at time T = max(T_1, T_2), where the T_j are independent exponential random variables with parameters θ_j. Then

    Pr(T ≤ t) = Pr(T_1 ≤ t, T_2 ≤ t)                 (because of the max)
              = Pr(T_1 ≤ t) Pr(T_2 ≤ t)              (independence)
              = (1 - e^{-θ_1 t})(1 - e^{-θ_2 t})     (plug in).

Then the two-hit PDF is

    f_2(t) = d/dt Pr(T ≤ t) = θ_1 e^{-θ_1 t} + θ_2 e^{-θ_2 t} - (θ_1 + θ_2) e^{-(θ_1 + θ_2) t}

after some calculus. (A quick simulation check of this derivation appears after the questions below.)

Statistical questions include:

1. Do our data look more like one hit or two hit?
2. Even if we knew it was one hit, we would not usually know θ. So how would we estimate θ from data?
3. Surely there's more than one way to do that. Which of the ones we come up with is best?
4. Could there be a better estimate than the one we got, or is it the best possible?
5. How accurate is our estimate of θ?
6. How would we estimate (θ_1, θ_2) for two hit?
7. If we are planning this study, how much data do we need to get?
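The following sketch is not part of the notes; it is a minimal simulation check of the two-hit derivation, assuming numpy is available. The parameter values θ_1 = 1 and θ_2 = 2 are arbitrary choices for illustration, and the density f_2 could be checked the same way from a histogram.

import numpy as np

rng = np.random.default_rng(0)
theta1, theta2 = 1.0, 2.0
n = 200_000

t1 = rng.exponential(scale=1.0 / theta1, size=n)   # T_1 ~ Exp(theta1)
t2 = rng.exponential(scale=1.0 / theta2, size=n)   # T_2 ~ Exp(theta2)
t = np.maximum(t1, t2)                             # two-hit time T = max(T_1, T_2)

for s in (0.5, 1.0, 2.0, 4.0):
    empirical = (t <= s).mean()                                      # Monte Carlo Pr(T <= s)
    derived = (1 - np.exp(-theta1 * s)) * (1 - np.exp(-theta2 * s))  # CDF from the notes
    print(f"t = {s:3.1f}   empirical {empirical:.4f}   derived {derived:.4f}")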

1.2 Probability review

We began to review ideas from probability. These should already be familiar, but a refresher is in order. Also, not everybody had the same probability course. We will do statistics from Chapter 8 onward of Rice, and Chapters 1-6 of Rice are very much tuned to that. (Chapter 7 is on sampling finite populations; we will skip that. Stat 204 covers it.)

We begin with a sample space Ω. This is a set of possible outcomes. The actual outcome is ω ∈ Ω. An event is a subset A ⊆ Ω. We say that A occurs if and only if ω ∈ A. At this point, think back to examples you learned about dice or coins being tossed. For a single coin, Ω = {H, T} and A = {H} is the event that the coin came up heads.

Then probability is a function P on subsets of Ω. (A technical point that we will not use in this course: sometimes Ω is complicated enough that only certain subsets can be assigned a probability.) Probability is a measure of how likely an event is to happen. What we mean by "likely" is context dependent. It can be the strength of our opinion about event A. It can come from enumerating equally likely outcomes, counting how many are in A and dividing by the number in Ω, when Ω has a finite number of elements. The coin tossing, dice tossing and card sampling events you have studied are of this type. Or it could be the long run frequency with which A occurs in a conceptually infinite set of replications. We write it as Pr(A). Whatever it means, the following rules apply:

1) Pr(Ω) = 1,
2) if A ⊆ Ω then Pr(A) ≥ 0, and
3) if A_i for i = 1, 2, ..., I satisfy A_i ∩ A_j = ∅ whenever i ≠ j, then Pr(A_1 ∪ ... ∪ A_I) = Pr(A_1) + ... + Pr(A_I).

Rule 3 also holds for infinitely many sets A_i. From these rules we get Pr(A) ≤ 1, Pr(A^c) = 1 - Pr(A), Pr(∅) = 0 and Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B). Probability is a measure, just like length, mass, area, volume and so on (except those don't have to satisfy rule 1). This is why Venn diagrams work so well for probability.

A critically important concept is conditional probability. If Pr(B) > 0 then we say that the probability of event A happening given that event B has happened is

    Pr(A | B) = Pr(A ∩ B) / Pr(B).

We just say "the probability of A given B". If we make a new function of events called q(·), with q(·) = Pr(· | B), meaning that q(A) = Pr(A | B) for any A, then q is also a probability. It obeys the rules we have laid out above for Pr. We can even form q(· | C) for some other event C with q(C) > 0.

Events A and B are independent if Pr(A ∩ B) = Pr(A) Pr(B). In applications, independence is a powerful assumption that greatly simplifies things. It is easy to show that independent events A and B satisfy Pr(A | B) = Pr(A | B^c) = Pr(A), though we need Pr(B) > 0 for the first one to be well defined and Pr(B) < 1 for the second one.
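These rules are easy to verify in a small equally-likely-outcomes example. The sketch below is not from the lecture; it uses one roll of a fair six-sided die, with events A and B chosen arbitrarily for illustration, to check inclusion-exclusion, the definition of conditional probability, and independence.

from fractions import Fraction

omega = set(range(1, 7))   # sample space for one roll of a fair die
A = {2, 4, 6}              # event: the roll is even
B = {1, 2, 3, 4}           # event: the roll is at most 4

def pr(event):
    """Equally likely outcomes: count outcomes in the event and divide by |Omega|."""
    return Fraction(len(event & omega), len(omega))

# Inclusion-exclusion: Pr(A u B) = Pr(A) + Pr(B) - Pr(A n B).
# Note: in the code, "|" and "&" are Python set union and intersection.
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)

# Conditional probability: Pr(A given B) = Pr(A n B) / Pr(B).
pr_A_given_B = pr(A & B) / pr(B)
print(pr(A), pr(B), pr_A_given_B)   # 1/2 2/3 1/2

# Here Pr(A given B) equals Pr(A), and indeed A and B are independent:
assert pr(A & B) == pr(A) * pr(B)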

1.3 Random variables

A random variable X is defined as a function on Ω. Then X(ω) is a real value. The event X ∈ A is the same as {ω : X(ω) ∈ A}. We usually don't write X(ω) because that gets cumbersome, but the rules of probability for random variables come from their representation as functions on Ω.

The cumulative distribution function (CDF) of a random variable X is F(x) = Pr(X ≤ x), actually Pr({ω ∈ Ω : X(ω) ≤ x}), but that gets cumbersome. Here X is the random variable and x is one particular value it might take. So F(x) is a function on the real line. It satisfies lim_{x→∞} F(x) = 1 and lim_{x→-∞} F(x) = 0, and F(x) ≥ F(x') whenever x ≥ x' (monotonicity).

Some random variables are discrete, taking only values x ∈ {x_1, x_2, ...}, a set which may be finite or infinite. If X is discrete with Pr(X = x_i) = p(x_i) > 0, then F(x) has a jump of size p(x_i) at x_i and is continuous from the right. It may resemble a staircase when graphed.

Other random variables are continuous. They have a probability density function (PDF) f(x). For such a random variable and a < b,

    Pr(a < X < b) = ∫_a^b f(x) dx.

An important example for us is the normal distribution with mean µ and variance σ^2 > 0. It has PDF

    f(x) = (1 / (√(2π) σ)) e^{-(1/2)((x - µ)/σ)^2},    -∞ < x < ∞.

The standard normal distribution has µ = 0 and σ = 1.

A PDF must satisfy ∫_{-∞}^{∞} f(x) dx = 1. If we knew that we wanted a PDF proportional to e^{-x^2/2}, that is f(x) = c e^{-x^2/2} for some constant c > 0, then we would have had to solve an integral to realize that the constant of proportionality is c = 1/√(2π). Very often we do the reverse. Knowing the form of the normal PDF allows us to know that ∫_{-∞}^{∞} e^{-x^2/2} dx = √(2π).

If you replaced the standard normal PDF by

    f(x) = (1/√(2π)) e^{-x^2/2} for x ≠ 11,    and    f(11) = 42,

then the resulting distribution would be unchanged. For instance, F(x) would not change and neither would Pr(a < X < b) for any a < b. You could even change the PDF f at any finite or countable number of points without changing the distribution. We won't be doing that. Also, for a continuous random variable,

    Pr(a < X < b) = Pr(a < X ≤ b) = Pr(a ≤ X < b) = Pr(a ≤ X ≤ b).

If X is a random variable and Y = g(X), then Y is also a random variable. The distribution of Y can in principle be found from that of X, and in special cases the math simplifies enough that we can do it explicitly. For instance, let X ~ U(0, 1) and Y = -log(X). Then X has the uniform distribution on the interval (0, 1) with PDF

    f_X(x) = 0 for x < 0,    1 for 0 ≤ x ≤ 1,    0 for x > 1,

and CDF

    F_X(x) = 0 for x < 0,    x for 0 ≤ x ≤ 1,    1 for x > 1.

We want the PDF of Y, which we call f_Y to keep it separate from f_X. First we find F_Y in slow motion, keeping an eye out for X vs x vs Y vs y. Because 0 < X < 1 we have 0 < Y < ∞. So for y > 0,

    F_Y(y) = Pr(Y ≤ y) = Pr(-log(X) ≤ y) = Pr(log(X) ≥ -y) = Pr(X ≥ e^{-y})
           = 1 - Pr(X < e^{-y}) = 1 - Pr(X ≤ e^{-y}) = 1 - F_X(e^{-y}) = 1 - e^{-y}.

Next,

    f_Y(y) = d/dy F_Y(y) = e^{-y}.

More precisely,

    f_Y(y) = 0 for y ≤ 0,    and    f_Y(y) = e^{-y} for y > 0.

We certainly don't want to have f_Y(-3) = e^{3}, so handling cases is important for a complete description of a PDF. We could have ended up with f_Y(0) = e^{-0} = 1 instead of f_Y(0) = 0. It would be the same distribution.

If g is an increasing and invertible function then

    F_Y(y) = Pr(Y ≤ y) = Pr(g(X) ≤ y) = Pr(X ≤ g^{-1}(y)) = F_X(g^{-1}(y)).

We also reviewed expectation of random variables, but that will be folded into the next scribing.
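The -log(X) example above is easy to check by simulation. The sketch below is not from the lecture; it assumes numpy is available and compares the empirical CDF of Y = -log(X) with the derived F_Y(y) = 1 - e^{-y} at a few values of y.

import numpy as np

rng = np.random.default_rng(1)
x = 1.0 - rng.random(200_000)   # U(0, 1) sample, shifted into (0, 1] to avoid log(0)
y = -np.log(x)                  # Y = -log(X)

for t in (0.5, 1.0, 2.0, 4.0):
    empirical = (y <= t).mean()   # proportion of simulated Y values at or below t
    derived = 1 - np.exp(-t)      # F_Y(t) from the derivation above
    print(f"y = {t:3.1f}   empirical {empirical:.4f}   derived {derived:.4f}")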

1.4 Coda

Here's a fun quote from Larry Wasserman of Carnegie Mellon:

    "Students who analyze data, or who aspire to develop new methods for analyzing data, should be well grounded in basic probability and mathematical statistics. Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid."
        Larry Wasserman

At the time he wrote that, support vector machines were a much more prominent way to do statistical machine learning than they are now. There is a lot to be said for using the most powerful and sophisticated methods around. There is also a lot to be said for understanding exactly what you're doing top to bottom, knowing the use cases, assumptions, gotchas and special cases.

References

Rice, J. A. (2007). Mathematical Statistics and Data Analysis. Brooks/Cole, Belmont, CA.

Stigler, S. M. (2016). The Seven Pillars of Statistical Wisdom. Harvard University Press, Cambridge, MA.