STAT2201 Assignment 3 Semester 1, 2017 Due 13/4/2017

Similar documents
UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables

Statistics Assignment 2 HET551 Design and Development Project 1

STAT2201 Assignment 6 Semester 1, 2017 Due 26/5/2017

Statistical methods. Mean value and standard deviations Standard statistical distributions Linear systems Matrix algebra

Experiment 1: Linear Regression

EXAMINATIONS OF THE HONG KONG STATISTICAL SOCIETY

Problem Set 1. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 20

ENGR Spring Exam 2

Chapter 5 Joint Probability Distributions

Closed book and notes. 60 minutes. Cover page and four pages of exam. No calculators.

2. Variance and Covariance: We will now derive some classic properties of variance and covariance. Assume real-valued random variables X and Y.

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu

Business Statistics Midterm Exam Fall 2015 Russell. Please sign here to acknowledge

Lecture 25: Review. Statistics 104. April 23, Colin Rundel

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama

FINAL EXAM: Monday 8-10am

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression

Phonon dispersion relation and density of states of a simple cubic lattice

FINAL EXAM: 3:30-5:30pm

Introductory Statistics with R: Simple Inferences for continuous data

Introduction to Basic Statistics Version 2

First steps of multivariate data analysis

Determining the Spread of a Distribution

CORRELATION AND REGRESSION

Determining the Spread of a Distribution

Bayesian Analysis - A First Example

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

This exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text.

EE/CpE 345. Modeling and Simulation. Fall Class 9

STAT:2020 Probability and Statistics for Engineers Exam 2 Mock-up. 100 possible points

ECE Homework Set 2

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Lecture 2: Review of Probability

Introduction to Probability and Stocastic Processes - Part I

This does not cover everything on the final. Look at the posted practice problems for other topics.

AP Final Review II Exploring Data (20% 30%)

Physics with Matlab and Mathematica Exercise #1 28 Aug 2012

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

Introduction to bivariate analysis

(Re)introduction to Statistics Dan Lizotte

STA 2201/442 Assignment 2

Solutions - Homework #1

Semester , Example Exam 1

CS 147: Computer Systems Performance Analysis

Introduction to bivariate analysis

Statistics for scientists and engineers

Introduction to MatLab

RYERSON UNIVERSITY DEPARTMENT OF MATHEMATICS MTH 514 Stochastic Processes

Naive Bayes & Introduction to Gaussians

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables.

Continuous random variables

PhysicsAndMathsTutor.com. Advanced/Advanced Subsidiary. You must have: Mathematical Formulae and Statistical Tables (Blue)

Some Assorted Formulae. Some confidence intervals: σ n. x ± z α/2. x ± t n 1;α/2 n. ˆp(1 ˆp) ˆp ± z α/2 n. χ 2 n 1;1 α/2. n 1;α/2

SLR output RLS. Refer to slr (code) on the Lecture Page of the class website.

INSTITUTE OF ACTUARIES OF INDIA

Summary statistics. G.S. Questa, L. Trapani. MSc Induction - Summary statistics 1

Learning Objectives for Stat 225

An Introduction to MatLab

STAT Chapter 5 Continuous Distributions

Simple Linear Regression

STA2603/205/1/2014 /2014. ry II. Tutorial letter 205/1/

Math 494: Mathematical Statistics

Math 50: Final. 1. [13 points] It was found that 35 out of 300 famous people have the star sign Sagittarius.

Covariance. Lecture 20: Covariance / Correlation & General Bivariate Normal. Covariance, cont. Properties of Covariance

EE/CpE 345. Modeling and Simulation. Fall Class 10 November 18, 2002

Statistics for Data Analysis. Niklaus Berger. PSI Practical Course Physics Institute, University of Heidelberg

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011

Simulating Realistic Ecological Count Data

are the objects described by a set of data. They may be people, animals or things.

Chapter 5 continued. Chapter 5 sections

Stat 704 Data Analysis I Probability Review

EEL 5544 Noise in Linear Systems Lecture 30. X (s) = E [ e sx] f X (x)e sx dx. Moments can be found from the Laplace transform as

Determination of Density 1

18 Bivariate normal distribution I

The assumptions are needed to give us... valid standard errors valid confidence intervals valid hypothesis tests and p-values

STAT515, Review Worksheet for Midterm 2 Spring 2019

Chapter 10. Correlation and Regression. McGraw-Hill, Bluman, 7th ed., Chapter 10 1

Problem 1 (20) Log-normal. f(x) Cauchy

Stat 101 L: Laboratory 5

Can you predict the future..?

1 Probability Review. 1.1 Sample Spaces

Class 8 Review Problems solutions, 18.05, Spring 2014

Topic 2 Part 1 [195 marks]

EE 302: Probabilistic Methods in Electrical Engineering

IE 361 Exam 1 October 2004 Prof. Vardeman

(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box.

Mathematics 375 Probability and Statistics I Final Examination Solutions December 14, 2009

Asymptotic Statistics-III. Changliang Zou

Chapter 2. Continuous random variables

Bivariate data data from two variables e.g. Maths test results and English test results. Interpolate estimate a value between two known values.

ECE 5615/4615 Computer Project

Chapter 3 sections. SKIP: 3.10 Markov Chains. SKIP: pages Chapter 3 - continued

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer.

MAS223 Statistical Inference and Modelling Exercises

Lecture Notes for BUSINESS STATISTICS - BMGT 571. Chapters 1 through 6. Professor Ahmadi, Ph.D. Department of Management

Multivariate Distributions

Bivariate Distributions. Discrete Bivariate Distribution Example

In this course, we will be examining large data sets coming from different sources. There are two basic types of data:

Statistics I Exercises Lesson 3 Academic year 2015/16

Transcription:

Class Example 1. Single Sample Descriptive Statistics (a) Summary Statistics and Box-Plots You are working in factory producing hand held bicycle pumps and obtain a sample of 174 bicycle pump weights in grams, as in the data file Pumps.csv. Carry out descriptive statistics as follows: (i) Load the data file. using DataFrames data = readtable (" Pumps. csv ") (ii) Output the sample size, sample mean and sample standard deviation. ( length ( data [1]), mean ( data [1]), std ( data [1])) (iii) View basic summary statistics. using StatsBase summarystats ( data [1]) (iv) Create a box-plot of the data and comment on the structure of the data. using PyPlot PyPlot. boxplot ( data [1]) (v) Present the summary statistics and the box-plot of the data in a neatly formatted Julia notebook describing the data set. 1 of 11

(b) Empirical Cumulative Distribution Function (i) Key in the code below and run it. using Distributions, StatsBase, PyPlot zrv = Normal (20,5) numsamples = 15 data = rand (zrv, numsamples ) estimatedcdf = ecdf ( data ) grid = 0:0.1:40 PyPlot. plot (grid, estimatedcdf ( grid )) PyPlot. plot (grid, cdf (zrv, grid )) Figure 1: The resulting CDF of a theoretical Normal distribution with µ = 20 and σ = 5 plotted against an ECDF of randomly generated samples from the same distribution. (ii) Now modify the number of samples seeing how the ECDF converges to the true CDF as the number of samples increases. (iii) Change the distribution (zrv) to a distribution of your choice, e.g. Exponential or Uniform. Now run again to view the resulting graphs. Note that you may need to change the values of grid, appropriately. (c) Histograms and Kernel Density Estimation. The code below generates random variables from a mixture of two normal distributions with pdf represented by the blue curve below (left). It then plots a kernel density estimate of the data (red) and a histogram of the data. Figure 2: On left: Blue is theoretical pdf and red is a kernel density estimate (deps on data). On Right: A simple histogram. 2 of 11

Key in this code and experiment with increasing the number of samples. You may also try changing other parameters such as the means, variances and p. using Distributions, KernelDensity, PyPlot mu1 = 10; sigma1 =5 mu2 = 40; sigma2 =12 z1 = Normal (mu1, sigma1 ) z2 = Normal (mu2, sigma2 ) p = 0.3 # mixrv is a mixture of the Normal z1 and the Normal z2. function mixrv () ind = ( rand () <= p) if ind return rand (z1) else return rand (z2) # The pdf of a mixture is a mixture of the pdfs function actualpdf ( x) return p* pdf (z1,x) + (1 -p)* pdf (z2,x) numsamples = 100 data = [ mixrv () for _ in 1: numsamples ] grid = linspace ( -20,80,1000) # Plot the actual pdf ( in blue ) against the estimated KDE ( in red ) subplot ( 121) pdfactual = actualpdf.( grid ) PyPlot. plot (grid, pdfactual," blue ") kdedist = kde ( data ) pdfkde = pdf ( kdedist, grid ) PyPlot. plot (grid, pdfkde, " red ") # for comparison, plot a histogram subplot ( 122) PyPlot. plt [: hist ]( data,15, normed =" True ") 3 of 11

Class Example 2. The Correlation Coefficient and Bivariate Normal Distribution Consider a dataset consisting of the daily maximum temperature (in C) recorded at Brisbane and Gold Coast each day from the start of 2015 to the middle of Febuary 2017. The data is from the Australian Bureau of Meteorology, http://www.bom.gov.au/. Denote the observations for Brisbane and Gold Coast (respectively) by: x 1,..., x n and y 1,..., y n. Here n = 777. We will now perform the following: (a) Load the data file and ensure you have the right dimensions. (b) Calculate the following from the data: (i) The sample means for temperatures at both locations. (ii) The sample standard deviations for temperatures at both locations. (iii) The correlation coefficient estimate. (c) Use these estimates to describe the nature of the data in about 3 lines. (d) Draw a scatter plot of the data. (e) Assume the data comes from a bivariate Normal distribution. Solution: (i) Draw a contour plot diagram of the estimated distribution. (ii) Draw a 3D graph of the estimated distribution. (iii) Write an expression for the probability of having a day in Brisbane with temperature less than 30 degrees and temperature in Gold Coast greater than 25 degrees. (a) Load the.csv file as follows: data = readcsv (" BrisGCtemp. csv ") Then noticing that the data points begin on the second row and the respective temperatures are in the 4 th and 5 th columns, we extract arrays for Brisbane and Gold Coast as follows: Bris = convert ( Array { Float64,1}, data [2:,4]) GC = convert ( Array { Float64,1}, data [2:,5]) length ( Bris ), length (GC) (b) (i) The formulas for calculating the sample mean are as follows: n n x = mean ( Bris ), mean (GC) x i n and y = y i n (ii) The sample variance is calculated using: n (x i x) 2 s 2 = n 1 = n x 2 i n x2 n 1 and the sample standard deviation, s, is the positive square root of the sample variance. We calculate this here using both the formula and the built-in Julia std() function. 4 of 11

sqrt ( ( sum ([x^2 for x in Bris ]) - length ( Bris ) * mean ( Bris )^2) /( length ( Bris ) -1)), std ( Bris ), sqrt ( sum ([(y- mean (GC ))^2 for y in GC ]) / ( length (GC ) -1)), std (GC) (iii) The correlation coefficient estimate is calculated using: r xy = In Julia we use the following: cor (Bris,GC) n (y i y)(x i x) [ n (y i y) 2 n ] 1/2 (x i x) 2 (c) We see that the estimated mean temperature in Brisbane is 27.15 and the estimated mean temperature in the Gold Coast is 26.16. These are with corresponding sample standard deviations 4.02 and 3.52. The correlation coefficient is 0.92 hinting that on days where the temperature is high in Brisbane, it is more likely to be high in the Gold Coast, and vice versa. Note: Since the data is daily data over a long period, it is very plausible that it encompasses seasonal effects. Hence when considering the variation of the data (standard deviations of around 4) we need to be cognisant of the fact that much of the variation is perhaps due to seasonal effects. Separating the seasonal effects and random effects, is a matter of further study. (d) We can visualise the data using a scatter plot: using PyPlot PyPlot. plot (Bris,GC,".", markersize =2) xlabel (" Brisbane Daily Max Temp (C)") ylabel (" Gold Coast Daily Max Temp (C)"); Figure 3: Brisbane Max Temp vs Gold Coast Max Temp (e) Understanding the code below is beyond the scope of the course. Nevertheless, it is good to see as it provides a neat representation of the data together with the fitted bivariate Normal Distribution, answering (i)-(ii). 5 of 11

using PyPlot function bivariategauss ( x, mu, sigma ) return ((1 / (2 pi )) * det ( inv ( sigma ))^(1/2)) * exp (- (x - mu) * inv ( sigma ) * (x - mu ))[1] n = 200 mu = [ mean ( Bris ); mean (GC )] sigma = [ std ( Bris )^2 cor (Bris,GC )* std (GC )* std ( Bris ); cor (Bris,GC )* std (GC )* std ( Bris ) std (GC )^2] x = linspace (15, 40, n) y = linspace (15, 40, n) xgrid = repmat (x, n, 1 ) ygrid = repmat (y, 1, n ) grid = [ xgrid [:] ygrid [:]] z = [ bivariategauss ( grid [:,i], mu, sigma ) for i in 1: size (grid, 2)] zgrid = reshape (z, n, n) PyPlot. figure (" Contour plot ") xlim (15,40) ylim (15,40) CS = PyPlot. contour ( xgrid, ygrid, zgrid, levels =[0.0001,0.001, 0.005, 0.02, 0.03]) PyPlot. clabel (CS, inline =1, fontsize =10) PyPlot. plot (Bris, GC, ".", markersize =2, label =" scatter ") xlabel (" Brisbane Daily Max Temp (C)") ylabel (" Gold Coast Daily Max Temp (C)") title (" Contour Plot ") z = zeros (n,n) for i in 1: n for j in 1: n z[i,j] = bivariategauss ([x[i],y[j]], mu, sigma ) fig = figure (" pyplot_surfaceplot ",figsize =(5,10)) ax = fig [: add_subplot ](2,1,1, projection = "3d") ax [: plot_surface ]( xgrid, ygrid, z, rstride =2, edgecolors ="k", cstride =2, cmap = ColorMap (" gray "), alpha =0.8, linewidth =0. 25) xlabel (" Brisbane Daily Max Temp (C)") ylabel (" Gold Coast Daily Max Temp (C)") title (" Surface Plot ") 6 of 11

(a) Contour plot (b) 3D surface plot Figure 4: Brisbane Max Temp vs Gold Coast Max Temp (iii) Write an expression for the probability of having a day in Brisbane with temperature less than 30 degrees and temperature in Gold Coast greater than 25 degrees. ( ) 30 P X < 30, Y > 25 = f X,Y (x, y; s x, s y, x, y, ˆρ) dx dy, x= y=25 where f X,Y ( ) is the bivariate normal density: 1 f XY (x, y; σ X, σ Y, µ X, µ Y, ρ) = 2πσ X σ Y 1 ρ 2 { [ ]} 1 (x µ X ) 2 exp 2(1 ρ 2 ) σx 2 2ρ(x µ X)(y µ Y ) + (y µ Y ) 2 σ X σ Y σy 2. 7 of 11

Class Example 3. Statistical Inference Ideas. A farmer wanted to test whether a new fertilizer was effective at increasing the yield of her tomato plants. She took 20 plants, kept 10 as controls and treated the remaining 10 with the new fertilizer. After two months, she harvested the plants and recorded the yield of each plant (in kg), as shown in the following table: Control 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 Fertilizer 6.31 5.12 5.54 5.5 5.37 5.29 4.92 6.15 5.8 5.26 From this data, the group of plants treated with the fertilizer had an average yield of 0.494 kg greater than the control group. One could argue that this difference is due to the effects of fertilizer. We will now investigate if this is a reasonable assumption. Let us assume for a moment that the fertilizer has no effect on plant yield (it is a placebo/has no added nutrients). In such a scenario, we actually have 20 observations from the same group (non-nutrient enriched plants), and the average difference is purely the result of random chance. We can investigate this, by taking all possible combinations of 10 samples from our group of 20 observations, and counting how many of these combinations results in a sample mean greater than the mean of our treatment group. Dividing this total by the total number of possible combinations, we can obtain a proportion, and hence likelihood, that the difference observed was due purely to random chance. First calculate the number of ways of sampling r = 10 unique items from a total of n = 20. This is given by, ( ) ( ) n n! 20 = r r!(n r)! = = 184, 756. 10 This number of possible samples is computationally managable, and hence we ll use Julia s combinatorics package to enumerate all 184, 756 combinations. using Combinatorics, DataFrames # Import data data = readtable (" Fertilizer. csv ") control = data [1] fertilizer = data [2] # Concantenate the data, and collect all 20 C10 combinations. x = collect ( combinations ([ control ; fertilizer ],10)) # Let s just make sure that the number of combinations is correct println (" Number of combinations : ",length (x)) # Construct a vector of means for each combination, and calculate the # proportion of times these means are >= than the mean of treated group. pvalue = sum ([ mean (_) >= mean ( fertilizer ) for _ in x ])/ length (x) We can see that from this data, only 2.39% of all possible combinations have a mean greater or equal to our treated group. Therefore there is significant statistical evidence that the fertilizer increases the yield. 8 of 11

Question 1. Joint Probability Mass Function Consider the function p XY (, ): Determine the following: x y p XY (x, y) 1.0 1.0 1/4 1.5 2.0 1/8 1.5 3.0 1/4 2.5 4.0 1/4 3.0 5.0 1/8 (a) Show that p X,Y is a valid probability mass function. (b) P (X < 2.5, Y < 3). (c) P (X < 2.5). (d) P (Y < 3). (e) P (X > 1.8, Y > 4.7). (f) E(X), E(Y ), V (X), V (Y ). (g) Are X and Y indepent random variables? (h) P (X + Y 4). Question 2. More Fun With Two Random Variables Let X and Y be indepent random variables with E(X) = 2, V (X) = 5, E(Y ) = 6, V (Y ) = 8. Determine the following: (a) E(3X + 2Y ). (b) V (3X + 2Y ). Assume now further to the above that X and Y are normally distributed and determine the following: (c) P (3X + 2Y < 18). (d) P (3X + 2Y < 28). (e) Verify (c) and (d) using Julia code, where for each case you generate a million X s and a million Y s and simulate the linear combination 3X + 2Y. (f) Assume now that the random variables come from another distribution (not Normal), but keep the same means and variances. Are your answers for (c) and (d) likely to change? How about your answers for (a) and (b)? (g) Assume now that X and Y are Normally distributed but are not indepent, but rather Cov(X, Y ) = 5. Write an explicit expression using a double integral for P (X < 2, Y > 7). 9 of 11

Question 3. Rise of the machines A semiconductor manufacturer produces devices used as central processing units in personal computers. The speed of the devices (in megahertz) is important because it determines the price that the manufacturer can charge for the devices. The file (6-42.csv) contains measurements on 120 devices. Construct the following plots for this data and comment on any important features that you notice. (a) Histogram. (b) Box-plot. (c) Kernel Density Estimate. (d) Empirical cumulative distribution function. Further, compute: (e) The sample mean, the sample standard deviation and the sample median. (f) What percentage of the devices has a speed exceeding 700 megahertz? Question 4. The thickest rod Eight measurements were made on the inside diameter of forged piston rings used in an automobile engine. The data (in millimetres) is: 74.001, 74.003, 74.015, 74.000, 74.005, 74.002, 74.005, 74.004. Use the Julia function below to construct a normal probability plot of the piston ring diameter data. Does it seem reasonable to assume that piston ring diameter is normally distributed? using PyPlot, Distributions, StatsBase function NormalProbabilityPlot ( data ) mu = mean ( data ) sig = std ( data ) n = length ( data ) p = [(i -0.5)/ n for i in 1:n] x = quantile ( Normal (),p) y = sort ([(i-mu )/ sig for i in data ]) PyPlot. scatter (x,y) xrange = maximum (x) - minimum (x) PyPlot. plot ([ minimum (x)- xrange /8, maximum (x) + xrange /8], [ minimum (x)- xrange /8, maximum (x)+ xrange /8], color =" red ",linewidth =0.5) xlabel (" Theoretical quantiles ") ylabel (" Quantiles of data "); return 10 of 11

Question 5. A non-flat earth In 1789, Henry Cavish estimated the density of the Earth by using a torsion balance. His 29 measurements are in the the file (6-122.csv), expressed as a multiple of the density of water. (a) Calculate the sample mean, sample standard deviation, and median of the Cavish density data. (b) Construct a normal probability plot of the data. Comment on the plot. Does there seem to be a low outlier in the data? (c) Would the sample median be a better estimate of the density of the earth than the sample mean? Why? Question 6. Normal Confidence Interval A normal population has a mean 100 and variance 25. How large must the random sample be if you want the standard error of the sample average to be 1.5? Question 7. Fill in the blanks A random sample has been taken from a normal distribution. Output from a software package follows: Variable N Mean SE Mean StDev Variance Sum x?? 1.58 6.11? 751.40 (a) Fill in the missing quantities. (b) Find a 95% CI on the population mean under the assumption that the standard deviation is known. Question 8. More on the randomization test Reproduce class example 3. Now modify the data so that the yield of the Fertilizer is decreased by exactly 0.5 kg per observation (i.e. the first observation is 5.81, the second is 4.62 and so fourth). What are the results now? How do you interpret them? 11 of 11