Python Analysis. PHYS 224 October 1/2, 2015



2 Goals

Two things to teach in this lecture:
1. How to use Python to fit data
2. How to interpret what Python gives you

Some references: ScipyScriptRepo/CurveFitting.ipynb

3 Fitting Experimental Data

The goal of the lab experiments is to determine a physical quantity y (the dependent variable) as a function of x (the independent variable). How? Measure the pair (x_i, y_i) a number of times (N), then find a fit function y = y(x) that describes the relationship between these two quantities.

4 The Linear Case

The simplest function relating the two variables is the linear function y = f(x) = a*x + b. This is valid for any (x_i, y_i) combination: if a and b are known, the true value of y_i can be calculated for any x_i as y_i,true = a*x_i + b.

5 Linear Regression

Linear regression calculates the most probable values of a and b such that the linear equation y_i,true = a*x_i + b holds. When taking measurements of y_i, these usually obey a Gaussian distribution.

6 An Example

Ideal Gas Law: P*V = n*R*T (Pressure * Volume = n * R * Temperature), which can be rearranged into linear form as P = [(n*R)/V]*T.
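As a minimal sketch of this linear form (variable names are my own, and the data here are generated, not the lab's measured file), the pressures predicted by the ideal gas law for 1 mole in a 1 m³ volume are simply a straight line through the origin with slope n*R/V:

```python
import numpy as np

# Generated illustration (not the lab data): P = n*R*T/V for n = 1 mol
# in V = 1 m^3, which matches y = a + b*x with a = 0 and slope b = n*R/V.
R = 8.314          # gas constant, J/(mol*K)
n, V = 1.0, 1.0    # 1 mole in 1 m^3

temperatures = np.arange(270, 355, 5)    # K, same grid as the later slides
pressures = (n * R / V) * temperatures   # Pa

print(pressures[0])   # pressure at 270 K
```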

7 Fitting in Python

We're going to use the curve_fit function, which is part of the scipy.optimize package. The usage is as follows:

fit_parameters, fit_covariance = scipy.optimize.curve_fit(fit_function, x_data, y_data, p0=guess, sigma=uncertainty)

#fit_parameters - an array of the output fit parameters
#fit_covariance - an array of the covariance of the output fit parameters
#fit_function - the function used to do the fit
#p0=guess - the initial guess input to the fit
#sigma=uncertainty - the uncertainty associated with the data

8 Fitting with curve_fit

import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting
def linearfit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearfit, temp_data, vol_data, p0=(1.0,8.0), sigma=uncertainty)

9 Fitting with curve_fit

The same code, annotated: in the curve_fit call, linearfit is the fit function, temp_data is the x data, vol_data is the y data, p0=(1.0,8.0) is the initial guess for the parameters, and sigma=uncertainty is the uncertainty on the data.

10 Results

fit parameters = [ ]
fit covariance = [[ e e+01] [ e e-01]]

So what does this mean? We set up the function for the fit to be:
y = p[0] + p[1]*x
So with the fit parameters, the function is y = p[0] + p[1]*x with the fitted numerical values substituted.

11 How did it do this?

The function curve_fit uses a minimizer: it varies the fit parameters (p[0] and p[1]) to find the values that are most likely to fit the data properly. This depends on the residuals, which are the differences between the result of the fit and the data at each point. We'll discuss the quantity which is minimized in a bit.

12 Probability

The probability of any one point being from the fit is

P_{a,b}(y) = \frac{1}{\sigma_y \sqrt{2\pi}} e^{-\frac{(y - a - bx)^2}{2\sigma_y^2}}

where y is the measured data and a and b are from the fit.

13 Full Probability

For a set of N measurements of the dependent variable y: y_1, y_2, y_3, ..., y_N, the probability of obtaining these values is the product of the individual probabilities:

P_{a,b}(y_1, y_2, \ldots, y_N) = P_{a,b}(y_1) P_{a,b}(y_2) \cdots P_{a,b}(y_N) = \left(\frac{1}{\sigma_y \sqrt{2\pi}}\right)^N e^{-\sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{2\sigma_y^2}}

14 Full Probability

The sum appearing in the exponent of the full probability,

\sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_y^2}

is called the chi-squared (χ²).

15 Chi-Squared

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_y^2}

The numerator contains the definition of the residuals, i.e. the measured data (y_i) minus the fit value (a + b*x_i). Dividing this by the standard deviation (σ_y) tells us how many standard deviations the data point is away from the fit at that x. The square ensures this is always positive.
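A minimal sketch of this sum, using toy numbers rather than the lab data, so each term can be checked by hand:

```python
import numpy as np

# Toy example (made-up values): chi-squared for a linear fit y = a + b*x
# against three measured points, all with the same sigma_y = 1.
a, b = 1.0, 2.0
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.1, 2.9, 5.2])    # measured values
sigma_y = 1.0

residuals = y - (a + b * x)      # y_i - a - b*x_i
chisq = np.sum((residuals / sigma_y) ** 2)
print(chisq)                     # 0.1^2 + 0.1^2 + 0.2^2 = 0.06
```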

16 Plotting the Residuals

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/users/kclark/desktop/teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearfit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#now generate the line of the best fit
#set up the temperature points for the full array
fit_temp = numpy.arange(270,355,5)

#make the data for the best fit values
fit_answer = linearfit(fit_temp,*fit_parameters)

#calculate the residuals
fit_resid = vol_data-linearfit(temp_data,*fit_parameters)

#make a line at zero
zero_line = numpy.zeros(len(vol_data))

17 How do the Residuals Look?

The residuals are obviously a large component of the χ² value used by the minimizer. They can be plotted to look for trends and to see if the fit function is appropriate.

18 Other Results from curve_fit

curve_fit returns not only the best values for the parameters p[0] and p[1]: the fit covariance matrix is also returned. One strength of curve_fit is the ease of use of this covariance matrix.

19 Interpreting the Covariance Matrix

fit parameters = [ ]
fit covariance = [[ e e+01] [ e e-01]]

The diagonal elements are the squares of the standard deviations of the corresponding parameters. The off-diagonal elements show the relationship between the parameters:

cov(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
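A sketch of reading a covariance matrix this way; the matrix values below are made up for illustration, not the slide's actual fit output. Normalizing the off-diagonal element by the two standard deviations gives the correlation coefficient between the parameters:

```python
import numpy as np

# Hypothetical 2x2 covariance matrix (illustration only)
fit_covariance = np.array([[2.5e+02, -8.0e-01],
                           [-8.0e-01, 4.0e-03]])

sigma0 = np.sqrt(fit_covariance[0, 0])   # std. dev. of p[0]
sigma1 = np.sqrt(fit_covariance[1, 1])   # std. dev. of p[1]

# off-diagonal element, normalized -> correlation coefficient
corr = fit_covariance[0, 1] / (sigma0 * sigma1)
print(sigma0, sigma1, corr)
```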

20 Fit Results

import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearfit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/users/kclark/desktop/teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearfit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])

21 Fit Results

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearfit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])

#calculate the mean fit result to plot the line
fit_line = linearfit(temp_data,*fit_parameters)

#calculate the residuals
fit_residuals = vol_data - fit_line

#calculate the data for the best fit plus one sigma in parameter 1
params_plus1sigma = numpy.array([fit_parameters[0],fit_parameters[1]+sigma1])
data_plus1sigma = linearfit(temp_data,*params_plus1sigma)

#calculate the data for the best fit minus one sigma in parameter 1
params_minus1sigma = numpy.array([fit_parameters[0],fit_parameters[1]-sigma1])
data_minus1sigma = linearfit(temp_data,*params_minus1sigma)

#do some plotting of the results (fit_temp and fit_answer are from the earlier slide)
pyplot.errorbar(temp_data,vol_data,yerr=uncertainty,marker='o',ls='none')
pyplot.plot(fit_temp,fit_answer,'b--')
pyplot.plot(temp_data,data_plus1sigma,'g--',temp_data,data_minus1sigma,'g--')
pyplot.title("Ideal Gas Law Example")
pyplot.xlabel("Temperature (K)")
pyplot.ylabel("Pressure (Pa)")

22 Fit Results

fit parameters = [ ]
fit covariance = [[ e e+01] [ e e-01]]

Calculate the standard deviation on the slope (p[1]): this is the square root of the [1,1] entry of the covariance matrix.

23 Fit Results

fit parameters = [ ]
fit covariance = [[ e e+01] [ e e-01]]

Show the p[1] parameter with its standard deviation: p1 = 8.33 ± 0.47

24 Comparison to Accepted Values

We obtained the result p[1] = 8.33 ± 0.47. We assume that there is 1 mole in a 1 m³ volume, so that n = V = 1. The accepted value of R is 8.314 J/(mol·K). The accepted value IS contained within our uncertainty (our one-sigma range is from 7.86 to 8.80), so these values agree within their error.
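The agreement check described above can be sketched as a one-line calculation: how many standard deviations separate the fitted slope from the accepted value? (The accepted value of R here is my own insertion of the standard constant, rounded to 8.314.)

```python
# Agreement check: distance between fitted slope and accepted value,
# measured in units of the fit uncertainty.
p1, sigma1 = 8.33, 0.47   # fitted slope and its standard deviation
R_accepted = 8.314        # gas constant, J/(mol*K)

n_sigma = abs(p1 - R_accepted) / sigma1
print(n_sigma)            # well under 1 sigma, so the values agree
```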

25 Application to Non-linear Examples

This method can also be applied to other examples:

Powers: y = b*√x can be linearized as y² = b²*x
Polynomials: y = a + b*x + c*x² + d*x³ — this is just a case of using multiple regression, since the equation is linear in the coefficients
Exponentials: y = a*e^(b*x) can be linearized as ln(y) = ln(a) + b*x

There are many other examples.
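The exponential case above can be sketched with synthetic data (a = 2, b = 0.5, my own choices, not the lab's): fitting a straight line to (x, ln y) recovers b as the slope and ln(a) as the intercept.

```python
import numpy as np

# Synthetic exponential data: y = a*exp(b*x)
a_true, b_true = 2.0, 0.5
x = np.linspace(0, 4, 20)
y = a_true * np.exp(b_true * x)

# Linearize: ln(y) = ln(a) + b*x, then do a straight-line fit
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, np.exp(intercept))   # recovers b and a
```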

26 Return to Chi-Squared

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}

Here the definition of the residual has changed: instead of y_i - a - b*x_i, a more general term has been used. y_i is still the data; y(x_i) is the fit function evaluated at x_i.

27 Gauss Distribution

The probability is described by

P(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \bar{x})^2}{2\sigma^2}}

where the average (mean) value is \bar{x} and the spread in values is σ.
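This density can be evaluated directly with a small helper function (my own, not from the slides); scipy.stats.norm.pdf would give the same numbers:

```python
import numpy as np

# Gaussian probability density, written out term by term
def gauss_pdf(x, mean, sigma):
    return np.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# peak of the standard normal distribution, 1/sqrt(2*pi)
print(gauss_pdf(0.0, 0.0, 1.0))   # about 0.3989
```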

28 Gauss Distribution

We use the probabilities shown above to determine how probable a value is in this distribution. When we take a measurement, we expect that 68.2% of the time it will be within 1σ of the mean value. Another way of phrasing this is that we expect a value to be more than 3σ above the mean value only about 0.1% of the time.
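The quoted percentages can be checked with scipy.stats.norm (scipy is already available, since we use it for curve_fit); this is a sketch, not part of the original slides:

```python
from scipy.stats import norm

# fraction of measurements expected within +/- 1 sigma of the mean
within_1sigma = norm.cdf(1) - norm.cdf(-1)

# fraction of measurements expected more than 3 sigma ABOVE the mean
above_3sigma = 1 - norm.cdf(3)

print(within_1sigma)   # about 0.683
print(above_3sigma)    # about 0.0013
```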

29 Another example

30 Fitting the Gaussian

import numpy
import scipy.optimize
import matplotlib.pyplot as pyplot
import pylab as py

#define the function to be used in the fitting, which is a Gaussian plus a constant offset in this case
def gaussfit(x,*p):
    return p[0]+p[1]*numpy.exp(-1*(x-p[2])**2/(2*p[3]**2))

#read in the data (currently only located on my hard drive...)
day_num, rain_data = numpy.loadtxt('/users/kclark/desktop/teaching/phys224/weather_data/precip_2013.txt', unpack=True)

#get some (pretty good) guesses for the fitting parameters
data_mean = rain_data.mean()
data_std = rain_data.std()

#set up the histogram so that it can be fit
data_plot = py.hist(rain_data,range=(0.1,90),bins=100)
histx = [0.5 * (data_plot[1][i] + data_plot[1][i + 1]) for i in range(100)]
histy = data_plot[0]

#actually do the fitting
fit_parameters, fit_covariance = scipy.optimize.curve_fit(gaussfit,histx,histy,p0=(5.0,10.0,data_mean,data_std))

31 Another example

Fit mean: 7.06 mm
Fit standard deviation: 10.13 mm



34 Another example

Fit mean: 7.06 mm
Fit standard deviation: 10.13 mm

A rainfall of 85.5 mm is 7.74 standard deviations above the mean (from this data), which is extremely unlikely.
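The number of sigmas quoted above is just the distance from the mean divided by the standard deviation; as a sketch, using the fitted values from the rainfall fit:

```python
# How many standard deviations above the fitted mean is the 85.5 mm rainfall?
fit_mean = 7.06    # mm
fit_std = 10.13    # mm
rainfall = 85.5    # mm

n_sigma = (rainfall - fit_mean) / fit_std
print(n_sigma)     # about 7.74
```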

35 Chi-Squared and Goodness of Fit

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}

This can then be used as a goodness-of-fit test: if the function is a good approximation, then each residual will typically be within one standard deviation, so the sum will be approximately N.

36 Chi-Squared

\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}

We normally use the number of degrees of freedom (DOF) of the experiment to determine the fit quality. The number of DOF is the number of data points in the sample minus the number of parameters in the fit: for a sample with 20 data points and a linear fit (2 parameters), DOF = 18. This is used as the goodness of fit, since χ²/DOF ≈ 1 for a good fit.

37 Revisit the First Example

import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearfit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('/users/kclark/desktop/teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters, fit_covariance = scipy.optimize.curve_fit(linearfit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#calculate the chi-squared value
chisq = sum(((vol_data-linearfit(temp_data,*fit_parameters))/uncertainty)**2)
print(chisq)

#calculate the number of degrees of freedom
dof = len(temp_data)-len(fit_parameters)
print(dof)

38 Revisit the First Example

Is this a good fit?

\chi^2 = \sum_{i=1}^{16} \left(\frac{\text{presdata}_i - \text{fit}_i}{\text{uncertainty}}\right)^2 = 65.6

Divide this by the DOF. We have 16 data points and 2 parameters, so DOF = 16 - 2 = 14, and χ²/DOF = 65.6/14 ≈ 4.7. This may not be a great fit...

39 Goodness of Fit

The previous statements are only mostly true. More accurately:

χ²/DOF >> 1 is a very poor fit, maybe even a fit model which doesn't match
χ²/DOF > 1 is not a good fit, or the uncertainty is underestimated
χ²/DOF << 1 means the uncertainty could be overestimated

40 Summary

You should now be well prepared to use Python to fit your data. Your practice with this starts with the next pendulum exercise, which you can begin now!


More information

Linear Regression Spring 2014

Linear Regression Spring 2014 Linear Regression 18.05 Spring 2014 Agenda Fitting curves to bivariate data Measuring the goodness of fit The fit vs. complexity tradeoff Regression to the mean Multiple linear regression January 1, 2017

More information

2.5 The Fundamental Theorem of Algebra.

2.5 The Fundamental Theorem of Algebra. 2.5. THE FUNDAMENTAL THEOREM OF ALGEBRA. 79 2.5 The Fundamental Theorem of Algebra. We ve seen formulas for the (complex) roots of quadratic, cubic and quartic polynomials. It is then reasonable to ask:

More information

Regression I - the least squares line

Regression I - the least squares line Regression I - the least squares line The difference between correlation and regression. Correlation describes the relationship between two variables, where neither variable is independent or used to predict.

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Integration of Rational Functions by Partial Fractions

Integration of Rational Functions by Partial Fractions Title Integration of Rational Functions by MATH 1700 MATH 1700 1 / 11 Readings Readings Readings: Section 7.4 MATH 1700 2 / 11 Rational functions A rational function is one of the form where P and Q are

More information

Linear Regression (continued)

Linear Regression (continued) Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

More information

STAT 361 Fall Homework 1: Solution. It is customary to use a special sign Σ as an abbreviation for the sum of real numbers

STAT 361 Fall Homework 1: Solution. It is customary to use a special sign Σ as an abbreviation for the sum of real numbers STAT 361 Fall 2016 Homework 1: Solution It is customary to use a special sign Σ as an abbreviation for the sum of real numbers x 1, x 2,, x n : x i = x 1 + x 2 + + x n. If x 1,..., x n are real numbers,

More information

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2015 Notes

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2015 Notes Data Analysis Standard Error and Confidence Limits E80 Spring 05 otes We Believe in the Truth We frequently assume (believe) when making measurements of something (like the mass of a rocket motor) that

More information

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Topics in Probability and Statistics

Topics in Probability and Statistics Topics in Probability and tatistics A Fundamental Construction uppose {, P } is a sample space (with probability P), and suppose X : R is a random variable. The distribution of X is the probability P X

More information

Introduction to Data Analysis

Introduction to Data Analysis Introduction to Data Analysis Analysis of Experimental Errors How to Report and Use Experimental Errors Statistical Analysis of Data Simple statistics of data Plotting and displaying the data Summary Errors

More information

Numerical Analysis Spring 2001 Professor Diamond

Numerical Analysis Spring 2001 Professor Diamond Numerical Analysis Spring 2001 Professor Diamond Polynomial interpolation: The Polynomial Interpolation Problem: Find a polynomial pÿx a 0 a 1 x... a n x n which satisfies the n 1 conditions pÿx 0 y 0,

More information

Integration of Rational Functions by Partial Fractions

Integration of Rational Functions by Partial Fractions Title Integration of Rational Functions by Partial Fractions MATH 1700 December 6, 2016 MATH 1700 Partial Fractions December 6, 2016 1 / 11 Readings Readings Readings: Section 7.4 MATH 1700 Partial Fractions

More information

Conditioning and Stability

Conditioning and Stability Lab 17 Conditioning and Stability Lab Objective: Explore the condition of problems and the stability of algorithms. The condition number of a function measures how sensitive that function is to changes

More information

Uncertainty in Physical Measurements: Module 5 Data with Two Variables

Uncertainty in Physical Measurements: Module 5 Data with Two Variables : Often data have two variables, such as the magnitude of the force F exerted on an object and the object s acceleration a. In this Module we will examine some ways to determine how one of the variables,

More information

SECTION 7: CURVE FITTING. MAE 4020/5020 Numerical Methods with MATLAB

SECTION 7: CURVE FITTING. MAE 4020/5020 Numerical Methods with MATLAB SECTION 7: CURVE FITTING MAE 4020/5020 Numerical Methods with MATLAB 2 Introduction Curve Fitting 3 Often have data,, that is a function of some independent variable,, but the underlying relationship is

More information

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2012 Notes

Data Analysis, Standard Error, and Confidence Limits E80 Spring 2012 Notes Data Analysis Standard Error and Confidence Limits E80 Spring 0 otes We Believe in the Truth We frequently assume (believe) when making measurements of something (like the mass of a rocket motor) that

More information

TMA4255 Applied Statistics V2016 (5)

TMA4255 Applied Statistics V2016 (5) TMA4255 Applied Statistics V2016 (5) Part 2: Regression Simple linear regression [11.1-11.4] Sum of squares [11.5] Anna Marie Holand To be lectured: January 26, 2016 wiki.math.ntnu.no/tma4255/2016v/start

More information

Calculus I Homework: The Derivatives of Polynomials and Exponential Functions Page 1

Calculus I Homework: The Derivatives of Polynomials and Exponential Functions Page 1 Calculus I Homework: The Derivatives of Polynomials and Exponential Functions Page 1 Questions Example Differentiate the function y = ae v + b v + c v 2. Example Differentiate the function y = A + B x

More information

Linear Least Squares Fitting

Linear Least Squares Fitting Linear Least Squares Fitting Bhas Bapat IISER Pune Nov 2014 Bhas Bapat (IISER Pune) Linear Least Squares Fitting Nov 2014 1 / 16 What is Least Squares Fit? A procedure for finding the best-fitting curve

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

Modeling of Data. Massimo Ricotti. University of Maryland. Modeling of Data p. 1/14

Modeling of Data. Massimo Ricotti. University of Maryland. Modeling of Data p. 1/14 Modeling of Data Massimo Ricotti ricotti@astro.umd.edu University of Maryland Modeling of Data p. 1/14 NRiC 15. Model depends on adjustable parameters. Can be used for constrained interpolation. Basic

More information

An introduction to plotting data

An introduction to plotting data An introduction to plotting data Eric D. Black California Institute of Technology v2.0 1 Introduction Plotting data is one of the essential skills every scientist must have. We use it on a near-daily basis

More information

Can you predict the future..?

Can you predict the future..? Can you predict the future..? Gaussian Process Modelling for Forward Prediction Anna Scaife 1 1 Jodrell Bank Centre for Astrophysics University of Manchester @radastrat September 7, 2017 Anna Scaife University

More information

Lecture 16 - Correlation and Regression

Lecture 16 - Correlation and Regression Lecture 16 - Correlation and Regression Statistics 102 Colin Rundel April 1, 2013 Modeling numerical variables Modeling numerical variables So far we have worked with single numerical and categorical variables,

More information

Exercise 4 Modeling transient currents and voltages

Exercise 4 Modeling transient currents and voltages Exercise 4 Modeling transient currents and voltages Basic A circuit elements In a D circuit, the electro-motive forces push the electrons along the circuit and resistors remove that energy by conversion

More information

LECTURE NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO

LECTURE NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO LECTURE NOTES FYS 4550/FYS9550 - EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I PROBABILITY AND STATISTICS A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO Before embarking on the concept

More information

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables Chapter 2 Some Basic Probability Concepts 2.1 Experiments, Outcomes and Random Variables A random variable is a variable whose value is unknown until it is observed. The value of a random variable results

More information

Introduction to Python

Introduction to Python Introduction to Python Luis Pedro Coelho luis@luispedro.org @luispedrocoelho European Molecular Biology Laboratory Lisbon Machine Learning School 2015 Luis Pedro Coelho (@luispedrocoelho) Introduction

More information

Lecture 1, August 21, 2017

Lecture 1, August 21, 2017 Engineering Mathematics 1 Fall 2017 Lecture 1, August 21, 2017 What is a differential equation? A differential equation is an equation relating a function (known sometimes as the unknown) to some of its

More information

Curve Fitting. Objectives

Curve Fitting. Objectives Curve Fitting Objectives Understanding the difference between regression and interpolation. Knowing how to fit curve of discrete with least-squares regression. Knowing how to compute and understand the

More information

The Treatment of Numerical Experimental Results

The Treatment of Numerical Experimental Results Memorial University of Newfoundl Department of Physics Physical Oceanography The Treatment of Numerical Experimental Results The purpose of these notes is to introduce you to some techniques of error analysis

More information

A( x) B( x) C( x) y( x) 0, A( x) 0

A( x) B( x) C( x) y( x) 0, A( x) 0 3.1 Lexicon Revisited The nonhomogeneous nd Order ODE has the form: d y dy A( x) B( x) C( x) y( x) F( x), A( x) dx dx The homogeneous nd Order ODE has the form: d y dy A( x) B( x) C( x) y( x), A( x) dx

More information

Lorenz Equations. Lab 1. The Lorenz System

Lorenz Equations. Lab 1. The Lorenz System Lab 1 Lorenz Equations Chaos: When the present determines the future, but the approximate present does not approximately determine the future. Edward Lorenz Lab Objective: Investigate the behavior of a

More information

Stat 704 Data Analysis I Probability Review

Stat 704 Data Analysis I Probability Review 1 / 39 Stat 704 Data Analysis I Probability Review Dr. Yen-Yi Ho Department of Statistics, University of South Carolina A.3 Random Variables 2 / 39 def n: A random variable is defined as a function that

More information

STA442/2101: Assignment 5

STA442/2101: Assignment 5 STA442/2101: Assignment 5 Craig Burkett Quiz on: Oct 23 rd, 2015 The questions are practice for the quiz next week, and are not to be handed in. I would like you to bring in all of the code you used to

More information

1111: Linear Algebra I

1111: Linear Algebra I 1111: Linear Algebra I Dr. Vladimir Dotsenko (Vlad) Lecture 6 Dr. Vladimir Dotsenko (Vlad) 1111: Linear Algebra I Lecture 6 1 / 14 Gauss Jordan elimination Last time we discussed bringing matrices to reduced

More information

1.4 Techniques of Integration

1.4 Techniques of Integration .4 Techniques of Integration Recall the following strategy for evaluating definite integrals, which arose from the Fundamental Theorem of Calculus (see Section.3). To calculate b a f(x) dx. Find a function

More information