Introduction to Python
Practical 2

Daniel Carrera & Brian Thorsbro
November 2017

1 Searching through data

One of the most important skills in scientific computing is sorting through large datasets and extracting the information that is interesting. On your PC, create a folder for the exercises. Download the Hipparcos catalogue as a text file from the web page for ASTM13 (below). Be careful not to save it as an HTML file. Check the file with Notepad to make sure that it contains only data.

    http://www.astro.lu.se/education/utb/astm13/hipparcos.txt

In the following we assume that the data file is called hipparcos.txt. Start python/spyder. Change the Current Directory to your folder. Open the script editor (press "New File") and type in the following script:

    # Load the functions from the numpy and matplotlib libraries
    from numpy import *
    from matplotlib.pyplot import *

    # Read from the file into the 2D array data (rows = stars, columns = quantities)
    data = loadtxt('hipparcos.txt')

    # Columns.
    HIP  = data[..., 0]  # (---)    Hipparcos number.
    l    = data[..., 1]  # (deg)    Star longitude.
    b    = data[..., 2]  # (deg)    Star latitude.
    p    = data[..., 3]  # (mas)    Parallax.
    ul   = data[..., 4]  # (mas/yr) Proper motion, l direction.
    ub   = data[..., 5]  # (mas/yr) Proper motion, b direction.
    ep   = data[..., 6]  # (mas)    Standard error in parallax.
    el   = data[..., 7]  # (mas/yr) Standard error in proper motion, l direction.
    eb   = data[..., 8]  # (mas/yr) Standard error in proper motion, b direction.
    V    = data[..., 9]  # (mag)    Visual magnitude.
    col  = data[...,10]  # (mag)    Colour index, B-V.
    mult = data[...,11]  # (---)    Stellar multiplicity.
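Before loading the full catalogue, it can help to try the column-slicing pattern on a tiny table. The sketch below is only an illustration: the three rows are made-up numbers (not Hipparcos data), the names `text`, `ident` and `parallax` are hypothetical, and it uses `import numpy as np` with an in-memory text buffer instead of the star imports and the file used in the script.

```python
import io
import numpy as np

# Made-up three-row table (NOT real Hipparcos data), whitespace separated.
text = """1 10.0 -5.0 2.5
2 200.0 30.0 0.8
3 359.0 0.0 12.0"""

# loadtxt accepts a file name or a file-like object.
data = np.loadtxt(io.StringIO(text))

# data[..., k] selects column k for every row, the same as data[:, k].
ident = data[..., 0]
parallax = data[..., 3]

print(data.shape)
print(parallax)
```

If the real file was accidentally saved as HTML, `loadtxt` will most likely fail with an error about converting text to numbers, which is a quick way to notice the problem.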
Extract all stars with a parallax less than 1 mas.

    #
    # mask is a boolean array (True/False, equivalent to ones and zeros).
    # True == Parallax less than 1 mas
    #
    mask = (p < 1)
    subset = p[mask]
    print(size(subset))

Stars with parallax less than 1 mas are located farther than 1000 pc, because the distance in parsecs is 1000 divided by the parallax in mas. Determine the fraction of the Hipparcos catalogue that is farther than 1000 pc.

    print(size(p[mask]) / size(p))

1.1 Map of Hipparcos stars

In this section we are going to make a map of the local region of the galaxy, in order to study the distribution of red and blue stars. To do this well, we need to think about the best way to project the celestial sphere onto a flat plane so as to minimize distortion. I recommend a little-known projection by Soviet cartographer Vladimir Kavrayskiy (1884-1954). It has a simple formula, and does a very good job at preserving area and shape:

$$x = \frac{3l}{2} \sqrt{\frac{1}{3} - \left(\frac{b}{\pi}\right)^2}, \qquad y = b$$

where $b \in [-\pi/2, \pi/2]$ and $l \in [-\pi, \pi]$ are latitude and longitude (respectively). For illustration, here is a map of the Earth in this projection (source: wikipedia.org):

[Figure: map of the Earth in the Kavrayskiy VII projection]
This projection is called Kavrayskiy VII. Enter the following lines to produce a Kavrayskiy VII projection of the red and blue stars in the Hipparcos catalogue:

    # Convert latitude and longitude to radians.
    b = b * pi/180
    l = l * pi/180

    # Wrap around after longitude > pi, so it goes from -pi to pi.
    l[l > pi] = l[l > pi] - 2*pi

    # Do the Kavrayskiy VII projection.
    y = b
    x = l*3/2 * sqrt( 1/3 - (b/pi)**2 )

    # Masks for blue and red stars.
    blue = (col < 0)  # Stars with colour index B-V < 0.0
    red  = (col > 1)  # Stars with colour index B-V > 1.0

    # Final plot.
    figure(1)
    plot( x[red], y[red], 'r.', x[blue], y[blue], 'b.' )
    legend( ['B-V > 1', 'B-V < 0'] )
    title( 'Hipparcos stars - Kavrayskiy VII projection' )
    ylabel( 'Galactic latitude' )
    xlabel( 'Projection of galactic longitude' )

In the end, you should have a plot similar to this:

[Figure: Hipparcos stars with B-V > 1 (red) and B-V < 0 (blue) in the Kavrayskiy VII projection]
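The projection formula can also be wrapped in a small function and checked at a few points where the answer is easy to work out by hand. This is a minimal sketch, not part of the original script: the function name `kavrayskiy7` and the test points are hypothetical, and it uses `import numpy as np` rather than the star imports.

```python
import numpy as np

def kavrayskiy7(l_deg, b_deg):
    """Project longitude and latitude (in degrees) to Kavrayskiy VII (x, y)."""
    l = np.radians(np.asarray(l_deg, dtype=float))
    b = np.radians(np.asarray(b_deg, dtype=float))
    # Wrap longitudes above pi so that l runs from -pi to pi.
    l = np.where(l > np.pi, l - 2 * np.pi, l)
    x = 1.5 * l * np.sqrt(1.0 / 3.0 - (b / np.pi) ** 2)
    y = b
    return x, y

# Test points: origin, l = 90 deg, l = 270 deg (wraps to -90 deg), and the pole.
px, py = kavrayskiy7([0.0, 90.0, 270.0, 0.0], [0.0, 0.0, 0.0, 90.0])
print(px)
print(py)
```

On the equator the square root reduces to sqrt(1/3), so l = 90 degrees should map to x = sqrt(3)*pi/4, and l = 270 degrees to the same value with a minus sign.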
Discuss the distribution of red and blue stars with your classmates. Here are some interesting questions that you might want to think about:

- What are the main differences between the blue and red stars?
- Where in this picture can you find the galactic centre? (latitude = 0, longitude = 0).
- What could cause the two prominent over-densities of blue stars?
- What could cause the deep void of blue stars near the centre of the plot? Why does this void not affect red stars as much?
- Are there important biases in the sample?
- Why are there some blue stars at high galactic latitudes?

Try to have an interesting discussion with your colleagues before moving to the next section.

2 Generating random data

In science it is often necessary or useful to generate simulated data. For example, the first problem set for ASTM21 is to take the 2D galaxy distribution in the Hubble Ultra Deep Field (HUDF) and determine whether it is uniform. For this project it may be helpful to produce a few simulated galaxy distributions that are uniform and try to write a statistic that can distinguish the simulated data from the real Hubble data.

Enter the following code in python. Here we use the random.uniform function to produce a star field with a uniform distribution.

    # Number of stars.
    nstars = 1000

    # X and Y position.
    u_x = random.uniform(0,1,nstars)
    u_y = random.uniform(0,1,nstars)

    # Plot the star field.
    figure(2)
    plot(u_x,u_y, 'b.' )

Enter the following code. This version uses normal instead of uniform. Called as random.normal(0,1,nstars), the function produces random values following the standard normal distribution (zero mean, variance one). Thus, the following version produces a star field more akin to a star cluster.
    # Normal distribution
    c_x = random.normal(0,1,nstars)
    c_y = random.normal(0,1,nstars)

    # Plot the cluster
    figure(3)
    plot(c_x,c_y, 'b.' )

Lastly, we would like to combine these two datasets. That would produce a more realistic star field around an open cluster. The star field would have a combination of stars from the cluster and background stars. Start by plotting the two data sets. You will have to modify the data sets slightly to get reasonable results. Here is my solution, which stretches the uniform field to cover the range -5 to 5, but I encourage you to experiment.

    figure(4)
    plot(u_x*10-5,u_y*10-5, 'r.', c_x,c_y, 'b.' )

Once you are happy with your plot, join the data sets accordingly:

    x = concatenate((u_x*10-5, c_x), axis=0)
    y = concatenate((u_y*10-5, c_y), axis=0)

You have now produced a simulated (x,y) dataset that is similar to what you might observe on a CCD image of a star cluster.

3 Sample application

We want to pick a star at random. Because the data was randomly generated, we can pick star 1. Plot the star field in blue and put a red + on star 1:

    x1 = x[0]
    y1 = y[0]

    figure(5)
    plot(x,y, 'b.', x1,y1, 'r+' )

First, find all the neighbours of star 1. The definition of neighbour is a bit arbitrary. In the following example I define it as the set of stars within distance 1 of star 1. But you should experiment with different distance values:
    # Distance that defines a neighbour
    d = 1

    # Find the distance to every other star.
    r = sqrt( (x1 - x)**2 + (y1 - y)**2 )

    # Select those that have r < d.
    nbhr_x = x[ r < d ]
    nbhr_y = y[ r < d ]

    # Plot the neighbours with a black circle.
    plot(x,y, 'b.', x1,y1, 'r+', nbhr_x,nbhr_y, 'ko', x1,y1, 'r+' )

    # Count the number of neighbours of star 1.
    # (Star 1 itself has r = 0, so it is included in the count.)
    num_neighbours = sum( r < d )
    print(num_neighbours)

We can write a python function to help us find the star cluster in our artificial star field. First, we can define the local density at the point (xp,yp) as the number of stars within distance d of (xp,yp). Write a function to compute the local density of a star field. The indentation is important, as it tells python which lines are part of your function. Put the function in your script file, such that it appears before you need to use it the first time:

    # function: find density around xp,yp given stars in starsx,starsy
    # returns the density
    def density( starsx, starsy, xp, yp ):
        d = 1
        r = sqrt( (xp - starsx)**2 + (yp - starsy)**2 )
        return sum( r < d )

Check that this function is correct by confirming that it gives the same number of neighbours that you obtained earlier for star 1:

    print(density(x,y,x1,y1))

Now plot the local density along the X axis.

    rho = zeros(21)  # allocate memory
    yp = 0
    for i in range(0,21):  # element 0 is included, but 21 is not included!
        xp = i - 10
        rho[i] = density(x,y,xp,yp)

    figure(6)
    plot(arange(-10,11,1),rho)
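A counting function like this is easy to check on a star field small enough to count by hand. The sketch below is an illustration under stated assumptions: `density_within` is a hypothetical variant of the function above that takes the distance `d` as a parameter, the four stars in `sx` and `sy` are made up, and it uses `import numpy as np` rather than the star imports.

```python
import numpy as np

def density_within(starsx, starsy, xp, yp, d=1.0):
    """Count the stars closer than d to the point (xp, yp)."""
    r = np.sqrt((xp - starsx) ** 2 + (yp - starsy) ** 2)
    return int(np.sum(r < d))

# Tiny made-up star field: three stars near the origin, one far away.
sx = np.array([0.0, 0.5, -0.5, 5.0])
sy = np.array([0.0, 0.0, 0.5, 5.0])

print(density_within(sx, sy, 0.0, 0.0))         # 3
print(density_within(sx, sy, 0.0, 0.0, d=0.6))  # 2
print(density_within(sx, sy, 5.0, 5.0))         # 1
```

The last call returns 1 because a star sitting exactly at the query point is at distance zero and therefore counts as its own neighbour.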
The plot is probably not very smooth. Can you improve the for loop to produce a better plot? Based on this plot, how would you define the edge of the cluster?

Alternatively, we could use the density function to determine which stars belong to the star cluster. A simple implementation would look like this:

    cluster_x = []
    cluster_y = []
    rho_min = 50
    for i in range(0,size(x)):
        if density(x,y,x[i],y[i]) > rho_min:
            # Star i is in a dense region: add it to the cluster lists.
            cluster_x.append(x[i])
            cluster_y.append(y[i])

    figure(7)
    plot(x,y, 'b.', cluster_x,cluster_y, 'r+' )

Experiment with different values of rho_min and d. Choose a good set of parameters to find the star cluster. Compare answers with your classmates. Ideally you would like to find a routine that reliably finds the cluster for all the generated data sets. Could the routine be improved if it was based on the mean density of the star field? Try to implement a cluster-finding routine that uses the mean star density rather than a hard-coded value.

4 Code profiling

Solving linear systems (Ax = b) is one of the most common and most expensive operations in scientific computing. For example, this operation is used for linear least squares optimization. In the following code example, we use the python library timeit to profile the cost of this operation:

    import timeit

    t = zeros(500)
    for n in range(1,501):
        A = random.uniform(0,1,(n,n))
        b = random.uniform(0,1,n)
        tic = timeit.default_timer()
        for i in range(0,5):
            linalg.lstsq(A,b)
        toc = timeit.default_timer()
        t[n-1] = (toc-tic) / 5  # n starts on 1 but first index is 0

    figure(8)
    plot(t)
There is a risk that random.uniform() will sometimes produce a matrix that just happens to be easy to invert. Can you suggest a way to improve the above for-loop to minimize this risk?
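One possible approach, offered as a sketch rather than the intended answer, is to draw a fresh random system for every repetition, so that the average is taken over several different matrices instead of one. The helper name `mean_solve_time`, the parameter `repeats` and the three sizes in `sizes` are hypothetical. The sketch uses `import numpy as np` and `np.random.default_rng` (NumPy 1.17 or later) rather than the star imports, and it passes `rcond=None` to `lstsq` to select the current default behaviour.

```python
import timeit
import numpy as np

def mean_solve_time(n, repeats=5, rng=None):
    """Average lstsq time for n-by-n systems, one fresh random system per repeat."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(repeats):
        # Generate the system outside the timed region.
        A = rng.uniform(0, 1, (n, n))
        b = rng.uniform(0, 1, n)
        tic = timeit.default_timer()
        np.linalg.lstsq(A, b, rcond=None)
        toc = timeit.default_timer()
        total += toc - tic
    return total / repeats

# A few small sizes keep the sketch fast; the practical uses n = 1 to 500.
sizes = [10, 20, 40]
t = np.array([mean_solve_time(n) for n in sizes])
print(t)
```

Increasing `repeats` reduces the influence of any single unusually easy or hard matrix, at the cost of a longer run.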