
Finding the fastest stars in the Galaxy
A machine learning approach

MASTER THESIS
Master of Science in Astronomy & Data Science

Author: M. Trueba van den Boom
Student ID:
Supervisors: Assoc. Prof. dr. E. M. Rossi, MSc. T. Marchetti
2nd corrector: Prof. dr. S. Portegies Zwart

Leiden, The Netherlands, 27 June 2018


Finding the fastest stars in the Galaxy: A machine learning approach

M. Trueba van den Boom
Leiden Observatory, University of Leiden
P.O. Box 9513, 2300 RA Leiden, The Netherlands
27 June 2018

Abstract

Hypervelocity stars (HVSs) are stars that surpass the Galactic escape speed at their position and whose orbits originate in the Galactic center. These two characteristics make them unique tools to investigate the Galactic potential and the dark matter halo mass distribution, as well as the stellar population of the Galactic center. Unfortunately, only a handful of HVSs have been found so far. In this thesis, we present the use of a binary classifier artificial neural network to search for HVSs in the recently published second data release (DR2) of the European Space Agency satellite Gaia. We use only the five-parameter astrometric solution of the stars to train the network and make predictions accordingly. To optimize the algorithm's performance, we select the numerical hyperparameters using a Bayesian optimization algorithm. After applying the neural network and subsequent observational cuts to the Gaia DR2 subset of 1.3 billion stars with parallaxes and proper motions, we are left with a clean sample of 140 stars with an 80% predicted probability of being HVSs. Based on the total velocity distribution of the stars for which Gaia provides radial velocities, we expect to find 49 high-velocity stars (v_tot > 400 km s⁻¹) within the final list of candidates. Planned observations with the Very Large Telescope (Chile) and the Isaac Newton Telescope (Spain) will demonstrate the effectiveness of our data mining routine.


Contents

1 Introduction
2 The neural network model
  2.1 Architecture
  2.2 Forward propagation
  2.3 The backpropagation algorithm
  2.4 Overfitting
3 Pre-training setup
  3.1 Building the mock catalog
  3.2 Data Division
  3.3 Feature scaling
4 Hyperparameter tuning
  4.1 Hand chosen selection
    4.1.1 Choice of the activation function
    4.1.2 Choice of the cost function
    4.1.3 Weight initialization
    4.1.4 Gradient descent optimization method
  4.2 A Bayesian Optimization approach
    4.2.1 Gaussian Processes and Expected Improvement
    4.2.2 Optimization results
5 Results
  5.1 Final performance of the algorithm
  5.2 Application to Gaia DR1
    5.2.1 A Monte-Carlo approach for errors
    5.2.2 Comparison to earlier work
  5.3 Application to Gaia DR2
6 Conclusion and future prospects
Acknowledgments
References
A Appendix
  A.1 Size of the training set
  A.2 System parameters
  A.3 Gaia DR2 columns

List of Figures

1.1 Velocity-distance for stars in the Galaxy
2.1 Feed-forward dense neural network architecture
2.2 Single neuron in the neural network
4.1 Sigmoidal activation functions
4.2 Second derivatives of the activations
4.3 Derivative of the sigmoid logistic activation
4.4 Gaussian process and expected improvement
5.1 Learning curves of both catalogs
5.2 Neural network's output for Gaia DR1
5.3 Hypothesis distributions of HVSs and BSs
5.4 Relative error bars against discretized hypothesis
5.5 Parallax and proper motions relative error histograms
5.6 Number of candidates retrieved from previous studies
5.7 Networks' output for previously found HVSs and RSs
5.8 Candidate stars selected per threshold
5.9 Total velocity distribution of the candidate list
A.1 Evaluation scores for percentages of training data

List of Tables

3.1 Parameters for the assumed Galactic potential
3.2 Data division into sets
4.1 Hyperparameter options for the Bayesian Optimization
4.2 Optimal neural network hyperparameters
5.1 Confusion matrices for the neural network predictions
A.1 System parameters
A.2 Description of the Gaia DR2 columns

Chapter 1
Introduction

Hypervelocity stars (HVSs) are among the fastest objects in the Galaxy. Observationally, they are defined by two main attributes: their orbits originate in the Galactic center (GC), and their velocity exceeds the Galactic escape speed at their location, making them unbound from the Milky Way (Brown et al. 2005). The Galactic escape speed attains values of ~600 km s⁻¹ in the central regions of the Galaxy and drops to ~400 km s⁻¹ at Galactocentric distances of ~50 kpc (Williams et al. 2017). Such extreme radial velocities are enough for a HVS to travel from the GC to the outer halo within its typical lifetime. Because of this rare characteristic, they carry a deep imprint of the Galactic potential. Studying HVSs can tell us more about the Milky Way's potential and the dark matter halo mass distribution (Rossi et al. 2017), as well as provide crucial information to constrain the poorly understood properties of the stellar population in the GC (Genzel et al. 2010; Zhang et al. 2013; Madigan et al. 2014).

HVSs are scarce and hard to detect: only around 20 unbound stars have been found so far (Brown et al. 2014). Figure 1.1 shows the total velocity in the Galactic rest frame of high-velocity stars (left) and "normal" background stars (right) as a function of their Galactocentric distance. Only a few points lie above the Galactic escape speed, drawn as a dashed line. These plots illustrate the scarcity of unbound stars in the high-velocity tail of the Galactic velocity distribution.

Several ejection mechanisms have been proposed to explain the extreme velocities that characterize HVSs. The Hills mechanism is the most plausible among them (Hills 1988). It involves the gravitational disruption of a binary star system by the Massive Black Hole (MBH) at the center of the Milky Way, Sagittarius A*. This scenario predicts an isotropic distribution of HVSs in the sky and is consistent with the observationally inferred ejection rate. Other proposed ejection mechanisms include the interaction of a globular cluster with a supermassive black hole (Capuzzo-Dolcetta & Fragione 2015), the encounter between a single star and a massive black hole binary (Sesana et al. 2006, 2008), and the tidal disruption of a dwarf galaxy (Abadi et al. 2009). All aforementioned mechanisms also predict ejected stars that do not acquire sufficient energy during the gravitational interaction to exceed the Galactic escape speed. Such stars remain bound to the Milky Way, and are therefore called bound HVSs (Bromley et al. 2006; Kenyon et al. 2008). Their lower velocities let them adopt a large variety of orbits, preventing easy identification (Marchetti et al. 2017).

Figure 1.1: Total velocity in the Galactic rest frame as a function of Galactocentric distance. The dashed lines show the median posterior escape speed from the Galaxy (Kenyon et al. 2014; Williams et al. 2017). Left: 39 high-velocity stars found by Brown et al. (2018a). The color indicates the probable origin: Galactic center (blue), Galactic disk (red), Galactic halo (green), and ambiguous (empty). Right: all 6,869,707 stars in Gaia DR2 with radial velocities and a relative error on the total velocity smaller than 0.2 (Marchetti et al. 2018a). The colors are proportional to the logarithmic number density of stars. Stars with v ≳ 600 km s⁻¹ are artifacts of large measurement uncertainties and do not represent real stars.

Not all ejection mechanisms are restricted to the GC. High-velocity stars with orbits that are not consistent with a GC origin are referred to as runaway stars (RSs), or hyper-runaway stars (HRSs) if they are unbound from the Galaxy (Blaauw 1961). The two main mechanisms proposed for their origin are the explosion of a supernova in a compact binary system (Portegies Zwart 2000; Geier et al. 2015) and dynamical encounters between stars in dense stellar systems (Leonard & Duncan 1990; Gvaramadze et al. 2009). The rate at which HRSs are produced is estimated to be much lower than that of HVSs (Perets & Šubr 2012; Brown 2015).

Since December 2013, the European Space Agency satellite Gaia has been observing stars in the Milky Way (Gaia Collaboration et al. 2016). In September 2016, the first data release (Gaia DR1; Brown et al. 2016; Lindegren et al. 2016) presented parallaxes and proper motions for a subset of approximately 2 million stars in common with the Tycho-2 catalog (TGAS; Michalik et al. 2015). The second data release (Gaia DR2; Katz et al. 2018; Brown et al. 2018b; Lindegren et al. 2018), available to the public since the 25th of April 2018, is the most complete and accurate stellar map of the Milky Way to date. The full catalog contains over 1.7 billion sources, of which a subset of 1.3 billion have the five-parameter astrometric solution: right ascension (α), declination (δ), parallax (ϖ), and the proper motions in both directions on the sky (μ_α, μ_δ). Gaia DR2 is thus an enormous and unique data set containing stars of still unknown nature. Whereas previous analyses expected Gaia DR1 to contain no more than a couple of HVSs, the expected number of HVSs in Gaia DR2 lies between the hundreds and the thousands (Marchetti et al. 2018b).

Gaia DR2 also provides radial velocities for roughly 7 million sources. For these stars, the total velocity can be computed directly. This has recently been done by Marchetti et al. (2018a), who find 5 HVS candidates and 23 HRS candidates. Other recent studies have looked at the complete catalog and already claim to have found new unbound objects (Boubert et al. 2018; Shen et al. 2018).

We have mentioned the important role HVSs play in providing valuable information about the Galactic potential and the stellar characteristics of the GC. Previous studies have primarily searched for HVSs by looking for young stars in the outer halo. Since the outer halo is not a star-forming region, any young star found there must have come from elsewhere, making such stars outliers that are easy to detect. This strategy, however, induced an observational bias towards the massive B-type stars that were systematically targeted. Gaia provides an unprecedented opportunity to continue the search for HVSs using a different approach. To process such a large amount of data, we need an automated data mining routine capable of detecting a few outliers among more than a billion stars. Marchetti et al. (2017) present a novel technique based on an artificial neural network to search for HVSs in the TGAS subset of Gaia DR1. Artificial neural networks are a type of supervised machine learning algorithm, capable of identifying hidden patterns in multi-dimensional parameter spaces in order to make classifications, which makes them suitable tools for the problem at hand. Although their origins go back to the 1940s, neural networks were scarcely used before the 1990s due to their expensive computations and the absence of proper algorithms to train them. With the increasing improvements in computational power and the development of graphics processing units (GPUs), the popularity of neural networks has risen fast. Nowadays, they are efficient and effective algorithms, widely used for both commercial purposes and scientific research.

In this thesis, we present a careful analysis and optimization of the technique presented by Marchetti et al. (2017) to search for HVSs in Gaia with the help of an artificial neural network. Furthermore, we continue and expand this work, applying the algorithm to Gaia DR1, comparing the results, and finally searching for HVSs in the recently released Gaia DR2. To avoid photometric and metallicity cuts that could bias our search towards specific stellar types, we choose to use only the five astrometric parameters as the input to our data mining routine. Restricting ourselves to a search based solely on astrometry has the advantage that it prevents any a priori assumption on the stellar nature of the HVSs. Since we are only interested in knowing whether a star is a HVS or not, we implement a binary classifier neural network, which outputs a scalar in the range [0, 1], where 0 corresponds to a background star (BS, any star that is not a HVS) and 1 to a HVS.

This thesis is outlined as follows: in Chapter 2 we give a short introduction to the basics of neural networks, in Chapter 3 we discuss the data set used to train the algorithm, and in Chapter 4 we select the network's hyperparameters. The neural network is then applied to Gaia DR1 and DR2 in Chapter 5, and finally we discuss our conclusions and present possible future work in Chapter 6.


Chapter 2
The neural network model

In this chapter we give a short introduction to the fundamentals of artificial neural networks, using the network of Marchetti et al. (2017) as an example. For a more in-depth explanation of neural networks and machine learning in general, we refer the interested reader to Haykin (1998) and Haykin (2009).

2.1 Architecture

The brain processes information with an extremely large number of densely interconnected neurons. The idea behind an artificial neural network is to simulate this network of brain cells inside a computer, in order to get it to learn, recognize, and make decisions in a "human-like" way. A neural network is not explicitly programmed to learn; it does so by itself. A typical neural network can have anything from a few dozen to thousands of artificial neurons arranged in a series of layers. How the layers are interconnected depends mostly on the purpose of the network. For classification problems such as the one presented in this thesis, all neurons in a layer are connected to those in the next layer. This type of architecture is called a feed-forward dense network, since there are no connections within the same layer and the many connections create a dense web between the layers.

Figure 2.1 shows a schematic drawing of a feed-forward dense network with four layers. The first layer, or input layer, contains one neuron for every feature x that is fed to the network: in our case five, one for every astrometric parameter of the star. Neurons in this layer are called input neurons. The last layer, or output layer, contains a single neuron that signals how the network has responded to the learned information. Neurons in this layer are called output neurons. In between the input and output layers there can be one or more hidden layers, which form the body of the network. The name derives from the fact that the inputs and outputs of the neurons in these layers are hidden from the interface of the network. The connections between neurons are represented by numbers, which can be either positive or negative, depending on the influence (or weight) one neuron has on the other. In analogy to the brain, where the junction between two neurons is called a synapse, these numbers are referred to as synaptic weights, or simply as weights.
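To make this architecture concrete, the sketch below assembles a feed-forward dense binary classifier with five input features, two hidden layers and a single sigmoid output neuron, mirroring Figure 2.1. It is only a minimal illustration: the use of Keras, the choice of 30 neurons per hidden layer and the plain tanh activation are assumptions made for this sketch, not the configuration adopted in the thesis (the actual activation and hyperparameters are discussed in §2.2 and Chapter 4).

```python
# Minimal sketch of the feed-forward dense architecture of Figure 2.1.
# The framework (Keras) and the layer widths are illustrative assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="tanh", input_shape=(5,)),  # hidden layer 1 (+ bias)
    tf.keras.layers.Dense(30, activation="tanh"),                    # hidden layer 2 (+ bias)
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # single output neuron in [0, 1]
])
model.summary()  # prints the layer structure and the number of synaptic weights
```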

Figure 2.1: Schematic figure of the architecture of a feed-forward dense neural network. It consists of an input layer with features x_1, x_2, ..., x_n, two hidden layers, and an output layer with a single neuron for binary classification. The +1 neurons represent the bias units. The lines interconnecting the neurons represent the synaptic weights.

2.2 Forward propagation

We have seen that a neural network consists of many neurons stacked in layers. But how does a single neuron work? The input neurons work differently from those in the hidden and output layers: they simply receive the astrometric features of the star and feed them, together with the corresponding synaptic weights, to the neurons in the first hidden layer. Contrary to the input neurons, the hidden neurons take several inputs from the previous layer. To compute its output, a neuron calculates a non-linear function f(z), where z is a linear combination of the input vector x = (x_1, x_2, ..., x_M) and the corresponding weight vector θ = (θ_1, θ_2, ..., θ_M):

z(\vec{x}; \vec{\theta}) = x_0 \theta_0 + \sum_{i=1}^{M} x_i \theta_i,    (2.1)

where M is the number of neurons in the previous layer and x_0 ≡ 1 is the bias unit, an extra neuron added to each pre-output layer. Bias units are not connected to any previous layer, so they do not represent a true activity, but they still have outgoing connections and therefore contribute to the output of the neuron by adding a constant term. This constant allows the function f to shift, whereas the summed part of z can only change its slope, which can be critical for successful learning.

The function f(z) is called the activation function and performs a non-linear mapping of the variable z. It is the combination of many non-linear functions that enables the neural network to recreate an arbitrarily complex function and fit the data properly. Marchetti et al. (2017) choose a variation of the standard hyperbolic tangent for the activation of the neurons in their network:

f(z) = a \tanh(b z),    (2.2)

with a = 1.7159 and b = 2/3 (LeCun et al. 1998). In §4.1.1 we compare this choice with other activation functions and discuss their advantages and disadvantages for classification problems.
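A minimal numpy sketch of Equations 2.1 and 2.2 for one dense layer is shown below; the array shapes, the random weights and the example feature values are assumptions made purely for illustration.

```python
import numpy as np

def lecun_tanh(z, a=1.7159, b=2.0 / 3.0):
    """Scaled hyperbolic tangent activation, f(z) = a tanh(bz) (Eq. 2.2)."""
    return a * np.tanh(b * z)

def forward_layer(x, theta, activation=lecun_tanh):
    """Forward propagation through one dense layer (Eq. 2.1).

    x     : outputs of the previous layer, shape (M,)
    theta : weights including the bias column, shape (s_next, M + 1)
    """
    x_with_bias = np.concatenate(([1.0], x))  # prepend the bias unit x_0 = 1
    z = theta @ x_with_bias                   # linear combination for every neuron
    return activation(z)

# Example: five scaled astrometric features fed to a layer of three neurons.
x = np.array([0.3, -1.2, 0.5, 0.0, 2.1])
theta = np.random.default_rng(0).normal(0.0, 0.1, size=(3, 6))
print(forward_layer(x, theta))
```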

Figure 2.2: Schematic figure of a single artificial neuron in the neural network. The neuron calculates a non-linear function f(z), where z is a linear combination of a set of inputs (x_1, x_2, ..., x_n) together with the bias unit (+1) and their corresponding synaptic weights (θ_0, θ_1, ..., θ_n). It sums them and uses the activation function to map the output into a desirable range.

The resulting output of the activation function is fed to the neurons in the next layer. The output of the neuron in the last layer, h_Θ(\vec{x}), is called the hypothesis, and depends on the features used and the ensemble of weights in the network. For binary classification problems, it is useful to map the hypothesis into the range [0, 1], where 1 corresponds to a HVS and 0 to a BS. This way, we can interpret the output of the neural network as a star's probability of being a HVS.

The synaptic weights of a layer l (except the output layer) can be arranged as a matrix Θ^{(l)} ∈ R^{s_{l+1} × (s_l + 1)}, where the matrix dimensions are given by the number of neurons s_{l+1} in the next layer and the number of neurons in the layer itself, including the bias unit, s_l + 1. (In the following, we use superscripts in round brackets to refer to a particular vector or matrix, and subscripts to specify its components.) The concatenation of all these matrices gives the ensemble of weights in the network, Θ = (Θ^{(0)}, Θ^{(1)}, ..., Θ^{(L-1)}), where L is the total number of layers in the network.

2.3 The backpropagation algorithm

When a trained neural network makes predictions on unlabeled data, the information only flows forward in the net, from the input features to the output hypothesis. During training, however, there is an element of feedback involved that teaches the network to adapt towards the optimal weights. For this, the information also needs to propagate backward through the layers, hence the name of the algorithm. After the hypothesis is predicted, it is compared to the true value of the labeled data using the mean squared error (MSE) cost function:

J(\Theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right)^2.    (2.3)

The MSE is the sum, over every star in the batch, of the squared distance between the hypothesis and the true value of star i (see Chapter 4 for a complete list of the network's hyperparameters and their descriptions). It follows immediately that we need to minimize the cost function in order to optimize the network's results. There are many more cost functions than the MSE, but they all meet that same requirement. In §4.1.2 we compare the MSE to another function that is more suitable for binary classification.

In order to minimize the function J(Θ), we need its partial derivative with respect to every weight in the network. This expression tells us how quickly the loss changes when we change the weights. To get the derivative, we first calculate the error δ of every neuron j for every star i. For simplicity, we omit the star index in Equations 2.4 to 2.7. The error in the output layer L is given by:

\delta^{(L)} = \nabla_{a^{(L)}} J(\Theta) \odot f'\!\left( z^{(L)} \right),    (2.4)

with a^{(L)} the vector of activation outputs of the last layer, and f' the derivative of the activation function f, evaluated at the input values z^{(L)}. Note that for the MSE cost function this equation simply becomes the difference between the predicted and the true output of a star, times the derivative of the activation function:

\delta^{(L)} = \left( a^{(L)} - y \right) \odot f'\!\left( z^{(L)} \right).    (2.5)

Since the errors depend on both the error criterion and the activation function, we can calculate δ in every previous layer l with:

\delta^{(l)} = \left( \left( \Theta^{(l+1)} \right)^{\top} \delta^{(l+1)} \right) \odot f'\!\left( z^{(l)} \right),    (2.6)

where ⊤ denotes the transpose of the underlying matrix and \odot indicates a term-wise matrix multiplication, or Hadamard product. Finally, we obtain the partial derivative of the cost function:

\frac{\partial J(\Theta)}{\partial \Theta^{(l)}_{ij}} = \begin{cases} a^{(l-1)}_{j} \, \delta^{(l)}_{i} & \text{if } j \neq 0 \\ \delta^{(l)}_{i} & \text{if } j = 0, \end{cases}    (2.7)

since j = 0 is a bias unit and there are no connections feeding into it.

The weights are then updated with a gradient descent method, an algorithm that searches for the minimum of a function, in this case the cost function J(Θ). Marchetti et al. (2017) use an adaptive stochastic gradient descent method with a specific learning rate η_k for the t-th update of the k-th weight in the network:

\Delta\theta_k(t) = -\eta_k(t) \, g_k(t) = -\frac{\eta}{\sqrt{\sum_{i=1}^{t} g_k(i)^2}} \, g_k(t),    (2.8)

where η = η_0/(1 + γt) and g is the gradient of the cost function. (Marchetti et al. 2017 use γ = 0, so the denominator becomes 1 and η = η_0.) This implementation is called the AdaGrad optimizer (Duchi et al. 2010). It contains two hyperparameters: the learning rate η, which sets the step size of the weight update, and the learning rate decay γ, which diminishes the learning rate with every iteration to ensure that the algorithm converges towards the minimum and does not scatter due to random noise in the batches. In §4.1.4 we discuss the application of different optimization methods.
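The AdaGrad update of Equation 2.8 can be sketched as follows for a single weight. This is a toy illustration with fabricated gradients, not the thesis's training code; the small eps term is an implementation detail added here to avoid division by zero.

```python
import numpy as np

def adagrad_update(theta_k, grad_history, eta0=0.01, gamma=0.0, t=1, eps=1e-8):
    """One AdaGrad step for the k-th weight (Eq. 2.8).

    grad_history : past gradients g_k(1), ..., g_k(t) of the cost w.r.t. this weight
    gamma = 0 reproduces the choice of Marchetti et al. (2017), i.e. eta = eta0.
    """
    eta = eta0 / (1.0 + gamma * t)
    accumulated = np.sum(np.square(grad_history)) + eps
    return theta_k - eta / np.sqrt(accumulated) * grad_history[-1]

# Toy usage with made-up gradients for three consecutive updates.
theta, history = 0.5, []
for t, g in enumerate([0.4, -0.2, 0.1], start=1):
    history.append(g)
    theta = adagrad_update(theta, history, t=t)
print(theta)
```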

2.4 Overfitting

When implementing a neural network, it is critical to make sure that the algorithm does not unknowingly extract some of the noise in the data as if that variation represented underlying model structure. This problem is called overfitting, and it is one of the major problems in neural networks. The non-linear mapping of the algorithm can, in theory, fit every distribution in the total feature space, correctly classifying outliers in the training set but failing to generalize to the test set (Tetko et al. 1995; Lawrence et al. 1997). One must be careful to avoid this in order to achieve maximal performance on real, unlabeled data.

Regularization techniques counteract overfitting. Three frequently applied methods are l1 regularization, l2 regularization, and dropout. l1 and l2 regularization add an extra term to the cost function in order to suppress the influence that large weights can have on the network's mapping: since the cost function needs to be small, the summation of large terms makes it harder for the gradient descent algorithm to minimize the function. They are given by:

J_{\ell_1}(\Theta) = J(\Theta) + \lambda \sum_{l=1}^{L} \sum_{j=0}^{m} \left| \theta_{l,j} \right|,    (2.9)

J_{\ell_2}(\Theta) = J(\Theta) + \lambda \sum_{l=1}^{L} \sum_{j=0}^{m} \theta_{l,j}^{2},    (2.10)

where L is the number of layers, m the number of neurons in layer l, and λ a regularization parameter that sets the scale of the penalization. Although the two methods look very similar, their behavior differs significantly: the absolute term in l1 penalizes all weights linearly the further they get from zero, while the quadratic term in l2 penalizes large weights much more than small ones, forcing all the weights to be as small as possible (Ng 2004; Moore & DeNero 2011).

Dropout is a technique in which a fraction of the neurons in a layer is ignored during every iteration (Srivastava et al. 2014; Helmbold & Long 2016). Deactivating neurons at random prevents the network from co-adapting and relying too much on a few specific neurons during training. This simple technique reduces overfitting, often better than regularization (Phaisangittisagul 2016). Usual values for dropout go up to 0.2 (20% of the neurons in the layer are dropped). How we implement the three regularization techniques in our network is discussed together with the hyperparameter selection in Chapter 4.
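As an illustration of how l2 regularization and dropout are commonly attached to dense layers, the Keras fragment below adds both. The framework, the penalty strength λ = 10⁻⁴ and the dropout fraction of 0.2 are assumptions for this sketch, not the values adopted in the thesis.

```python
import tensorflow as tf

l2_penalty = tf.keras.regularizers.l2(1e-4)  # lambda in Eq. 2.10 (illustrative value)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="tanh", input_shape=(5,),
                          kernel_regularizer=l2_penalty),
    tf.keras.layers.Dropout(0.2),  # randomly deactivate 20% of the neurons each iteration
    tf.keras.layers.Dense(30, activation="tanh", kernel_regularizer=l2_penalty),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```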


Chapter 3
Pre-training setup

3.1 Building the mock catalog

The neural network needs a significant number of training samples in order to find the boundaries that correctly classify the stars. Since the estimated HVS-to-BS ratio in Gaia is only 1:10⁶ (Marchetti et al. 2018b), using the natural distribution of the stars would result in very imbalanced classes, which usually prevents good classification performance (He & Garcia 2009). To avoid this, we need to oversample the minority class in the training, which can only be achieved by building a mock population of HVSs.

We train the neural network on two independent catalogs, each containing a different population of HVSs distributed around the Galaxy. The first one is the original catalog used by Marchetti et al. (2017), and the second one is called the Hills catalog, generated with a publicly available Python module. In this section we briefly outline the methods used by both catalogs to create the mock population of HVSs. For a more extended explanation, we refer the interested reader to Marchetti et al. (2018b).

Both catalogs use the Hills mechanism to model the initial velocity distribution of stars ejected from the GC. We start by populating the GC with a synthetic population of binary stars, following the method outlined by Rossi et al. (2014, 2017). The binary distributions are modeled as power laws: f_a ∝ a⁻¹ for the semi-major axis a, where the exponent is chosen according to Öpik's law (Öpik 1924), and f_q ∝ q⁻³·⁵ for the mass ratio q. This combination results in a good fit between the observed sample of late B-type HVSs of Brown et al. (2014) and the prediction of the Hills mechanism for reasonable choices of the Galactic potential (Rossi et al. 2017). After the binary is disrupted, one of its stars is ejected with a velocity v_ej (Sari et al. 2010; Kobayashi et al. 2012; Rossi et al. 2014):

v_{\mathrm{ej}} = \sqrt{\frac{2 G m_c}{a}} \left( \frac{M_\bullet}{m_t} \right)^{1/6},    (3.1)

where M_• ≃ 4 × 10⁶ M_⊙ is the mass of the MBH in the Milky Way (Ghez et al. 2008; Gillessen et al. 2009; Meyer et al. 2012) and m_t is the total mass of the binary.

Here m_c is the mass of the companion star (not the one that is ejected) and G is the gravitational constant.

The original catalog directly populates Galactic coordinate space (l, b) with HVSs in the mass range M ∈ [0.1, 9] M_⊙ and at distances d ∈ [0, 40] kpc from the Earth, assuming steps of Δl = 0.9°, Δb = 4.5°, Δr = 1 kpc and ΔM = 0.2 M_⊙. It then computes the proper motions and radial velocities consistent with an object moving away from the GC at that position, correcting for the motion of the Sun and of the local standard of rest (Schönrich 2012). From the ejection velocity distribution it is possible to estimate the flight time of the star, and subsequently its age. Using the stellar evolution code SeBa (Portegies Zwart & Verbunt 1996; Portegies Zwart et al. 2009), we can then obtain the radius, effective temperature, and other stellar parameters. The total velocity is computed by decelerating the stars in a four-component Galactic potential, consisting of a point-mass black hole potential,

\Phi_{\mathrm{BH}}(r_{\mathrm{GC}}) = -\frac{G M_\bullet}{r_{\mathrm{GC}}},    (3.2)

a spherically symmetric bulge modeled as a Hernquist spheroid (Hernquist 1990),

\Phi_{b}(r_{\mathrm{GC}}) = -\frac{G M_b}{r_{\mathrm{GC}} + r_b},    (3.3)

a Miyamoto-Nagai disc in cylindrical coordinates (R_GC, z_GC) (Miyamoto & Nagai 1975),

\Phi_{d}(R_{\mathrm{GC}}, z_{\mathrm{GC}}) = -\frac{G M_d}{\sqrt{R_{\mathrm{GC}}^2 + \left( a_d + \sqrt{z_{\mathrm{GC}}^2 + b_d^2} \right)^2}},    (3.4)

and a Navarro-Frenk-White (NFW) dark matter halo profile (Navarro 1996),

\Phi_{h}(r_{\mathrm{GC}}) = -\frac{G M_h}{r_{\mathrm{GC}}} \ln\left( 1 + \frac{r_{\mathrm{GC}}}{r_s} \right).    (3.5)

Here r_GC denotes the distance from the star to the GC. The mass and radius parameters for the bulge and the disc are taken from Johnston et al. (1995), Price-Whelan et al. (2014) and Hawkins et al. (2015). The parameters for the NFW halo profile are the best-fit values derived in Rossi et al. (2017). All parameters are summarized in Table 3.1.

The Hills catalog uses a more realistic approach: it performs a full orbital integration of the HVSs in the Milky Way, starting from the moment the binary disruption takes place. The initial positions of the stars are at a distance close to the radius of the MBH's gravitational sphere of influence, with a velocity vector pointing radially away from the GC and a magnitude given by Equation 3.1. The orbits are integrated with the Galpy package (Bovy 2015), using a Dormand-Prince integrator (Dormand & Prince 1980). The integrated time is equal to t(M) = ε_1 ε_2 t_MS(M), where M is the mass of the star, t_MS(M) its main-sequence lifetime, and ε_1, ε_2 are two random numbers uniformly distributed in [0, 1]. At the end of the integration, we obtain the velocity at the star's observed position.
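A minimal numpy/astropy sketch of the ejection velocity of Equation 3.1 and the potential components of Equations 3.2-3.5 is given below. The scale radii follow Table 3.1, but the MBH, bulge, disc and halo masses are placeholder values assumed for illustration and should be replaced by the adopted parameters.

```python
import numpy as np
from astropy import units as u
from astropy.constants import G

M_BH = 4e6 * u.Msun  # MBH mass; placeholder value

def v_ejection(m_c, m_t, a):
    """Hills-mechanism ejection velocity (Eq. 3.1)."""
    return (np.sqrt(2 * G * m_c / a) * (M_BH / m_t) ** (1 / 6)).to(u.km / u.s)

# Potential parameters: radii as in Table 3.1, masses are placeholders.
M_b, r_b = 3.4e10 * u.Msun, 0.7 * u.kpc
M_d, a_d, b_d = 1.0e11 * u.Msun, 6.5 * u.kpc, 0.26 * u.kpc
M_h, r_s = 7.6e11 * u.Msun, 24.8 * u.kpc

def phi_bh(r):                                                   # Eq. 3.2
    return -G * M_BH / r

def phi_bulge(r):                                                # Eq. 3.3, Hernquist
    return -G * M_b / (r + r_b)

def phi_disc(R, z):                                              # Eq. 3.4, Miyamoto-Nagai
    return -G * M_d / np.sqrt(R**2 + (a_d + np.sqrt(z**2 + b_d**2))**2)

def phi_halo(r):                                                 # Eq. 3.5, NFW
    return -G * M_h / r * np.log(1 + r / r_s)

# Example: ejection velocity for a 3 + 3 Msun binary with a 0.1 au separation.
print(v_ejection(3 * u.Msun, 6 * u.Msun, 0.1 * u.au))
```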

Table 3.1: Parameters used for the four-component Galactic potential.

Component | Parameters
Bulge     | M_b = … M_⊙, r_b = 0.7 kpc
Disk      | M_d = … M_⊙, a_d = 6.5 kpc, b_d = 0.26 kpc
Halo      | M_h = … M_⊙, r_s = 24.8 kpc

One effect that has to be accounted for is that Gaia only provides parallaxes and proper motions for stars with a Gaia G-band magnitude G_mag ≤ 21. Many samples in the mock-star generator cross this limit, so we remove all stars with G_mag > 21 at the end of the integration.

The original Gaia data contain negative parallaxes. To account for this in the mock data, the distance d is transformed to a parallax as follows. First, the parallax ϖ = 1/d and the relative error in parallax z_ϖ ≡ σ_ϖ/ϖ are estimated using the PyGaia toolkit. Then, for every distance, the parallax is drawn from a Gaussian distribution centered at ϖ and with standard deviation σ_ϖ. In this way, part of the distribution can cross zero and enter the negative regime, which introduces negative parallaxes in the catalog for faint stars with non-negligible relative errors.

The Hills catalog has two major differences with respect to the original catalog. First, the orbital integration allows the orbits of HVSs to deviate from straight lines because of the torque applied on them by the axisymmetric stellar disk. Stars ejected at angles perpendicular to the disk will present straighter orbits than stars ejected at lower angles. Because of this, not all HVSs will point radially away from the GC; instead, they cover the whole space of directions. Second, the velocity distribution of the stars in the Hills catalog peaks at ~400 km s⁻¹, which means that the catalog also contains bound HVSs, which are not present in the original catalog.

We create a Hills catalog containing ~4 × 10⁵ HVSs, which we complement with randomly picked background stars from a simulated end-of-mission Gaia catalog of the Milky Way: the Gaia Universe Model Snapshot (GUMS, Robin et al. 2012). In the end, the complete catalog consists of 1.8 million stars, with a 22%-78% ratio of HVSs to BSs (Table 3.2). See Appendix A.1 for a clarification of the size of the catalog.
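The parallax resampling step described above can be sketched as follows. The numbers are illustrative; in practice the relative error σ_ϖ/ϖ would come from the PyGaia error model rather than being set by hand.

```python
import numpy as np

rng = np.random.default_rng(42)

def mock_parallax(d_kpc, rel_error):
    """Draw an observed parallax (mas) for a star at true distance d (kpc).

    rel_error is the relative parallax error sigma_varpi / varpi, e.g. from PyGaia.
    Faint stars with large errors can scatter into the negative-parallax regime.
    """
    varpi = 1.0 / d_kpc           # true parallax in mas
    sigma = rel_error * varpi
    return rng.normal(loc=varpi, scale=sigma)

# A distant star with a 120% relative error frequently receives a negative parallax.
print(np.array([mock_parallax(20.0, 1.2) for _ in range(5)]))
```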

3.2 Data Division

We randomly split both catalogs into three sets: a training set, a cross-validation set, and a test set. The training set is used for the final training of the neural network. This set has to be the largest of the three, because the algorithm needs a vast sample of stars to be able to correctly draw a boundary in the five-dimensional feature space. The cross-validation set is used to train the network within the Bayesian Optimization algorithm in order to select the optimal hyperparameters (see §4.2). This training cannot be done on the training set, to avoid selecting hyperparameters biased towards a specific group of stars. Lastly, the test set is used to measure the performance of the algorithm (see §5.1). This also needs to be done on an independent set in order to detect when the algorithm is overfitting the training set. The statistics of the sets are laid out in Table 3.2. All sets maintain a 1:3.5 HVS-to-BS ratio.

Table 3.2: Statistics of the data division into training, cross-validation and test sets for both catalogs. We need different sets for the training, the hyperparameter search and the performance testing to avoid biases. The reason for the difference in size and distribution of the sets between the two catalogs is explained in Appendix A.1.

Data set             | % of data | HVS     | BS        | Total stars
Hills catalog
Training set         | 80%       | 319,364 | 1,117,885 | 1,437,249
Cross-validation set | 10%       | 39,…    | …         | 179,656
Test set             | 10%       | 40,…    | …         | 179,657
Total                | 100%      | 399,236 | 1,397,326 | 1,796,562
Original catalog
Training set         | 60%       | 415,013 | 1,102,549 | 1,517,569
Cross-validation set | 20%       | 138,…   | …         | 505,588
Test set             | 20%       | 138,…   | …         | 506,181
Total                | 100%      | 692,268 | 1,837,063 | 2,529,338

3.3 Feature scaling

Before the astrometric features are fed to the neural network, they are normalized to zero mean and unit standard deviation:

\tilde{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},    (3.6)

where x_{ij} is the j-th feature of star i, and μ_j and σ_j are the mean and standard deviation of feature j. This process is necessary because the values of the parameters can vary widely, which can cause objective functions in the network to fail. Further reasons to use feature scaling include a faster convergence of the gradient descent algorithm (§2.3; Ioffe & Szegedy 2015) and an optimal weight initialization (§4.1.3; Glorot & Bengio 2010).
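A sketch of the data division and of the feature scaling of Equation 3.6, using scikit-learn for convenience (an assumption; fitting the scaler on the training set only is standard practice and is not prescribed by the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X: five astrometric features per star, y: labels (1 = HVS, 0 = BS); toy data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 80% / 10% / 10% split, as for the Hills catalog (Table 3.2).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Normalize every feature to zero mean and unit standard deviation (Eq. 3.6).
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```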

Chapter 4
Hyperparameter tuning

A neural network has a set of parameters that need to be carefully tuned in order to reach optimal performance, the so-called hyperparameters (Bengio 2012). Choosing the hyperparameters can be seen as model selection, or as setting the initial conditions of the algorithm. Neural networks can have many hyperparameters, including those that specify the structure of the network itself and those that determine how the network is trained. We identify the following hyperparameters in our network:

- Number of hidden layers and number of neurons per hidden layer. These hyperparameters determine the complexity of the non-linear mapping function. More layers and neurons can capture the structure of the relevant instances more easily. On the other hand, they also make the training more computationally expensive and the algorithm more prone to overfitting, which can be combated by adding regularization to the network.

- Number of epochs. Looping over all the instances only once does not ensure convergence of gradient descent. When the last star has been fed to the network and the weights have been updated, the training set is shuffled and the process starts again. Every loop over the full training set is called an epoch. The more epochs the network is trained for, the closer it gets to convergence, but this can also cause overfitting.

- Batch size. During training, the stars are fed to the neural network in batches. A batch is a group of stars that are processed at the same time. The batch size can range from all the stars in the training set (batch gradient descent) to a few stars (mini-batch gradient descent) or even a single star (stochastic gradient descent). After processing a batch, the network updates its weights to improve the predictions on the next batch.

- Weight initialization. At the beginning of the training, the weights are given random initial values. Depending on these values, the network can converge faster or get stuck more easily in a local minimum (Kumar 2017).

- Cost function. The cost function is the direct in-training measure of the performance of the network. Although cost functions generally share the same properties, some are better suited for classification tasks than others.

- Activation function. As discussed in §2.2, the activation function plays a big role in the training of the neural network. Only a careful choice of the activation per layer ensures maximum performance.

- Optimization method. Gradient descent optimizers determine how the weights are updated after every iteration. Standard optimization methods include the learning rate and decay hyperparameters; other methods can contain extra parameters, such as the exponential decay rates over different moments.

We emphasize that only a correct combination of hyperparameters leads to the optimal performance of the network. Unfortunately, there is no standard recipe for choosing the hyperparameters of a neural network: they are usually selected by hand or by a search algorithm. Ideally, we would perform a systematic grid search over the total hyperparameter space to find the best combination. Unfortunately, the network has to be trained for every combination we try, which requires more computational power and time than we have at our disposal. To get around this issue, Marchetti et al. (2017) use a Particle Swarm Optimization algorithm (PSO, Kennedy & Eberhart 1995) to find three hyperparameters: the number of neurons in the first hidden layer, the number of neurons in the second hidden layer, and the learning rate of the optimizer. The rest of the parameters are chosen a priori. Our model contains 17 hyperparameters (depending on the number of layers, see Table 4.2), which is too many for a PSO. Instead, we first choose by hand those whose characteristics are known well enough to make a confident choice, and then select the numerical ones using a Bayesian Optimization algorithm, which is more suitable than the PSO for the optimization of many parameters, as it needs fewer trials to achieve a comparable result.

4.1 Hand chosen selection

We start with the hyperparameters that we can select a priori. In the following sections, we explain the parameter choices and the reasons behind them.

4.1.1 Choice of the activation function

The choice of activation function in a neural network has a significant effect on the training dynamics. Over the years, many different activation functions have been developed, which makes choosing the right one no easy task (Mhaskar & Micchelli 1994; Ramachandran et al. 2017). In this section, we analyze three popular choices and select the most suitable one for our network.

Activation functions usually satisfy a series of useful properties: they are bounded, monotonic and continuous over their range. The typical activation choices for classification tasks are sigmoid functions, such as the logistic function,

f(z) = \frac{1}{1 + e^{-z}},    (4.1)

or the hyperbolic tangent, f(z) = \tanh(z).

Figure 4.1: Typical activation choices for classification tasks: the sigmoid logistic function (blue), the standard hyperbolic tangent (red), and the variation on the hyperbolic tangent of LeCun et al. (1998) (purple). The points of maximum second derivative of the functions are marked with black dots. For the logistic function, they correspond to the values 0.211 and 0.789 on the y-axis, which become the true values for the BSs and HVSs respectively.

Figure 4.1 shows the logistic (blue) and hyperbolic (red) sigmoid functions. They look rather similar; in fact, tanh is just a scaled logistic function: tanh(z) = 2 logistic(2z) − 1. LeCun et al. (1998) propose a customized tanh function (Equation 2.2, plotted in purple in Figure 4.1) that is more efficient than the standard version. The constants a = 1.7159 and b = 2/3 are chosen such that, when used with scaled features, the variance of the outputs is close to 1, because the effective gain of the function is roughly 1 over its full range. This choice has been shown to yield faster convergence.

For all three activations, the gradient is largest at the function's center, which means that small changes in the values of z in that region cause a significant change in the values of f(z). The functions therefore have a tendency to push the f(z) values towards either end of the curve. For this reason, they are often used for classification purposes, since they make a clear distinction between predictions. A disadvantage of the tanh activation is that it ranges over [-1, 1] instead of the desired [0, 1] for a probabilistic interpretation of the output (there are no probabilistic elements in a neural network, so the outcomes are not really probabilities, but they can be interpreted as such; Saerens et al. 2002). Instead of mapping the result of the tanh onto the right range after the output, we prefer a more natural choice and select the logistic function for the last layer of our network. This is also necessary in order to use the binary cross-entropy cost function described in §4.1.2.
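The relation between the three activations can be checked numerically; the short snippet below verifies that tanh is a rescaled logistic and evaluates the LeCun et al. (1998) variant of Equation 2.2 (purely an illustration).

```python
import numpy as np

def logistic(z):
    """Sigmoid logistic function (Eq. 4.1)."""
    return 1.0 / (1.0 + np.exp(-z))

# tanh is a shifted and rescaled logistic: tanh(z) = 2 * logistic(2z) - 1.
z = np.linspace(-4, 4, 9)
print(np.allclose(np.tanh(z), 2.0 * logistic(2.0 * z) - 1.0))  # True

# The LeCun et al. (1998) variant used for the hidden layers (Eq. 2.2).
lecun_tanh = lambda x: 1.7159 * np.tanh(2.0 * x / 3.0)
print(np.round(lecun_tanh(np.array([-1.0, 0.0, 1.0])), 3))
```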

Figure 4.2: Second derivatives of the three activation functions of Figure 4.1. The extrema are marked with black dots. The y values of the original activations at the x positions of these points mark the optimal true values for the stars in the training set.

Therefore, we select the variation on the hyperbolic tangent activation for all hidden layers, and the sigmoid logistic activation for the output layer.

Contrary to what one might naively suspect, the true values of the stars in the training set should not be placed at the values of the sigmoid's asymptotes. The training would try to push the outputs towards them, resulting in ever-increasing weights in a region where the derivative of the activation is close to zero. This near-zero derivative is, in turn, multiplied into the weight updates, preventing a proper update of the weights. This stops, or at least drastically slows down, the learning process (computations often hit floating-point limits) while a further reduction of the classification errors would still be possible. This problem in deep learning is known as the vanishing gradient problem (Bengio et al. 1994; Hochreiter et al. 2001). Instead, we place the true values at the points of maximum second derivative of the activation function. This way, we take advantage of the non-linearity without saturating the activation, while maintaining an indication of the prediction's uncertainty (LeCun et al. 1998). Figure 4.2 shows the second derivatives of all three activations. The points of maximum second derivative of the logistic function correspond to the values 0.211 and 0.789; these become the true values for the BSs and HVSs in the training set.
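The target values 0.211 and 0.789 can be recovered numerically: the snippet below locates the positive extremum of the logistic function's second derivative by finding the root of its third derivative (scipy is used only for root finding; this is an illustration, not code from the thesis).

```python
import numpy as np
from scipy.optimize import brentq

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_third_derivative(z):
    # f''' = f (1 - f) (1 - 6f + 6f^2); its roots locate the extrema of f''.
    f = logistic(z)
    return f * (1.0 - f) * (1.0 - 6.0 * f + 6.0 * f**2)

# The positive extremum of f'' lies between z = 0.5 and z = 2.
z_star = brentq(logistic_third_derivative, 0.5, 2.0)
print(z_star, logistic(-z_star), logistic(z_star))  # ~1.317, ~0.211, ~0.789
```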

4.1.2 Choice of the cost function

The MSE (Equation 2.3) is a typical choice for the cost function of a neural network and performs fairly well for shallow nets (Zhou & Austin 1998). Nevertheless, during training it presents a problem that drives the gradients to vanish.

Figure 4.3: First derivative of the sigmoid logistic activation. The derivative approaches zero at both extremes, meaning that the error signal of a neuron can be small not only when the star is classified correctly, but also when its output is close to the exact opposite value. This stops the learning process where it should continue and, in time, drives the gradient to values close to zero, creating a vanishing gradient problem.

The problem arises when we compute the gradient of the MSE with the backpropagation algorithm (Equation 2.4):

\nabla_{\Theta} J(\Theta) = \sum_{i=1}^{m} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right) h'_\Theta(\vec{x}^{(i)}) \, \vec{x}^{(i)}.    (4.2)

A term h'_Θ appears in this expression. It is the derivative of the hypothesis, and takes the form of the derivative of the activation function in the output layer. The derivative of the sigmoid logistic function is given by:

f'(z) = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} = f(z)\left( 1 - f(z) \right).    (4.3)

The right-hand side of this equation will prove useful later on (Minai & Williams 1993). If we plot the derivative of the logistic function (Figure 4.3), we see a bell-shaped curve that approaches zero on both sides. This means that the derivative attains small values not only when the output of the neuron is nearly optimal, but also when it comes close to the opposite value. This can cause the gradient to adopt values very close to zero, leading again to the vanishing gradient problem.

A different cost function that solves this problem and is, at the same time, more suitable for classification problems, is the cross-entropy function (Ibrahim & Mohamed 2017):

J(\Theta) = -\sum_{i=1}^{K} y^{(i)} \log\left( h_\Theta(\vec{x}^{(i)}) \right),    (4.4)

where i this time sums over all the possible classification types K. Rewriting this formula for a binary classifier (K = 2) and summing over the batch elements yields the binary cross-entropy cost function:

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( h_\Theta(\vec{x}^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\Theta(\vec{x}^{(i)}) \right) \right].    (4.5)

Although it might look more complicated, it simply states −log P[data | model], the equivalent of Equation 4.4 for binary data. Note that the tanh(z) activation function and its variations cannot be used with this cost function, since negative values would appear in the logarithmic terms. We now calculate the gradient of the binary cross-entropy function by applying the chain rule twice:

\nabla_{\Theta} J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_\Theta(\vec{x}^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\Theta(\vec{x}^{(i)})} \right] h'_\Theta(\vec{x}^{(i)}) \, \vec{x}^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \frac{h'_\Theta(\vec{x}^{(i)}) \, \vec{x}^{(i)}}{h_\Theta(\vec{x}^{(i)}) \left( 1 - h_\Theta(\vec{x}^{(i)}) \right)} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right).    (4.6)

Using the fact that ∂f(z)/∂z = f(z)(1 − f(z)) (Equation 4.3) for the sigmoid logistic activation function, the derivative becomes:

\nabla_{\Theta} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right) \vec{x}^{(i)}.    (4.7)

It now scales directly with the distance between the true value and the predicted output of the stars, without the h' term. In fact, the cross-entropy function was designed specifically to have this attribute, which allows it to converge faster and often better than the MSE (Golik et al. 2013). We therefore choose the binary cross-entropy cost function for the training of the neural network.
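A numpy sketch of the binary cross-entropy cost of Equation 4.5 and of the gradient of Equation 4.7 (valid for a logistic output neuron) is shown below; the toy batch and the clipping guard against log(0) are illustrative additions.

```python
import numpy as np

def binary_cross_entropy(h, y):
    """Binary cross-entropy cost of Eq. 4.5 for hypotheses h and true values y."""
    eps = 1e-12                      # numerical guard against log(0)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def bce_gradient(h, y, X):
    """Gradient of Eq. 4.7: (1/m) * sum (h - y) x, for a logistic output neuron."""
    return (h - y) @ X / len(y)

# Toy batch: three stars with five scaled features each.
X = np.random.default_rng(1).normal(size=(3, 5))
y = np.array([1.0, 0.0, 0.0])
h = np.array([0.7, 0.4, 0.1])
print(binary_cross_entropy(h, y), bce_gradient(h, y, X))
```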

4.1.3 Weight initialization

As we will see in a moment, the initialization of the weights plays a crucial role in the time deep neural networks need to converge; for shallow nets this hyperparameter plays a lesser role. Although we have not yet chosen the number of layers in our network, we know from comparable studies that we will not need more than four or five hidden layers, since our number of features is very small (convolutional networks can use up to thousands of features). Nevertheless, it is worth taking some time to explain how we choose our initialization method.

We know from Equation 2.1 that the linear combination z of a neuron depends on the weight vector θ. If the elements of θ are too large or too small, z will also be large or small, and f(z) will end up at one of the asymptotes of the activation, where the gradient is close to zero. If we reach this point, the network loses its non-linearity and therefore the advantage of having many layers. Weights located over the activation's linear region are large enough to continue the learning process without saturating the sigmoid, and have the advantage that the network learns the linear part of the mapping before the more difficult non-linear part. We also know that the nature of the sigmoid functions drives small weights to become smaller after every layer and large weights to become larger: the more layers, the smaller θ needs to be at the start. We can conclude two things from this:

- If the weights in the neural network start too small, the signal shrinks as it passes through each layer until it is so small that the learning process stops.

- If the weights in the neural network start too big, the signal grows as it passes through each layer until it is so large that the learning process stops.

We need to make sure the initial weights are just right, by keeping the spread of the weights equal through every layer; in other words, we want the variance of the input of every neuron to be equal to the variance of its output. To achieve this, let us look at a single neuron in one of the hidden layers of the network. From §2.2 we know that a neuron has an input vector x with n components (equal to the number of incoming connections), a corresponding weight vector θ, and an output number h, where h is just a linear combination of x and θ (Equation 2.1). We can calculate the variance of θ_i x_i, with i ≤ n:

\mathrm{Var}(\theta_i x_i) = \mathrm{E}[x_i]^2 \mathrm{Var}(\theta_i) + \mathrm{E}[\theta_i]^2 \mathrm{Var}(x_i) + \mathrm{Var}(x_i)\mathrm{Var}(\theta_i),    (4.8)

where E denotes the expected value of a variable. Since we have normalized the features before feeding them to the network, and we can choose the mean of the initial weights to be zero, this formula simplifies to:

\mathrm{Var}(\theta_i x_i) = \mathrm{Var}(x_i)\mathrm{Var}(\theta_i).    (4.9)

Assuming that the inputs and weights are independent and identically distributed, we can now calculate the variance of the output h using the Bienaymé formula (Hoey & Goetschalckx 2010) and Equation 4.9:

\mathrm{Var}(h) = \mathrm{Var}\left( \sum_{i=0}^{n} \theta_i x_i \right) = n \, \mathrm{Var}(x_i)\mathrm{Var}(\theta_i).    (4.10)

This means that the variance of the output scales with the variance of the input. To make them equal to each other, we simply set n Var(θ_i) = 1. Following the same steps for the backpropagated signal, we arrive at a weight variance of m Var(θ_i) = 1, where m is the number of outgoing connections of the neuron. These two constraints can only be satisfied simultaneously if n = m. This is, of course, not always the case, so we use the average of the two instead. The final weight variance becomes:

\mathrm{Var}(\theta_i) = \frac{2}{n + m}.    (4.11)

This derivation was done by Glorot & Bengio (2010) and is called the Xavier initialization (after Glorot's first name). It has shown better overall performance than other methods, such as the one presented in LeCun et al. (1998) (used by Marchetti et al. 2017). Thus, we initialize the weights in the neural network from a normal distribution with zero mean and a standard deviation equal to \sqrt{2/(n + m)}.
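The Xavier initialization of Equation 4.11 amounts to drawing each weight from a zero-mean normal distribution with variance 2/(n + m); a minimal sketch (the layer sizes are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier (Glorot) initialization: Var(theta) = 2 / (n_in + n_out) (Eq. 4.11)."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_out, n_in))

# Weights for a layer receiving 5 inputs and feeding 30 neurons.
theta = xavier_init(5, 30)
print(theta.std(), np.sqrt(2.0 / 35))  # empirical spread vs. target standard deviation
```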


More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

Comparison of Simulated and Observational Catalogs of Hypervelocity Stars in Various Milky Way Potentials

Comparison of Simulated and Observational Catalogs of Hypervelocity Stars in Various Milky Way Potentials Comparison of Simulated and Observational Catalogs of Hypervelocity Stars in Various Milky Way Potentials Shannon Grammel under the direction of Dr. Paola Rebusco Massachusetts Institute of Technology

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Milky Way s Anisotropy Profile with LAMOST/SDSS and Gaia

Milky Way s Anisotropy Profile with LAMOST/SDSS and Gaia Milky Way s Anisotropy Profile with LAMOST/SDSS and Gaia Shanghai Astronomical Observatory In collaboration with Juntai Shen, Xiang Xiang Xue, Chao Liu, Chris Flynn, Chengqun Yang Contents 1 Stellar Halo

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks Philipp Koehn 3 October 207 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

Lecture 6. Regression

Lecture 6. Regression Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

The Milky Way Galaxy (ch. 23)

The Milky Way Galaxy (ch. 23) The Milky Way Galaxy (ch. 23) [Exceptions: We won t discuss sec. 23.7 (Galactic Center) much in class, but read it there will probably be a question or a few on it. In following lecture outline, numbers

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Dark Matter Detection Using Pulsar Timing

Dark Matter Detection Using Pulsar Timing Dark Matter Detection Using Pulsar Timing ABSTRACT An observation program for detecting and studying dark matter subhalos in our galaxy is propsed. The gravitational field of a massive object distorts

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Artificial Neuron (Perceptron)

Artificial Neuron (Perceptron) 9/6/208 Gradient Descent (GD) Hantao Zhang Deep Learning with Python Reading: https://en.wikipedia.org/wiki/gradient_descent Artificial Neuron (Perceptron) = w T = w 0 0 + + w 2 2 + + w d d where

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

Notes on Back Propagation in 4 Lines

Notes on Back Propagation in 4 Lines Notes on Back Propagation in 4 Lines Lili Mou moull12@sei.pku.edu.cn March, 2015 Congratulations! You are reading the clearest explanation of forward and backward propagation I have ever seen. In this

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the

More information

CSCI567 Machine Learning (Fall 2018)

CSCI567 Machine Learning (Fall 2018) CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Machine Learning and Data Mining. Linear regression. Kalev Kask

Machine Learning and Data Mining. Linear regression. Kalev Kask Machine Learning and Data Mining Linear regression Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ Parameters q Learning algorithm Program ( Learner ) Change q Improve performance

More information

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

More information

Training Neural Networks Practical Issues

Training Neural Networks Practical Issues Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from

More information

distribution of mass! The rotation curve of the Galaxy ! Stellar relaxation time! Virial theorem! Differential rotation of the stars in the disk

distribution of mass! The rotation curve of the Galaxy ! Stellar relaxation time! Virial theorem! Differential rotation of the stars in the disk Today in Astronomy 142:! The local standard of rest the Milky Way, continued! Rotation curves and the! Stellar relaxation time! Virial theorem! Differential rotation of the stars in the disk distribution

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Exam # 3 Tue 12/06/2011 Astronomy 100/190Y Exploring the Universe Fall 11 Instructor: Daniela Calzetti

Exam # 3 Tue 12/06/2011 Astronomy 100/190Y Exploring the Universe Fall 11 Instructor: Daniela Calzetti Exam # 3 Tue 12/06/2011 Astronomy 100/190Y Exploring the Universe Fall 11 Instructor: Daniela Calzetti INSTRUCTIONS: Please, use the `bubble sheet and a pencil # 2 to answer the exam questions, by marking

More information

Chapter 9: The Perceptron

Chapter 9: The Perceptron Chapter 9: The Perceptron 9.1 INTRODUCTION At this point in the book, we have completed all of the exercises that we are going to do with the James program. These exercises have shown that distributed

More information

Lecture - 24 Radial Basis Function Networks: Cover s Theorem

Lecture - 24 Radial Basis Function Networks: Cover s Theorem Neural Network and Applications Prof. S. Sengupta Department of Electronic and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 24 Radial Basis Function Networks:

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued) Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -

More information

Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function

Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function Austin Wang Adviser: Xiuyuan Cheng May 4, 2017 1 Abstract This study analyzes how simple recurrent neural

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

More information

Youjun Lu. Na*onal Astronomical Observatory of China Collaborators: Fupeng ZHANG (SYSU) Qingjuan YU (KIAA)

Youjun Lu. Na*onal Astronomical Observatory of China Collaborators: Fupeng ZHANG (SYSU) Qingjuan YU (KIAA) Youjun Lu Na*onal Astronomical Observatory of China 2016.02.08@Aspen Collaborators: Fupeng ZHANG (SYSU) Qingjuan YU (KIAA) 2/11/16 GC conference@aspen 1 Ø Constraining the spin of the massive black hole

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch. Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! November 18, 2015 THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can t even say see ya

More information

Introduction to Convolutional Neural Networks (CNNs)

Introduction to Convolutional Neural Networks (CNNs) Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Lecture 9: Generalization

Lecture 9: Generalization Lecture 9: Generalization Roger Grosse 1 Introduction When we train a machine learning model, we don t just want it to learn to model the training data. We want it to generalize to data it hasn t seen

More information

arxiv: v2 [astro-ph.ga] 19 Sep 2018

arxiv: v2 [astro-ph.ga] 19 Sep 2018 Mon. Not. R. Astron. Soc. 000, 1 15 (2017) Printed 20 September 2018 (MN LATEX style file v2.2) Gaia DR2 in 6D: Searching for the fastest stars in the Galaxy T. Marchetti 1, E. M. Rossi 1 and A. G. A.

More information

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression ECE521: Inference Algorithms and Machine Learning University of Toronto Assignment 1: k-nn and Linear Regression TA: Use Piazza for Q&A Due date: Feb 7 midnight, 2017 Electronic submission to: ece521ta@gmailcom

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information