
Finding the fastest stars in the Galaxy
A machine learning approach

MASTER THESIS
Master of Science in Astronomy & Data Science

Author: M. Trueba van den Boom
Student ID:
Supervisors: Assoc. Prof. dr. E. M. Rossi, MSc. T. Marchetti
2nd corrector: Prof. dr. S. Portegies Zwart

Leiden, The Netherlands, 27 June 2018


Finding the fastest stars in the Galaxy: A machine learning approach

M. Trueba van den Boom
Leiden Observatory, University of Leiden
P.O. Box 9513, 2300 RA Leiden, The Netherlands
27 June 2018

Abstract

Hypervelocity stars (HVSs) are stars that surpass the Galactic escape speed at their position and whose orbits originate in the Galactic center. These two characteristics make them unique tools to investigate the Galactic potential and the dark matter halo mass distribution, as well as the stellar population of the Galactic center. Unfortunately, only a handful of HVSs have been found so far. In this thesis, we present the use of a binary classifier artificial neural network to search for HVSs in the recently published second data release (DR2) of the European Space Agency satellite Gaia. We use only the five-parameter astrometric solution of the stars to train the network and make predictions accordingly. To optimize the algorithm's performance, we select the numerical hyperparameters using a Bayesian optimization algorithm. After applying the neural network and subsequent observational cuts to the Gaia DR2 subset of 1.3 billion stars with parallaxes and proper motions, we are left with a clean sample of 140 stars with an 80% predicted probability of being HVSs. Based on the total velocity distribution of the stars for which Gaia provides radial velocities, we expect to find 49 high-velocity stars (v_tot > 400 km s⁻¹) within the final list of candidates. Planned observations with the Very Large Telescope (Chile) and the Isaac Newton Telescope (Spain) will demonstrate the effectiveness of our data mining routine.


Contents

1 Introduction
2 The neural network model
  2.1 Architecture
  2.2 Forward propagation
  2.3 The backpropagation algorithm
  2.4 Overfitting
3 Pre-training setup
  3.1 Building the mock catalog
  3.2 Data Division
  3.3 Feature scaling
4 Hyperparameter tuning
  4.1 Hand chosen selection
    4.1.1 Choice of the activation function
    4.1.2 Choice of the cost function
    4.1.3 Weight initialization
    4.1.4 Gradient descent optimization method
  4.2 A Bayesian Optimization approach
    4.2.1 Gaussian Processes and Expected Improvement
    4.2.2 Optimization results
5 Results
  5.1 Final performance of the algorithm
  5.2 Application to Gaia DR1
    5.2.1 A Monte-Carlo approach for errors
    5.2.2 Comparison to earlier work
  5.3 Application to Gaia DR2
6 Conclusion and future prospects
Acknowledgments
References
A Appendix
  A.1 Size of the training set
  A.2 System parameters
  A.3 Gaia DR2 columns

List of Figures

1.1 Velocity-distance for stars in the Galaxy
2.1 Feed-forward dense neural network architecture
2.2 Single neuron in the neural network
4.1 Sigmoidal activation functions
4.2 Second derivatives of the activations
4.3 Derivative of the sigmoid logistic activation
4.4 Gaussian process and expected improvement
5.1 Learning curves of both catalogs
5.2 Neural network's output for Gaia DR1
5.3 Hypothesis distributions of HVSs and BSs
5.4 Relative error bars against discretized hypothesis
5.5 Parallax and proper motions relative error histograms
5.6 Number of candidates retrieved from previous studies
5.7 Networks' output for previously found HVSs and RSs
5.8 Candidate stars selected per threshold
5.9 Total velocity distribution of the candidate list
A.1 Evaluation scores for percentages of training data

List of Tables

3.1 Parameters for the assumed Galactic potential
3.2 Data division into sets
4.1 Hyperparameter options for the Bayesian Optimization
4.2 Optimal neural network hyperparameters
5.1 Confusion matrices for the neural network predictions
A.1 System parameters
A.2 Description of the Gaia DR2 columns

Chapter 1
Introduction

Hypervelocity stars (HVSs) are among the fastest objects in the Galaxy. Observationally, they are defined by two main attributes: their orbits originate in the Galactic center (GC), and their velocity exceeds the Galactic escape speed at their location, making them unbound from the Milky Way (Brown et al. 2005). The Galactic escape speed attains values of ~600 km s⁻¹ in the central regions of the Galaxy and drops to ~400 km s⁻¹ at Galactocentric distances of ~50 kpc (Williams et al. 2017). Such extreme radial velocities are enough for a HVS to travel from the GC to the outer halo within its typical lifetime. Because of this rare characteristic, they carry a deep imprint of the Galactic potential. Studying HVSs can tell us more about the Milky Way's potential and the dark matter halo mass distribution (Rossi et al. 2017), as well as provide crucial information to constrain the poorly understood properties of the stellar population in the GC (Genzel et al. 2010; Zhang et al. 2013; Madigan et al. 2014).

HVSs are scarce and hard to detect: only around 20 unbound stars have been found so far (Brown et al. 2014). Figure 1.1 shows the total velocity in the Galactic rest frame of high-velocity stars (left) and "normal" background stars (right) as a function of their Galactocentric distance. Only a few points lie above the Galactic escape speed, drawn as a dashed line. These plots illustrate the scarcity of unbound stars in the high-velocity tail of the Galactic velocity distribution.

Several ejection mechanisms have been proposed to explain the extreme velocities that characterize HVSs. The Hills mechanism is the most plausible among them (Hills 1988). It involves the gravitational disruption of a binary star system by the Massive Black Hole (MBH) at the center of the Milky Way, Sagittarius A*. This scenario predicts an isotropic distribution of HVSs in the sky and is consistent with the observationally inferred ejection rate. Other proposed ejection mechanisms include the interaction of a globular cluster with a supermassive black hole (Capuzzo-Dolcetta & Fragione 2015), the encounter between a single star and a massive black hole binary (Sesana et al. 2006, 2008), and the tidal disruption of a dwarf galaxy (Abadi et al. 2009). All aforementioned mechanisms also predict ejected stars that do not acquire sufficient energy during the gravitational interaction to exceed the Galactic escape speed. Such stars remain bound to the Milky Way, and are therefore called bound HVSs (Bromley et al. 2006; Kenyon et al. 2008). Their lower velocities let them adopt a large variety of orbits, preventing easy identification (Marchetti et al. 2017).

Figure 1.1: Total velocity in the Galactic rest frame as a function of Galactocentric distance. The dashed lines show the median posterior escape speed from the Galaxy (Kenyon et al. 2014; Williams et al. 2017). Left: 39 high-velocity stars found by Brown et al. (2018a). The color indicates the probable origin: Galactic center (blue), Galactic disk (red), Galactic halo (green), and ambiguous (empty). Right: all 6,869,707 stars in Gaia DR2 with radial velocities and a relative error on the total velocity smaller than 0.2 (Marchetti et al. 2018a). The colors are proportional to the logarithmic number density of stars. Stars with v ≳ 600 km s⁻¹ are artifacts of large measurement uncertainties and do not represent real stars.

Not all ejection mechanisms are restricted to the GC. High-velocity stars with orbits that are not consistent with a GC origin are referred to as runaway stars (RSs), or hyper-runaway stars (HRSs) if they are unbound from the Galaxy (Blaauw 1961). The two main mechanisms proposed for their origin are the explosion of a supernova in a compact binary system (Portegies Zwart 2000; Geier et al. 2015) and dynamical encounters between stars in dense stellar systems (Leonard & Duncan 1990; Gvaramadze et al. 2009). The rate at which HRSs are produced is estimated to be much lower than that of HVSs (Perets & Šubr 2012; Brown 2015).

Since December 2013, the European Space Agency satellite Gaia has been observing stars in the Milky Way (Gaia Collaboration et al. 2016). In September 2016, the first data release (Gaia DR1; Brown et al. 2016; Lindegren et al. 2016) presented parallaxes and proper motions for a subset of approximately 2 million stars in common with the Tycho-2 catalog (TGAS; Michalik et al. 2015). The second data release (Gaia DR2; Katz et al. 2018; Brown et al. 2018b; Lindegren et al. 2018), available to the public since the 25th of April 2018, is the most complete and accurate stellar map of the Milky Way to date. The full catalog contains over 1.7 billion sources, of which a subset of 1.3 billion have the five-parameter astrometric solution: right ascension (α), declination (δ), parallax (ϖ), and the proper motions in both directions on the sky (μ_α, μ_δ). Gaia DR2 is thus an enormous and unique data set containing stars of still unknown nature. Whereas previous analyses expected Gaia DR1 to contain no more than a couple of HVSs, the expected number of HVSs in Gaia DR2 lies between the hundreds and the thousands (Marchetti et al. 2018b).

Gaia DR2 also provides radial velocities for roughly 7 million sources. For these stars, the total velocity can be computed directly. This has recently been done by Marchetti et al. (2018a), who find 5 HVS candidates and 23 HRS candidates. Other recent studies have looked at the complete catalog and already claim to have found new unbound objects (Boubert et al. 2018; Shen et al. 2018).

We have mentioned the important role HVSs play in providing valuable information about the Galactic potential and the stellar characteristics of the GC. Previous studies have primarily searched for HVSs by looking for young stars in the outer halo. Since the outer halo is not a star-forming region, any young star found there must have come from elsewhere, making such stars outliers that are easy to detect. This strategy, however, induced an observational bias towards the massive B-type stars that were systematically targeted. Gaia provides an unprecedented opportunity to continue the search for HVSs using a different approach. To process such a large amount of data, we need an automated data mining routine capable of detecting a few outliers among more than a billion stars. Marchetti et al. (2017) present a novel technique based on an artificial neural network to search for HVSs in the TGAS subset of Gaia DR1. Artificial neural networks are a type of supervised machine learning algorithm, capable of identifying hidden patterns in multi-dimensional parameter spaces in order to make classifications, which makes them suitable tools for the problem at hand. Although their origins go back to the 1940s, neural networks were scarcely used before the 1990s due to their expensive computations and the absence of proper algorithms to train them. With the increasing improvements in computational power and the development of graphics processing units (GPUs), the popularity of neural networks has risen fast. Nowadays, they are efficient and effective algorithms, widely used for both commercial purposes and scientific research.

In this thesis, we present a careful analysis and optimization of the technique presented by Marchetti et al. (2017) to search for HVSs in Gaia with the help of an artificial neural network. Furthermore, we continue and expand this work, applying the algorithm to Gaia DR1, comparing the results, and finally searching for HVSs in the recently released Gaia DR2. To avoid photometric and metallicity cuts that could bias our search towards specific stellar types, we choose to use only the five astrometric parameters as the input to our data mining routine. Restricting ourselves to a search based solely on astrometry has the advantage that it prevents any a priori assumption on the stellar nature of the HVSs. Since we are only interested in knowing whether a star is a HVS or not, we implement a binary classifier neural network, which outputs a scalar in the range [0, 1], where 0 corresponds to a background star (BS, any star that is not a HVS) and 1 to a HVS.

This thesis is outlined as follows: in Chapter 2 we give a short introduction to the basics of neural networks, in Chapter 3 we discuss the data set used to train the algorithm, and in Chapter 4 we select the network's hyperparameters. The neural network is then applied to Gaia DR1 and DR2 in Chapter 5, and finally we discuss our conclusions and present possible future work in Chapter 6.


Chapter 2
The neural network model

In this chapter we give a short introduction to the fundamentals of artificial neural networks, using the network of Marchetti et al. (2017) as an example. For a more in-depth explanation of neural networks and machine learning in general, we refer the interested reader to Haykin (1998) and Haykin (2009).

2.1 Architecture

The brain processes information with an extremely large number of densely interconnected neurons. The idea behind an artificial neural network is to simulate this network of brain cells inside a computer, in order to get it to learn, recognize, and make decisions in a "human-like" way. A neural network is not explicitly programmed to learn; it does so by itself. A typical neural network can have anything from a few dozen to thousands of artificial neurons arranged in a series of layers. How the layers are interconnected depends mostly on the purpose of the network. For classification problems such as the one presented in this thesis, all neurons in a layer are connected to those in the next layer. This type of architecture is called a feed-forward dense network, since there are no connections within the same layer and the many connections create a dense web between the layers.

Figure 2.1 shows a schematic drawing of a feed-forward dense network with four layers. The first layer, or input layer, contains one neuron for every feature x that is fed to the network: in our case five, one for every astrometric parameter of the star. Neurons in this layer are called input neurons. The last layer, or output layer, contains a single neuron that signals how the network has responded to the learned information. Neurons in this layer are called output neurons. In between the input and output layers there can be one or more hidden layers, which form the body of the network. The name derives from the fact that the inputs and outputs of the neurons in these layers are hidden from the interface of the network. The connections between neurons are represented by numbers, which can be either positive or negative, depending on the influence (or weight) one neuron has on the other. In analogy to the brain, where the junction between two neurons is called a synapse, these numbers are referred to as synaptic weights, or simply as weights.
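To make this architecture concrete, the sketch below assembles a feed-forward dense binary classifier with five input features, two hidden layers and a single sigmoid output neuron, mirroring Figure 2.1. It is only a minimal illustration: the use of Keras, the choice of 30 neurons per hidden layer and the plain tanh activation are assumptions made for this sketch, not the configuration adopted in the thesis (the actual activation and hyperparameters are discussed in §2.2 and Chapter 4).

```python
# Minimal sketch of the feed-forward dense architecture of Figure 2.1.
# The framework (Keras) and the layer widths are illustrative assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="tanh", input_shape=(5,)),  # hidden layer 1 (+ bias)
    tf.keras.layers.Dense(30, activation="tanh"),                    # hidden layer 2 (+ bias)
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # single output neuron in [0, 1]
])
model.summary()  # prints the layer structure and the number of synaptic weights
```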

Figure 2.1: Schematic figure of the architecture of a feed-forward dense neural network. It consists of an input layer with features x_1, x_2, ..., x_n, two hidden layers, and an output layer with a single neuron for binary classification. The +1 neurons represent the bias units. The lines interconnecting the neurons represent the synaptic weights.

2.2 Forward propagation

We have seen that a neural network consists of many neurons stacked in layers. But how does a single neuron work? The input neurons work differently from those in the hidden and output layers: they simply receive the astrometric features of the star and feed them, together with the corresponding synaptic weights, to the neurons in the first hidden layer. Contrary to the input neurons, the hidden neurons take several inputs from the previous layer. To compute its output, a neuron calculates a non-linear function f(z), where z is a linear combination of the input vector x = (x_1, x_2, ..., x_M) and the corresponding weight vector θ = (θ_1, θ_2, ..., θ_M):

z(\vec{x}; \vec{\theta}) = x_0 \theta_0 + \sum_{i=1}^{M} x_i \theta_i,    (2.1)

where M is the number of neurons in the previous layer and x_0 ≡ 1 is the bias unit, an extra neuron added to each pre-output layer. Bias units are not connected to any previous layer, so they do not represent a true activity, but they still have outgoing connections and therefore contribute to the output of the neuron by adding a constant term. This constant allows the function f to shift, whereas the summed part of z can only change its slope, which can be critical for successful learning.

The function f(z) is called the activation function and performs a non-linear mapping of the variable z. It is the combination of many non-linear functions that enables the neural network to recreate an arbitrarily complex function and fit the data properly. Marchetti et al. (2017) choose a variation of the standard hyperbolic tangent for the activation of the neurons in their network:

f(z) = a \tanh(b z),    (2.2)

with a = 1.7159 and b = 2/3 (LeCun et al. 1998). In §4.1.1 we compare this choice with other activation functions and discuss their advantages and disadvantages for classification problems.
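A minimal numpy sketch of Equations 2.1 and 2.2 for one dense layer is shown below; the array shapes, the random weights and the example feature values are assumptions made purely for illustration.

```python
import numpy as np

def lecun_tanh(z, a=1.7159, b=2.0 / 3.0):
    """Scaled hyperbolic tangent activation, f(z) = a tanh(bz) (Eq. 2.2)."""
    return a * np.tanh(b * z)

def forward_layer(x, theta, activation=lecun_tanh):
    """Forward propagation through one dense layer (Eq. 2.1).

    x     : outputs of the previous layer, shape (M,)
    theta : weights including the bias column, shape (s_next, M + 1)
    """
    x_with_bias = np.concatenate(([1.0], x))  # prepend the bias unit x_0 = 1
    z = theta @ x_with_bias                   # linear combination for every neuron
    return activation(z)

# Example: five scaled astrometric features fed to a layer of three neurons.
x = np.array([0.3, -1.2, 0.5, 0.0, 2.1])
theta = np.random.default_rng(0).normal(0.0, 0.1, size=(3, 6))
print(forward_layer(x, theta))
```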

Figure 2.2: Schematic figure of a single artificial neuron in the neural network. The neuron calculates a non-linear function f(z), where z is a linear combination of a set of inputs (x_1, x_2, ..., x_n) together with the bias unit (+1) and their corresponding synaptic weights (θ_0, θ_1, ..., θ_n). It sums them and uses the activation function to map the output into a desirable range.

The resulting output of the activation function is fed to the neurons in the next layer. The output of the neuron in the last layer, h_Θ(\vec{x}), is called the hypothesis, and depends on the features used and the ensemble of weights in the network. For binary classification problems, it is useful to map the hypothesis into the range [0, 1], where 1 corresponds to a HVS and 0 to a BS. This way, we can interpret the output of the neural network as a star's probability of being a HVS.

The synaptic weights of a layer l (except the output layer) can be arranged as a matrix Θ^{(l)} ∈ R^{s_{l+1} × (s_l + 1)}, where the matrix dimensions are given by the number of neurons s_{l+1} in the next layer and the number of neurons in the layer itself, including the bias unit, s_l + 1. (In the following, we use superscripts in round brackets to refer to a particular vector or matrix, and subscripts to specify its components.) The concatenation of all these matrices gives the ensemble of weights in the network, Θ = (Θ^{(0)}, Θ^{(1)}, ..., Θ^{(L-1)}), where L is the total number of layers in the network.

2.3 The backpropagation algorithm

When a trained neural network makes predictions on unlabeled data, the information only flows forward in the net, from the input features to the output hypothesis. During training, however, there is an element of feedback involved that teaches the network to adapt towards the optimal weights. For this, the information also needs to propagate backward through the layers, hence the name of the algorithm. After the hypothesis is predicted, it is compared to the true value of the labeled data using the mean squared error (MSE) cost function:

J(\Theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right)^2.    (2.3)

The MSE is the sum, over every star in the batch, of the squared distance between the hypothesis and the true value of star i (see Chapter 4 for a complete list of the network's hyperparameters and their descriptions). It follows immediately that we need to minimize the cost function in order to optimize the network's results. There are many more cost functions than the MSE, but they all meet that same requirement. In §4.1.2 we compare the MSE to another function that is more suitable for binary classification.

In order to minimize the function J(Θ), we need its partial derivative with respect to every weight in the network. This expression tells us how quickly the loss changes when we change the weights. To get the derivative, we first calculate the error δ of every neuron j for every star i. For simplicity, we omit the star index in Equations 2.4 to 2.7. The error in the output layer L is given by:

\delta^{(L)} = \nabla_{a^{(L)}} J(\Theta) \odot f'\!\left( z^{(L)} \right),    (2.4)

with a^{(L)} the vector of activation outputs of the last layer, and f' the derivative of the activation function f, evaluated at the input values z^{(L)}. Note that for the MSE cost function this equation simply becomes the difference between the predicted and the true output of a star, times the derivative of the activation function:

\delta^{(L)} = \left( a^{(L)} - y \right) \odot f'\!\left( z^{(L)} \right).    (2.5)

Since the errors depend on both the error criterion and the activation function, we can calculate δ in every previous layer l with:

\delta^{(l)} = \left( \left( \Theta^{(l+1)} \right)^{\top} \delta^{(l+1)} \right) \odot f'\!\left( z^{(l)} \right),    (2.6)

where ⊤ denotes the transpose of the underlying matrix and \odot indicates a term-wise matrix multiplication, or Hadamard product. Finally, we obtain the partial derivative of the cost function:

\frac{\partial J(\Theta)}{\partial \Theta^{(l)}_{ij}} = \begin{cases} a^{(l-1)}_{j} \, \delta^{(l)}_{i} & \text{if } j \neq 0 \\ \delta^{(l)}_{i} & \text{if } j = 0, \end{cases}    (2.7)

since j = 0 is a bias unit and there are no connections feeding into it.

The weights are then updated with a gradient descent method, an algorithm that searches for the minimum of a function, in this case the cost function J(Θ). Marchetti et al. (2017) use an adaptive stochastic gradient descent method with a specific learning rate η_k for the t-th update of the k-th weight in the network:

\Delta\theta_k(t) = -\eta_k(t) \, g_k(t) = -\frac{\eta}{\sqrt{\sum_{i=1}^{t} g_k(i)^2}} \, g_k(t),    (2.8)

where η = η_0/(1 + γt) and g is the gradient of the cost function. (Marchetti et al. 2017 use γ = 0, so the denominator becomes 1 and η = η_0.) This implementation is called the AdaGrad optimizer (Duchi et al. 2010). It contains two hyperparameters: the learning rate η, which sets the step size of the weight update, and the learning rate decay γ, which diminishes the learning rate with every iteration to ensure that the algorithm converges towards the minimum and does not scatter due to random noise in the batches. In §4.1.4 we discuss the application of different optimization methods.
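The AdaGrad update of Equation 2.8 can be sketched as follows for a single weight. This is a toy illustration with fabricated gradients, not the thesis's training code; the small eps term is an implementation detail added here to avoid division by zero.

```python
import numpy as np

def adagrad_update(theta_k, grad_history, eta0=0.01, gamma=0.0, t=1, eps=1e-8):
    """One AdaGrad step for the k-th weight (Eq. 2.8).

    grad_history : past gradients g_k(1), ..., g_k(t) of the cost w.r.t. this weight
    gamma = 0 reproduces the choice of Marchetti et al. (2017), i.e. eta = eta0.
    """
    eta = eta0 / (1.0 + gamma * t)
    accumulated = np.sum(np.square(grad_history)) + eps
    return theta_k - eta / np.sqrt(accumulated) * grad_history[-1]

# Toy usage with made-up gradients for three consecutive updates.
theta, history = 0.5, []
for t, g in enumerate([0.4, -0.2, 0.1], start=1):
    history.append(g)
    theta = adagrad_update(theta, history, t=t)
print(theta)
```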

2.4 Overfitting

When implementing a neural network, it is critical to make sure that the algorithm does not unknowingly extract some of the noise in the data as if that variation represented underlying model structure. This problem is called overfitting, and it is one of the major problems in neural networks. The non-linear mapping of the algorithm can, in theory, fit every distribution in the total feature space, correctly classifying outliers in the training set but failing to generalize to the test set (Tetko et al. 1995; Lawrence et al. 1997). One must be careful to avoid this in order to achieve maximal performance on real, unlabeled data.

Regularization techniques counteract overfitting. Three frequently applied methods are l1 regularization, l2 regularization, and dropout. l1 and l2 regularization add an extra term to the cost function in order to suppress the influence that large weights can have on the network's mapping: since the cost function needs to be small, the summation of large terms makes it harder for the gradient descent algorithm to minimize the function. They are given by:

J_{\ell_1}(\Theta) = J(\Theta) + \lambda \sum_{l=1}^{L} \sum_{j=0}^{m} \left| \theta_{l,j} \right|,    (2.9)

J_{\ell_2}(\Theta) = J(\Theta) + \lambda \sum_{l=1}^{L} \sum_{j=0}^{m} \theta_{l,j}^{2},    (2.10)

where L is the number of layers, m the number of neurons in layer l, and λ a regularization parameter that sets the scale of the penalization. Although the two methods look very similar, their behavior differs significantly: the absolute term in l1 penalizes all weights linearly the further they get from zero, while the quadratic term in l2 penalizes large weights much more than small ones, forcing all the weights to be as small as possible (Ng 2004; Moore & DeNero 2011).

Dropout is a technique in which a fraction of the neurons in a layer is ignored during every iteration (Srivastava et al. 2014; Helmbold & Long 2016). Deactivating neurons at random prevents the network from co-adapting and relying too much on a few specific neurons during training. This simple technique reduces overfitting, often better than regularization (Phaisangittisagul 2016). Usual values for dropout go up to 0.2 (20% of the neurons in the layer are dropped). How we implement the three regularization techniques in our network is discussed together with the hyperparameter selection in Chapter 4.
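As an illustration of how l2 regularization and dropout are commonly attached to dense layers, the Keras fragment below adds both. The framework, the penalty strength λ = 10⁻⁴ and the dropout fraction of 0.2 are assumptions for this sketch, not the values adopted in the thesis.

```python
import tensorflow as tf

l2_penalty = tf.keras.regularizers.l2(1e-4)  # lambda in Eq. 2.10 (illustrative value)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="tanh", input_shape=(5,),
                          kernel_regularizer=l2_penalty),
    tf.keras.layers.Dropout(0.2),  # randomly deactivate 20% of the neurons each iteration
    tf.keras.layers.Dense(30, activation="tanh", kernel_regularizer=l2_penalty),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```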


Chapter 3
Pre-training setup

3.1 Building the mock catalog

The neural network needs a significant number of training samples in order to find the boundaries that correctly classify the stars. Since the estimated HVS-to-BS ratio in Gaia is only 1:10⁶ (Marchetti et al. 2018b), using the natural distribution of the stars would result in very imbalanced classes, which usually prevents good classification performance (He & Garcia 2009). To avoid this, we need to oversample the minority class in the training, which can only be achieved by building a mock population of HVSs.

We train the neural network on two independent catalogs, each containing a different population of HVSs distributed around the Galaxy. The first one is the original catalog used by Marchetti et al. (2017), and the second one is called the Hills catalog, generated with a publicly available Python module. In this section we briefly outline the methods used by both catalogs to create the mock population of HVSs. For a more extended explanation, we refer the interested reader to Marchetti et al. (2018b).

Both catalogs use the Hills mechanism to model the initial velocity distribution of stars ejected from the GC. We start by populating the GC with a synthetic population of binary stars, following the method outlined by Rossi et al. (2014, 2017). The binary distributions are modeled as power laws: f_a ∝ a⁻¹ for the semi-major axis a, where the exponent is chosen according to Öpik's law (Öpik 1924), and f_q ∝ q⁻³·⁵ for the mass ratio q. This combination results in a good fit between the observed sample of late B-type HVSs of Brown et al. (2014) and the prediction of the Hills mechanism for reasonable choices of the Galactic potential (Rossi et al. 2017). After the binary is disrupted, one of its stars is ejected with a velocity v_ej (Sari et al. 2010; Kobayashi et al. 2012; Rossi et al. 2014):

v_{\mathrm{ej}} = \sqrt{\frac{2 G m_c}{a}} \left( \frac{M_\bullet}{m_t} \right)^{1/6},    (3.1)

where M_• ≃ 4 × 10⁶ M_⊙ is the mass of the MBH in the Milky Way (Ghez et al. 2008; Gillessen et al. 2009; Meyer et al. 2012) and m_t is the total mass of the binary.

Here m_c is the mass of the companion star (not the one that is ejected) and G is the gravitational constant.

The original catalog directly populates Galactic coordinate space (l, b) with HVSs in the mass range M ∈ [0.1, 9] M_⊙ and at distances d ∈ [0, 40] kpc from the Earth, assuming steps of Δl = 0.9°, Δb = 4.5°, Δr = 1 kpc and ΔM = 0.2 M_⊙. It then computes the proper motions and radial velocities consistent with an object moving away from the GC at that position, correcting for the motion of the Sun and of the local standard of rest (Schönrich 2012). From the ejection velocity distribution it is possible to estimate the flight time of the star, and subsequently its age. Using the stellar evolution code SeBa (Portegies Zwart & Verbunt 1996; Portegies Zwart et al. 2009), we can then obtain the radius, effective temperature, and other stellar parameters. The total velocity is computed by decelerating the stars in a four-component Galactic potential, consisting of a point-mass black hole potential,

\Phi_{\mathrm{BH}}(r_{\mathrm{GC}}) = -\frac{G M_\bullet}{r_{\mathrm{GC}}},    (3.2)

a spherically symmetric bulge modeled as a Hernquist spheroid (Hernquist 1990),

\Phi_{b}(r_{\mathrm{GC}}) = -\frac{G M_b}{r_{\mathrm{GC}} + r_b},    (3.3)

a Miyamoto-Nagai disc in cylindrical coordinates (R_GC, z_GC) (Miyamoto & Nagai 1975),

\Phi_{d}(R_{\mathrm{GC}}, z_{\mathrm{GC}}) = -\frac{G M_d}{\sqrt{R_{\mathrm{GC}}^2 + \left( a_d + \sqrt{z_{\mathrm{GC}}^2 + b_d^2} \right)^2}},    (3.4)

and a Navarro-Frenk-White (NFW) dark matter halo profile (Navarro 1996),

\Phi_{h}(r_{\mathrm{GC}}) = -\frac{G M_h}{r_{\mathrm{GC}}} \ln\left( 1 + \frac{r_{\mathrm{GC}}}{r_s} \right).    (3.5)

Here r_GC denotes the distance from the star to the GC. The mass and radius parameters for the bulge and the disc are taken from Johnston et al. (1995), Price-Whelan et al. (2014) and Hawkins et al. (2015). The parameters for the NFW halo profile are the best-fit values derived in Rossi et al. (2017). All parameters are summarized in Table 3.1.

The Hills catalog uses a more realistic approach: it performs a full orbital integration of the HVSs in the Milky Way, starting from the moment the binary disruption takes place. The initial positions of the stars are at a distance close to the radius of the MBH's gravitational sphere of influence, with a velocity vector pointing radially away from the GC and a magnitude given by Equation 3.1. The orbits are integrated with the Galpy package (Bovy 2015), using a Dormand-Prince integrator (Dormand & Prince 1980). The integrated time is equal to t(M) = ε_1 ε_2 t_MS(M), where M is the mass of the star, t_MS(M) its main-sequence lifetime, and ε_1, ε_2 are two random numbers uniformly distributed in [0, 1]. At the end of the integration, we obtain the velocity at the star's observed position.
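A minimal numpy/astropy sketch of the ejection velocity of Equation 3.1 and the potential components of Equations 3.2-3.5 is given below. The scale radii follow Table 3.1, but the MBH, bulge, disc and halo masses are placeholder values assumed for illustration and should be replaced by the adopted parameters.

```python
import numpy as np
from astropy import units as u
from astropy.constants import G

M_BH = 4e6 * u.Msun  # MBH mass; placeholder value

def v_ejection(m_c, m_t, a):
    """Hills-mechanism ejection velocity (Eq. 3.1)."""
    return (np.sqrt(2 * G * m_c / a) * (M_BH / m_t) ** (1 / 6)).to(u.km / u.s)

# Potential parameters: radii as in Table 3.1, masses are placeholders.
M_b, r_b = 3.4e10 * u.Msun, 0.7 * u.kpc
M_d, a_d, b_d = 1.0e11 * u.Msun, 6.5 * u.kpc, 0.26 * u.kpc
M_h, r_s = 7.6e11 * u.Msun, 24.8 * u.kpc

def phi_bh(r):                                                   # Eq. 3.2
    return -G * M_BH / r

def phi_bulge(r):                                                # Eq. 3.3, Hernquist
    return -G * M_b / (r + r_b)

def phi_disc(R, z):                                              # Eq. 3.4, Miyamoto-Nagai
    return -G * M_d / np.sqrt(R**2 + (a_d + np.sqrt(z**2 + b_d**2))**2)

def phi_halo(r):                                                 # Eq. 3.5, NFW
    return -G * M_h / r * np.log(1 + r / r_s)

# Example: ejection velocity for a 3 + 3 Msun binary with a 0.1 au separation.
print(v_ejection(3 * u.Msun, 6 * u.Msun, 0.1 * u.au))
```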

Table 3.1: Parameters used for the four-component Galactic potential.

Component | Parameters
Bulge     | M_b = … M_⊙, r_b = 0.7 kpc
Disk      | M_d = … M_⊙, a_d = 6.5 kpc, b_d = 0.26 kpc
Halo      | M_h = … M_⊙, r_s = 24.8 kpc

One effect that has to be accounted for is that Gaia only provides parallaxes and proper motions for stars with a Gaia G-band magnitude G_mag ≤ 21. Many samples in the mock-star generator cross this limit, so we remove all stars with G_mag > 21 at the end of the integration.

The original Gaia data contain negative parallaxes. To account for this in the mock data, the distance d is transformed to a parallax as follows. First, the parallax ϖ = 1/d and the relative error in parallax z_ϖ ≡ σ_ϖ/ϖ are estimated using the PyGaia toolkit. Then, for every distance, the parallax is drawn from a Gaussian distribution centered at ϖ and with standard deviation σ_ϖ. In this way, part of the distribution can cross zero and enter the negative regime, which introduces negative parallaxes in the catalog for faint stars with non-negligible relative errors.

The Hills catalog has two major differences with respect to the original catalog. First, the orbital integration allows the orbits of HVSs to deviate from straight lines because of the torque applied on them by the axisymmetric stellar disk. Stars ejected at angles perpendicular to the disk will present straighter orbits than stars ejected at lower angles. Because of this, not all HVSs will point radially away from the GC; instead, they cover the whole space of directions. Second, the velocity distribution of the stars in the Hills catalog peaks at ~400 km s⁻¹, which means that the catalog also contains bound HVSs, which are not present in the original catalog.

We create a Hills catalog containing ~4 × 10⁵ HVSs, which we complement with randomly picked background stars from a simulated end-of-mission Gaia catalog of the Milky Way: the Gaia Universe Model Snapshot (GUMS, Robin et al. 2012). In the end, the complete catalog consists of 1.8 million stars, with a 22%-78% ratio of HVSs to BSs (Table 3.2). See Appendix A.1 for a clarification of the size of the catalog.
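The parallax resampling step described above can be sketched as follows. The numbers are illustrative; in practice the relative error σ_ϖ/ϖ would come from the PyGaia error model rather than being set by hand.

```python
import numpy as np

rng = np.random.default_rng(42)

def mock_parallax(d_kpc, rel_error):
    """Draw an observed parallax (mas) for a star at true distance d (kpc).

    rel_error is the relative parallax error sigma_varpi / varpi, e.g. from PyGaia.
    Faint stars with large errors can scatter into the negative-parallax regime.
    """
    varpi = 1.0 / d_kpc           # true parallax in mas
    sigma = rel_error * varpi
    return rng.normal(loc=varpi, scale=sigma)

# A distant star with a 120% relative error frequently receives a negative parallax.
print(np.array([mock_parallax(20.0, 1.2) for _ in range(5)]))
```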

3.2 Data Division

We randomly split both catalogs into three sets: a training set, a cross-validation set, and a test set. The training set is used for the final training of the neural network. This set has to be the largest of the three, because the algorithm needs a vast sample of stars to be able to correctly draw a boundary in the five-dimensional feature space. The cross-validation set is used to train the network within the Bayesian Optimization algorithm in order to select the optimal hyperparameters (see §4.2). This training cannot be done on the training set, to avoid selecting hyperparameters biased towards a specific group of stars. Lastly, the test set is used to measure the performance of the algorithm (see §5.1). This also needs to be done on an independent set in order to detect when the algorithm is overfitting the training set. The statistics of the sets are laid out in Table 3.2. All sets maintain a 1:3.5 HVS-to-BS ratio.

Table 3.2: Statistics of the data division into training, cross-validation and test sets for both catalogs. We need different sets for the training, the hyperparameter search and the performance testing to avoid biases. The reason for the difference in size and distribution of the sets between the two catalogs is explained in Appendix A.1.

Data set             | % of data | HVS     | BS        | Total stars
Hills catalog
Training set         | 80%       | 319,364 | 1,117,885 | 1,437,249
Cross-validation set | 10%       | 39,…    | …         | 179,656
Test set             | 10%       | 40,…    | …         | 179,657
Total                | 100%      | 399,236 | 1,397,326 | 1,796,562
Original catalog
Training set         | 60%       | 415,013 | 1,102,549 | 1,517,569
Cross-validation set | 20%       | 138,…   | …         | 505,588
Test set             | 20%       | 138,…   | …         | 506,181
Total                | 100%      | 692,268 | 1,837,063 | 2,529,338

3.3 Feature scaling

Before the astrometric features are fed to the neural network, they are normalized to zero mean and unit standard deviation:

\tilde{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},    (3.6)

where x_{ij} is the j-th feature of star i, and μ_j and σ_j are the mean and standard deviation of feature j. This process is necessary because the values of the parameters can vary widely, which can cause objective functions in the network to fail. Further reasons to use feature scaling include a faster convergence of the gradient descent algorithm (§2.3; Ioffe & Szegedy 2015) and an optimal weight initialization (§4.1.3; Glorot & Bengio 2010).
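A sketch of the data division and of the feature scaling of Equation 3.6, using scikit-learn for convenience (an assumption; fitting the scaler on the training set only is standard practice and is not prescribed by the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X: five astrometric features per star, y: labels (1 = HVS, 0 = BS); toy data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 80% / 10% / 10% split, as for the Hills catalog (Table 3.2).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Normalize every feature to zero mean and unit standard deviation (Eq. 3.6).
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```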

Chapter 4
Hyperparameter tuning

A neural network has a set of parameters that need to be carefully tuned in order to reach optimal performance, the so-called hyperparameters (Bengio 2012). Choosing the hyperparameters can be seen as model selection, or as setting the initial conditions of the algorithm. Neural networks can have many hyperparameters, including those that specify the structure of the network itself and those that determine how the network is trained. We identify the following hyperparameters in our network:

- Number of hidden layers and number of neurons per hidden layer. These hyperparameters determine the complexity of the non-linear mapping function. More layers and neurons can capture the structure of the relevant instances more easily. On the other hand, they also make the training more computationally expensive and the algorithm more prone to overfitting, which can be combated by adding regularization to the network.

- Number of epochs. Looping over all the instances only once does not ensure convergence of gradient descent. When the last star has been fed to the network and the weights have been updated, the training set is shuffled and the process starts again. Every loop over the full training set is called an epoch. The more epochs the network is trained for, the closer it gets to convergence, but this can also cause overfitting.

- Batch size. During training, the stars are fed to the neural network in batches. A batch is a group of stars that are processed at the same time. The batch size can range from all the stars in the training set (batch gradient descent) to a few stars (mini-batch gradient descent) or even a single star (stochastic gradient descent). After processing a batch, the network updates its weights to improve the predictions on the next batch.

- Weight initialization. At the beginning of the training, the weights are given random initial values. Depending on these values, the network can converge faster or get stuck more easily in a local minimum (Kumar 2017).

- Cost function. The cost function is the direct in-training measure of the performance of the network. Although cost functions generally share the same properties, some are better suited for classification tasks than others.

- Activation function. As discussed in §2.2, the activation function plays a big role in the training of the neural network. Only a careful choice of the activation per layer ensures maximum performance.

- Optimization method. Gradient descent optimizers determine how the weights are updated after every iteration. Standard optimization methods include the learning rate and decay hyperparameters; other methods can contain extra parameters, such as the exponential decay rates over different moments.

We emphasize that only a correct combination of hyperparameters leads to the optimal performance of the network. Unfortunately, there is no standard recipe for choosing the hyperparameters of a neural network: they are usually selected by hand or by a search algorithm. Ideally, we would perform a systematic grid search over the total hyperparameter space to find the best combination. Unfortunately, the network has to be trained for every combination we try, which requires more computational power and time than we have at our disposal. To get around this issue, Marchetti et al. (2017) use a Particle Swarm Optimization algorithm (PSO, Kennedy & Eberhart 1995) to find three hyperparameters: the number of neurons in the first hidden layer, the number of neurons in the second hidden layer, and the learning rate of the optimizer. The rest of the parameters are chosen a priori. Our model contains 17 hyperparameters (depending on the number of layers, see Table 4.2), which is too many for a PSO. Instead, we first choose by hand those whose characteristics are known well enough to make a confident choice, and then select the numerical ones using a Bayesian Optimization algorithm, which is more suitable than the PSO for the optimization of many parameters, as it needs fewer trials to achieve a comparable result.

4.1 Hand chosen selection

We start with the hyperparameters that we can select a priori. In the following sections, we explain the parameter choices and the reasons behind them.

4.1.1 Choice of the activation function

The choice of activation function in a neural network has a significant effect on the training dynamics. Over the years, many different activation functions have been developed, which makes choosing the right one no easy task (Mhaskar & Micchelli 1994; Ramachandran et al. 2017). In this section, we analyze three popular choices and select the most suitable one for our network.

Activation functions usually satisfy a series of useful properties: they are bounded, monotonic and continuous over their range. The typical activation choices for classification tasks are sigmoid functions, such as the logistic function,

f(z) = \frac{1}{1 + e^{-z}},    (4.1)

or the hyperbolic tangent, f(z) = \tanh(z).

Figure 4.1: Typical activation choices for classification tasks: the sigmoid logistic function (blue), the standard hyperbolic tangent (red), and the variation on the hyperbolic tangent of LeCun et al. (1998) (purple). The points of maximum second derivative of the functions are marked with black dots. For the logistic function, they correspond to the values 0.211 and 0.789 on the y-axis, which become the true values for the BSs and HVSs respectively.

Figure 4.1 shows the logistic (blue) and hyperbolic (red) sigmoid functions. They look rather similar; in fact, tanh is just a scaled logistic function: tanh(z) = 2 logistic(2z) − 1. LeCun et al. (1998) propose a customized tanh function (Equation 2.2, plotted in purple in Figure 4.1) that is more efficient than the standard version. The constants a = 1.7159 and b = 2/3 are chosen such that, when used with scaled features, the variance of the outputs is close to 1, because the effective gain of the function is roughly 1 over its full range. This choice has been shown to yield faster convergence.

For all three activations, the gradient is largest at the function's center, which means that small changes in the values of z in that region cause a significant change in the values of f(z). The functions therefore have a tendency to push the f(z) values towards either end of the curve. For this reason, they are often used for classification purposes, since they make a clear distinction between predictions. A disadvantage of the tanh activation is that it ranges over [-1, 1] instead of the desired [0, 1] for a probabilistic interpretation of the output (there are no probabilistic elements in a neural network, so the outcomes are not really probabilities, but they can be interpreted as such; Saerens et al. 2002). Instead of mapping the result of the tanh onto the right range after the output, we prefer a more natural choice and select the logistic function for the last layer of our network. This is also necessary in order to use the binary cross-entropy cost function described in §4.1.2.
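The relation between the three activations can be checked numerically; the short snippet below verifies that tanh is a rescaled logistic and evaluates the LeCun et al. (1998) variant of Equation 2.2 (purely an illustration).

```python
import numpy as np

def logistic(z):
    """Sigmoid logistic function (Eq. 4.1)."""
    return 1.0 / (1.0 + np.exp(-z))

# tanh is a shifted and rescaled logistic: tanh(z) = 2 * logistic(2z) - 1.
z = np.linspace(-4, 4, 9)
print(np.allclose(np.tanh(z), 2.0 * logistic(2.0 * z) - 1.0))  # True

# The LeCun et al. (1998) variant used for the hidden layers (Eq. 2.2).
lecun_tanh = lambda x: 1.7159 * np.tanh(2.0 * x / 3.0)
print(np.round(lecun_tanh(np.array([-1.0, 0.0, 1.0])), 3))
```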

Figure 4.2: Second derivatives of the three activation functions of Figure 4.1. The extrema are marked with black dots. The y values of the original activations at the x positions of these points mark the optimal true values for the stars in the training set.

Therefore, we select the variation on the hyperbolic tangent activation for all hidden layers, and the sigmoid logistic activation for the output layer.

Contrary to what one might naively suspect, the true values of the stars in the training set should not be placed at the values of the sigmoid's asymptotes. The training would try to push the outputs towards them, resulting in ever-increasing weights in a region where the derivative of the activation is close to zero. This near-zero derivative is, in turn, multiplied into the weight updates, preventing a proper update of the weights. This stops, or at least drastically slows down, the learning process (computations often hit floating-point limits) while a further reduction of the classification errors would still be possible. This problem in deep learning is known as the vanishing gradient problem (Bengio et al. 1994; Hochreiter et al. 2001). Instead, we place the true values at the points of maximum second derivative of the activation function. This way, we take advantage of the non-linearity without saturating the activation, while maintaining an indication of the prediction's uncertainty (LeCun et al. 1998). Figure 4.2 shows the second derivatives of all three activations. The points of maximum second derivative of the logistic function correspond to the values 0.211 and 0.789; these become the true values for the BSs and HVSs in the training set.
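The target values 0.211 and 0.789 can be recovered numerically: the snippet below locates the positive extremum of the logistic function's second derivative by finding the root of its third derivative (scipy is used only for root finding; this is an illustration, not code from the thesis).

```python
import numpy as np
from scipy.optimize import brentq

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_third_derivative(z):
    # f''' = f (1 - f) (1 - 6f + 6f^2); its roots locate the extrema of f''.
    f = logistic(z)
    return f * (1.0 - f) * (1.0 - 6.0 * f + 6.0 * f**2)

# The positive extremum of f'' lies between z = 0.5 and z = 2.
z_star = brentq(logistic_third_derivative, 0.5, 2.0)
print(z_star, logistic(-z_star), logistic(z_star))  # ~1.317, ~0.211, ~0.789
```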

4.1.2 Choice of the cost function

The MSE (Equation 2.3) is a typical choice for the cost function of a neural network and performs fairly well for shallow nets (Zhou & Austin 1998). Nevertheless, during training it presents a problem that drives the gradients to vanish.

Figure 4.3: First derivative of the sigmoid logistic activation. The derivative approaches zero at both extremes, meaning that the error signal of a neuron can be small not only when the star is classified correctly, but also when its output is close to the exact opposite value. This stops the learning process where it should continue and, in time, drives the gradient to values close to zero, creating a vanishing gradient problem.

The problem arises when we compute the gradient of the MSE with the backpropagation algorithm (Equation 2.4):

\nabla_{\Theta} J(\Theta) = \sum_{i=1}^{m} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right) h'_\Theta(\vec{x}^{(i)}) \, \vec{x}^{(i)}.    (4.2)

A term h'_Θ appears in this expression. It is the derivative of the hypothesis, and takes the form of the derivative of the activation function in the output layer. The derivative of the sigmoid logistic function is given by:

f'(z) = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} = f(z)\left( 1 - f(z) \right).    (4.3)

The right-hand side of this equation will prove useful later on (Minai & Williams 1993). If we plot the derivative of the logistic function (Figure 4.3), we see a bell-shaped curve that approaches zero on both sides. This means that the derivative attains small values not only when the output of the neuron is nearly optimal, but also when it comes close to the opposite value. This can cause the gradient to adopt values very close to zero, leading again to the vanishing gradient problem.

A different cost function that solves this problem and is, at the same time, more suitable for classification problems, is the cross-entropy function (Ibrahim & Mohamed 2017):

J(\Theta) = -\sum_{i=1}^{K} y^{(i)} \log\left( h_\Theta(\vec{x}^{(i)}) \right),    (4.4)

where i this time sums over all the possible classification types K. Rewriting this formula for a binary classifier (K = 2) and summing over the batch elements yields the binary cross-entropy cost function:

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( h_\Theta(\vec{x}^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\Theta(\vec{x}^{(i)}) \right) \right].    (4.5)

Although it might look more complicated, it simply states −log P[data | model], the equivalent of Equation 4.4 for binary data. Note that the tanh(z) activation function and its variations cannot be used with this cost function, since negative values would appear in the logarithmic terms. We now calculate the gradient of the binary cross-entropy function by applying the chain rule twice:

\nabla_{\Theta} J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_\Theta(\vec{x}^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\Theta(\vec{x}^{(i)})} \right] h'_\Theta(\vec{x}^{(i)}) \, \vec{x}^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \frac{h'_\Theta(\vec{x}^{(i)}) \, \vec{x}^{(i)}}{h_\Theta(\vec{x}^{(i)}) \left( 1 - h_\Theta(\vec{x}^{(i)}) \right)} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right).    (4.6)

Using the fact that ∂f(z)/∂z = f(z)(1 − f(z)) (Equation 4.3) for the sigmoid logistic activation function, the derivative becomes:

\nabla_{\Theta} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\Theta(\vec{x}^{(i)}) - y^{(i)} \right) \vec{x}^{(i)}.    (4.7)

It now scales directly with the distance between the true value and the predicted output of the stars, without the h' term. In fact, the cross-entropy function was designed specifically to have this attribute, which allows it to converge faster and often better than the MSE (Golik et al. 2013). We therefore choose the binary cross-entropy cost function for the training of the neural network.
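A numpy sketch of the binary cross-entropy cost of Equation 4.5 and of the gradient of Equation 4.7 (valid for a logistic output neuron) is shown below; the toy batch and the clipping guard against log(0) are illustrative additions.

```python
import numpy as np

def binary_cross_entropy(h, y):
    """Binary cross-entropy cost of Eq. 4.5 for hypotheses h and true values y."""
    eps = 1e-12                      # numerical guard against log(0)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def bce_gradient(h, y, X):
    """Gradient of Eq. 4.7: (1/m) * sum (h - y) x, for a logistic output neuron."""
    return (h - y) @ X / len(y)

# Toy batch: three stars with five scaled features each.
X = np.random.default_rng(1).normal(size=(3, 5))
y = np.array([1.0, 0.0, 0.0])
h = np.array([0.7, 0.4, 0.1])
print(binary_cross_entropy(h, y), bce_gradient(h, y, X))
```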

4.1.3 Weight initialization

As we will see in a moment, the initialization of the weights plays a crucial role in the time deep neural networks need to converge; for shallow nets this hyperparameter plays a lesser role. Although we have not yet chosen the number of layers in our network, we know from comparable studies that we will not need more than four or five hidden layers, since our number of features is very small (convolutional networks can use up to thousands of features). Nevertheless, it is worth taking some time to explain how we choose our initialization method.

We know from Equation 2.1 that the linear combination z of a neuron depends on the weight vector θ. If the elements of θ are too large or too small, z will also be large or small, and f(z) will end up at one of the asymptotes of the activation, where the gradient is close to zero. If we reach this point, the network loses its non-linearity and therefore the advantage of having many layers. Weights located over the activation's linear region are large enough to continue the learning process without saturating the sigmoid, and have the advantage that the network learns the linear part of the mapping before the more difficult non-linear part. We also know that the nature of the sigmoid functions drives small weights to become smaller after every layer and large weights to become larger: the more layers, the smaller θ needs to be at the start. We can conclude two things from this:

- If the weights in the neural network start too small, the signal shrinks as it passes through each layer until it is so small that the learning process stops.

- If the weights in the neural network start too big, the signal grows as it passes through each layer until it is so large that the learning process stops.

We need to make sure the initial weights are just right, by keeping the spread of the weights equal through every layer; in other words, we want the variance of the input of every neuron to be equal to the variance of its output. To achieve this, let us look at a single neuron in one of the hidden layers of the network. From §2.2 we know that a neuron has an input vector x with n components (equal to the number of incoming connections), a corresponding weight vector θ, and an output number h, where h is just a linear combination of x and θ (Equation 2.1). We can calculate the variance of θ_i x_i, with i ≤ n:

\mathrm{Var}(\theta_i x_i) = \mathrm{E}[x_i]^2 \mathrm{Var}(\theta_i) + \mathrm{E}[\theta_i]^2 \mathrm{Var}(x_i) + \mathrm{Var}(x_i)\mathrm{Var}(\theta_i),    (4.8)

where E denotes the expected value of a variable. Since we have normalized the features before feeding them to the network, and we can choose the mean of the initial weights to be zero, this formula simplifies to:

\mathrm{Var}(\theta_i x_i) = \mathrm{Var}(x_i)\mathrm{Var}(\theta_i).    (4.9)

Assuming that the inputs and weights are independent and identically distributed, we can now calculate the variance of the output h using the Bienaymé formula (Hoey & Goetschalckx 2010) and Equation 4.9:

\mathrm{Var}(h) = \mathrm{Var}\left( \sum_{i=0}^{n} \theta_i x_i \right) = n \, \mathrm{Var}(x_i)\mathrm{Var}(\theta_i).    (4.10)

This means that the variance of the output scales with the variance of the input. To make them equal to each other, we simply set n Var(θ_i) = 1. Following the same steps for the backpropagated signal, we arrive at a weight variance of m Var(θ_i) = 1, where m is the number of outgoing connections of the neuron. These two constraints can only be satisfied simultaneously if n = m. This is, of course, not always the case, so we use the average of the two instead. The final weight variance becomes:

\mathrm{Var}(\theta_i) = \frac{2}{n + m}.    (4.11)

This derivation was done by Glorot & Bengio (2010) and is called the Xavier initialization (after Glorot's first name). It has shown better overall performance than other methods, such as the one presented in LeCun et al. (1998) (used by Marchetti et al. 2017). Thus, we initialize the weights in the neural network from a normal distribution with zero mean and a standard deviation equal to \sqrt{2/(n + m)}.
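The Xavier initialization of Equation 4.11 amounts to drawing each weight from a zero-mean normal distribution with variance 2/(n + m); a minimal sketch (the layer sizes are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier (Glorot) initialization: Var(theta) = 2 / (n_in + n_out) (Eq. 4.11)."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_out, n_in))

# Weights for a layer receiving 5 inputs and feeding 30 neurons.
theta = xavier_init(5, 30)
print(theta.std(), np.sqrt(2.0 / 35))  # empirical spread vs. target standard deviation
```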


More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

Comparison of Simulated and Observational Catalogs of Hypervelocity Stars in Various Milky Way Potentials

Comparison of Simulated and Observational Catalogs of Hypervelocity Stars in Various Milky Way Potentials Comparison of Simulated and Observational Catalogs of Hypervelocity Stars in Various Milky Way Potentials Shannon Grammel under the direction of Dr. Paola Rebusco Massachusetts Institute of Technology

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Milky Way s Anisotropy Profile with LAMOST/SDSS and Gaia

Milky Way s Anisotropy Profile with LAMOST/SDSS and Gaia Milky Way s Anisotropy Profile with LAMOST/SDSS and Gaia Shanghai Astronomical Observatory In collaboration with Juntai Shen, Xiang Xiang Xue, Chao Liu, Chris Flynn, Chengqun Yang Contents 1 Stellar Halo

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks Philipp Koehn 3 October 207 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

Lecture 6. Regression

Lecture 6. Regression Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

The Milky Way Galaxy (ch. 23)

The Milky Way Galaxy (ch. 23) The Milky Way Galaxy (ch. 23) [Exceptions: We won t discuss sec. 23.7 (Galactic Center) much in class, but read it there will probably be a question or a few on it. In following lecture outline, numbers

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Dark Matter Detection Using Pulsar Timing

Dark Matter Detection Using Pulsar Timing Dark Matter Detection Using Pulsar Timing ABSTRACT An observation program for detecting and studying dark matter subhalos in our galaxy is propsed. The gravitational field of a massive object distorts

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Artificial Neuron (Perceptron)

Artificial Neuron (Perceptron) 9/6/208 Gradient Descent (GD) Hantao Zhang Deep Learning with Python Reading: https://en.wikipedia.org/wiki/gradient_descent Artificial Neuron (Perceptron) = w T = w 0 0 + + w 2 2 + + w d d where

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Neural Networks Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1 Brains as Computational Devices Brains advantages with respect to digital computers: Massively parallel Fault-tolerant Reliable

More information

Notes on Back Propagation in 4 Lines

Notes on Back Propagation in 4 Lines Notes on Back Propagation in 4 Lines Lili Mou moull12@sei.pku.edu.cn March, 2015 Congratulations! You are reading the clearest explanation of forward and backward propagation I have ever seen. In this

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the

More information

CSCI567 Machine Learning (Fall 2018)

CSCI567 Machine Learning (Fall 2018) CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Machine Learning and Data Mining. Linear regression. Kalev Kask

Machine Learning and Data Mining. Linear regression. Kalev Kask Machine Learning and Data Mining Linear regression Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ Parameters q Learning algorithm Program ( Learner ) Change q Improve performance

More information

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

More information

Training Neural Networks Practical Issues

Training Neural Networks Practical Issues Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from

More information

distribution of mass! The rotation curve of the Galaxy ! Stellar relaxation time! Virial theorem! Differential rotation of the stars in the disk

distribution of mass! The rotation curve of the Galaxy ! Stellar relaxation time! Virial theorem! Differential rotation of the stars in the disk Today in Astronomy 142:! The local standard of rest the Milky Way, continued! Rotation curves and the! Stellar relaxation time! Virial theorem! Differential rotation of the stars in the disk distribution

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Exam # 3 Tue 12/06/2011 Astronomy 100/190Y Exploring the Universe Fall 11 Instructor: Daniela Calzetti

Exam # 3 Tue 12/06/2011 Astronomy 100/190Y Exploring the Universe Fall 11 Instructor: Daniela Calzetti Exam # 3 Tue 12/06/2011 Astronomy 100/190Y Exploring the Universe Fall 11 Instructor: Daniela Calzetti INSTRUCTIONS: Please, use the `bubble sheet and a pencil # 2 to answer the exam questions, by marking

More information

Chapter 9: The Perceptron

Chapter 9: The Perceptron Chapter 9: The Perceptron 9.1 INTRODUCTION At this point in the book, we have completed all of the exercises that we are going to do with the James program. These exercises have shown that distributed

More information

Lecture - 24 Radial Basis Function Networks: Cover s Theorem

Lecture - 24 Radial Basis Function Networks: Cover s Theorem Neural Network and Applications Prof. S. Sengupta Department of Electronic and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 24 Radial Basis Function Networks:

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued) Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -

More information

Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function

Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function Austin Wang Adviser: Xiuyuan Cheng May 4, 2017 1 Abstract This study analyzes how simple recurrent neural

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

More information

Youjun Lu. Na*onal Astronomical Observatory of China Collaborators: Fupeng ZHANG (SYSU) Qingjuan YU (KIAA)

Youjun Lu. Na*onal Astronomical Observatory of China Collaborators: Fupeng ZHANG (SYSU) Qingjuan YU (KIAA) Youjun Lu Na*onal Astronomical Observatory of China 2016.02.08@Aspen Collaborators: Fupeng ZHANG (SYSU) Qingjuan YU (KIAA) 2/11/16 GC conference@aspen 1 Ø Constraining the spin of the massive black hole

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch. Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! November 18, 2015 THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can t even say see ya

More information

Introduction to Convolutional Neural Networks (CNNs)

Introduction to Convolutional Neural Networks (CNNs) Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Lecture 9: Generalization

Lecture 9: Generalization Lecture 9: Generalization Roger Grosse 1 Introduction When we train a machine learning model, we don t just want it to learn to model the training data. We want it to generalize to data it hasn t seen

More information

arxiv: v2 [astro-ph.ga] 19 Sep 2018

arxiv: v2 [astro-ph.ga] 19 Sep 2018 Mon. Not. R. Astron. Soc. 000, 1 15 (2017) Printed 20 September 2018 (MN LATEX style file v2.2) Gaia DR2 in 6D: Searching for the fastest stars in the Galaxy T. Marchetti 1, E. M. Rossi 1 and A. G. A.

More information

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression

ECE521: Inference Algorithms and Machine Learning University of Toronto. Assignment 1: k-nn and Linear Regression ECE521: Inference Algorithms and Machine Learning University of Toronto Assignment 1: k-nn and Linear Regression TA: Use Piazza for Q&A Due date: Feb 7 midnight, 2017 Electronic submission to: ece521ta@gmailcom

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information