CS 540: Machine Learning Lecture 1: Introduction

AD, January 2008

Acknowledgments: Thanks to Nando de Freitas and Kevin Murphy.

Administrivia & Announcement Webpage: http://www.cs.ubc.ca/~arnaud/cs540_2008.html O ce hours: Tuesday & Thursday 11.30-12.30. Introductory tutorial Matlab session by Ian Mitchell TODAY, 5pm - 7pm, DMP 110 Matlab resources: http://www.cs.ubc.ca/~mitchell/matlabresources.html AD () January 2008 3 / 41

Prerequisites. Maths: multivariate calculus, linear algebra, probability. CS: programming skills; knowledge of data structures and algorithms is helpful but not crucial. Prior exposure to statistics/machine learning is highly desirable, but you will be OK as long as your math is good. I will discuss Bayesian statistics at length.

(Tentative) grading. The midterm will be sometime around the end of February and is worth 30%. The final project will be in April (exact date to be confirmed) and is worth 40%. There will be three (long) assignments, worth 30% in total. Additional optional exercises will be given.

Textbook: C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. Check the webpage http://research.microsoft.com/~cmbishop/prml/; solutions to the exercises are available there. Each week there will be required reading, optional reading and background reading. Optional textbook: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.

Related classes: CPSC 532L - Multiagent Systems (Leyton-Brown); CPSC 532M - Dynamic Programming (Ian Mitchell); statistics courses.

What is Machine Learning? "Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time" (Herb Simon). Machine Learning is an interdisciplinary field at the intersection of Statistics, CS, EE, etc. At the beginning, Machine Learning was fairly heuristic, but it has now evolved and become, in my opinion, Applied Statistics with a CS flavour. Aim of this course: introducing basic models, concepts and algorithms.

Main Learning Tasks. Supervised learning: given a training set of $N$ input-output pairs $\{x_n, t_n\} \in \mathcal{X} \times \mathcal{T}$, construct a function $f : \mathcal{X} \to \mathcal{T}$ to predict the output $\hat{t} = f(x)$ associated with a new input $x$. Each input $x_n$ is a $p$-dimensional feature vector (covariates, explanatory variables); each output $t_n$ is a target variable (response). Regression corresponds to $\mathcal{T} = \mathbb{R}^d$; classification corresponds to $\mathcal{T} = \{1, \ldots, K\}$. The aim is to produce the correct output given a new input.

Polynomial regression. Figure: Polynomial regression where $x \in \mathbb{R}$ and $t \in \mathbb{R}$.
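
As a minimal illustrative sketch (not from the slides), a polynomial regression of this kind can be fitted by least squares on a Vandermonde design matrix; the degree, data and noise level below are arbitrary choices for demonstration.

```python
import numpy as np

# Toy data: noisy samples of a smooth function (made up for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

degree = 3
X = np.vander(x, degree + 1, increasing=True)   # design matrix [1, x, x^2, x^3]
w, *_ = np.linalg.lstsq(X, t, rcond=None)       # least-squares fit of the weights

t_hat = np.vander(np.array([0.25]), degree + 1, increasing=True) @ w
print("prediction at x = 0.25:", t_hat)
```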

Linear regression. Figure: Linear least squares fitting for $x \in \mathbb{R}^2$ and $t \in \mathbb{R}$. We seek the linear function of $x$ that minimizes the sum of squared residuals for the response $y$.
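
A similarly hedged sketch of the two-dimensional case via the normal equations; the data and true coefficients below are synthetic and chosen only to show the computation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))                   # inputs x_n in R^2
t = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

A = np.column_stack([np.ones(len(X)), X])           # design matrix with an intercept column
w = np.linalg.solve(A.T @ A, A.T @ t)               # normal equations: minimizes ||A w - t||^2
print("intercept and weights:", w)
```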

Piecewise linear regression. Figure: Piecewise linear regression function.

Figure: Linear classifier where $x_n \in \mathbb{R}^2$ and $t_n \in \{0, 1\}$.

Handwritten digit recognition. Figure: Examples of handwritten digits from US postal employees.

Figure: Can you pick out the tufas?

Practical examples of classification: email spam filtering (feature vector = bag of words), webpage classification, detecting credit card fraud, credit scoring, face detection in images.
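
As a hypothetical sketch of the bag-of-words representation mentioned above (the vocabulary and message are invented), each email is mapped to a vector of word counts over a fixed vocabulary:

```python
from collections import Counter

vocabulary = ["free", "money", "meeting", "tomorrow", "winner"]   # toy vocabulary

def bag_of_words(text, vocab=vocabulary):
    """Map a message to a vector of word counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

print(bag_of_words("Free money money for the lucky winner"))      # [1, 2, 0, 0, 1]
```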

Unsupervised learning: we are given training data $\{x_n\}$. The aim is to produce a model or build useful representations for $x$: modeling the distribution of the data, clustering, data association, dimensionality reduction, structure learning.

Hard and Soft Clustering. Figure: Data (left), hard clustering (middle), soft clustering (right).

Finite Mixture Models. Finite mixture of Gaussians: $f(x \mid \theta) = \sum_{i=1}^{k} p_i \, \mathcal{N}(x; \mu_i, \sigma_i^2)$, where $\theta = \{\mu_i, \sigma_i^2, p_i\}_{i=1,\ldots,k}$ is estimated from some data $(x_1, \ldots, x_n)$. How to estimate the parameters? How to estimate the number of components $k$?
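
One standard answer to the first question is the EM algorithm (EM for mixture models appears later in the tentative topics). The following is a minimal, illustrative EM sketch for a one-dimensional Gaussian mixture; the synthetic data, random initialization and fixed iteration count are arbitrary choices, not the lecture's own code.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Small illustrative EM for a 1-D mixture of k Gaussians."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)          # initial means: k random data points
    var = np.full(k, x.var())                          # initial variances
    p = np.full(k, 1.0 / k)                            # initial mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities r[n, i] proportional to p_i * N(x_n; mu_i, var_i)
        d = x[:, None] - mu[None, :]
        r = p * np.exp(-0.5 * d ** 2 / var) / np.sqrt(2 * np.pi * var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means and variances
        nk = r.sum(axis=0)
        p = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return p, mu, var

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(data, k=2))
```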

Modelling dependent data

Hidden Markov Models. In this case, we have $x_n \mid x_{n-1} \sim f(\cdot \mid x_{n-1})$ and $y_n \mid x_n \sim g(\cdot \mid x_n)$. For example, the stochastic volatility model: $x_n = \alpha x_{n-1} + \sigma v_n$ where $v_n \sim \mathcal{N}(0, 1)$, and $y_n = \beta \exp(x_n/2)\, w_n$ where $w_n \sim \mathcal{N}(0, 1)$; the process $(y_n)$ is observed but $(x_n, \alpha, \sigma, \beta)$ are unknown.
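
A brief simulation sketch of the stochastic volatility model above; the parameter values are chosen arbitrarily for illustration (in the actual inference problem they are unknown).

```python
import numpy as np

def simulate_stochastic_volatility(T=500, alpha=0.9, sigma=0.3, beta=0.6, seed=0):
    """Simulate x_n = alpha*x_{n-1} + sigma*v_n and y_n = beta*exp(x_n/2)*w_n."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    y = np.zeros(T)
    for n in range(1, T):
        x[n] = alpha * x[n - 1] + sigma * rng.standard_normal()
        y[n] = beta * np.exp(x[n] / 2) * rng.standard_normal()
    return x, y    # in practice only y is observed

x, y = simulate_stochastic_volatility()
print(y[:5])
```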

Tracking Application

Hierarchical Clustering. Figure: Dendrogram from agglomerative hierarchical clustering with average linkage applied to the human microarray data.
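
For reference, agglomerative hierarchical clustering with average linkage of the kind shown in the figure can be sketched with SciPy; the random data below merely stands in for the microarray measurements.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
data = rng.standard_normal((30, 5))       # placeholder data: 30 samples, 5 features

Z = linkage(data, method="average")       # agglomerative clustering with average linkage
tree = dendrogram(Z, no_plot=True)        # dendrogram structure (plot with matplotlib if desired)
print(len(tree["ivl"]), "leaves")
```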

Dimensionality reduction. Figure: Simulated data in three classes, near the surface of a half-sphere.

Figure: The best rank-two linear approximation to the half-sphere data. The right panel shows the projected points, with coordinates given by the first two principal components of the data.
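
A minimal sketch, assuming placeholder 3-D data in place of the half-sphere data, of how such a rank-two approximation is obtained: center the data, take the top two principal directions from the SVD, and project.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))              # placeholder for the 3-D data

Xc = X - X.mean(axis=0)                        # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                         # coordinates on the first two principal components
X_rank2 = scores @ Vt[:2] + X.mean(axis=0)     # best rank-two linear approximation
print(scores.shape, X_rank2.shape)
```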

Manifold learning. Figure: Data lying on a low-dimensional manifold.

Structure learning. Figure: Graphical model.

Ockham's razor. How can we predict the test set if it may be arbitrarily different from the training set? We need to make the assumption that the future will be like the present, in statistical terms. Any hypothesis that agrees with all the data is as good as any other, but we generally prefer simpler ones.

Active learning. Active learning is a principled way of integrating decision theory with traditional statistical methods for learning models from data. In active learning, the machine can query the environment; that is, it can ask questions. Decision theory leads to optimal strategies for choosing when and what questions to ask in order to gather the best possible data. Good data is often better than a lot of data.

Active learning and surveillance. In a network with thousands of cameras, which camera views should be presented to the human operator?

Active learning and sensor networks. How do we optimally choose a subset of sensors in order to obtain the best understanding of the world while minimizing resource expenditure (power, bandwidth, distractions)?

Active learning example

Decision theoretic foundations of machine learning. Agents (humans, animals, machines) always have to make decisions under uncertainty. Uncertainty arises because we do not know the true state of the world, e.g. because of a limited field of view, because of noise, etc. The optimal way to behave when faced with uncertainty is as follows: estimate the state of the world, given the evidence you have seen up until now (this is called your belief state); then, given your preferences over possible future states (expressed as a utility function) and your beliefs about the effects of your actions, pick the action that you think will be the best.

Decision making under uncertainty. You should choose the action with maximum expected utility: $a_{t+1} = \arg\max_{a_{t+1} \in \mathcal{A}} \sum_x P_\theta(X_{t+1} = x \mid a_{1:t+1}, y_{1:t})\, U_\phi(x)$, where $X_t$ represents the true (unknown) state of the world at time $t$; $y_{1:t} = (y_1, \ldots, y_t)$ represents the data I have seen up until now; $a_{1:t}$ are the actions that I have taken up until now; $U(x)$ is a utility function on state $x$; $\theta$ are the parameters of your world model; and $\phi$ are the parameters of your utility function.
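
As a toy numerical sketch of this rule (the states, actions, beliefs and utilities below are invented for illustration), the agent scores each candidate action by its expected utility under the belief state and picks the maximizer.

```python
import numpy as np

# Hypothetical belief states P(X_{t+1} = x | a, y_{1:t}) for two candidate actions
belief = {
    "go_left":  np.array([0.7, 0.2, 0.1]),    # distribution over 3 world states
    "go_right": np.array([0.1, 0.3, 0.6]),
}
utility = np.array([10.0, 0.0, -5.0])          # U(x) for each world state

expected_utility = {a: float(p @ utility) for a, p in belief.items()}
best_action = max(expected_utility, key=expected_utility.get)
print(expected_utility, "->", best_action)     # go_left: 6.5, go_right: -2.0 -> go_left
```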

Decision making under uncertainty. We will later justify why your beliefs should be represented using probability theory. It can also be shown that all your preferences can be expressed using a scalar utility function (shocking, but true). Machine learning is concerned with estimating models of the world ($\theta$) given data, and using them to compute belief states (i.e., probabilistic inference). Preference elicitation is concerned with estimating models ($\phi$) of utility functions (we will not consider this issue).

Decision making under uncertainty: examples. Robot navigation: $X_t$ = robot location, $Y_t$ = video image, $A_t$ = direction to move, $U(X)$ = distance to goal. Medical diagnosis: $X_t$ = whether the patient has cancer, $Y_t$ = symptoms, $A_t$ = perform clinical test, $U(X)$ = health of patient. Financial prediction: $X_t$ = true state of the economy, $Y_t$ = observed stock prices, $A_t$ = invest in stock $i$, $U(X)$ = wealth.

Tentative topics: review of probability and Bayesian statistics; linear and nonlinear regression and classification; nonparametric and model-based clustering; finite mixture and hidden Markov models, EM algorithms; graphical models; variational methods; MCMC and particle filters; PCA, factor analysis; boosting.

Project. A list of potential projects will be posted shortly. You can also propose your own idea. I expect the final project to be written like a conference paper.