Unsupervised Learning and Other Neural Networks


CSE 53 Soft Computing (NOT PART OF THE FINAL)

Topics: Introduction; Mixture Densities and Identifiability; ML Estimates; Application to Normal Mixtures; Other Neural Networks.

Introduction

Previously, all our training samples were labeled: these samples were said to be "supervised".
We now investigate a number of "unsupervised" procedures that use unlabeled samples.
Collecting and labeling a large set of sample patterns can be costly.
We can train with large amounts of (less expensive) unlabeled data, and only then use supervision to label the groupings found. This is appropriate for large data-mining applications where the contents of a large database are not known beforehand.

This is also appropriate in many applications where the characteristics of the patterns can change slowly with time (such as automated food classification as the seasons change).
Improved performance can be achieved if classifiers running in an unsupervised mode are used.
We can use unsupervised methods to identify features (through clustering) that will then be useful for categorization (or classification).
We gain some insight into the nature (or structure) of the data.

Mixture Densities and Identifiability

We shall begin with the assumption that the functional forms of the underlying probability densities are known, and that the only thing that must be learned is the value of an unknown parameter vector.
We make the following assumptions:
The samples come from a known number c of classes.
The prior probabilities $P(\omega_j)$ for each class are known ($j = 1, \dots, c$).
The class-conditional densities $p(x \mid \omega_j, \theta_j)$ ($j = 1, \dots, c$) are known in form but might be different.
The values of the c parameter vectors $\theta_1, \theta_2, \dots, \theta_c$ are unknown.

The category labels are unknown.

$$p(x \mid \theta) = \sum_{j=1}^{c} \underbrace{p(x \mid \omega_j, \theta_j)}_{\text{component densities}}\; \underbrace{P(\omega_j)}_{\text{mixing parameters}}, \qquad \theta = (\theta_1, \theta_2, \dots, \theta_c)^t$$

This density function is called a mixture density.
Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector $\theta$.
Once $\theta$ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities.

Definition
A density $p(x \mid \theta)$ is said to be identifiable (i.e., the mapping $\theta \mapsto p(x \mid \theta)$ is injective) if $\theta \neq \theta'$ implies that there exists an $x$ such that $p(x \mid \theta) \neq p(x \mid \theta')$.

As a simple example, consider the case where x is binary and $P(x \mid \theta)$ is the following mixture:

$$P(x \mid \theta) = \tfrac{1}{2}\,\theta_1^x (1-\theta_1)^{1-x} + \tfrac{1}{2}\,\theta_2^x (1-\theta_2)^{1-x}
= \begin{cases} \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1 \\[2pt] 1 - \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0 \end{cases}$$

Assume that $P(x=1 \mid \theta) = 0.6$ and $P(x=0 \mid \theta) = 0.4$; by substituting these probability values, we obtain $\theta_1 + \theta_2 = 1.2$.
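The constraint $\theta_1 + \theta_2 = 1.2$ leaves the individual parameters undetermined. A minimal numerical sketch of this (Python with NumPy assumed; the specific parameter pairs below are illustrative, not taken from the lecture):

```python
import numpy as np

def binary_mixture_pmf(x, theta1, theta2):
    """P(x | theta) = 1/2 * theta1^x (1-theta1)^(1-x) + 1/2 * theta2^x (1-theta2)^(1-x)."""
    return 0.5 * theta1**x * (1 - theta1)**(1 - x) + 0.5 * theta2**x * (1 - theta2)**(1 - x)

# Several distinct parameter vectors satisfying theta1 + theta2 = 1.2 ...
for theta1, theta2 in [(0.6, 0.6), (0.4, 0.8), (0.2, 1.0)]:
    # ... all produce exactly the same distribution over x, so no amount of
    # unlabeled data can distinguish them: the mixture is unidentifiable.
    print(theta1, theta2, binary_mixture_pmf(1, theta1, theta2), binary_mixture_pmf(0, theta1, theta2))
```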

Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible.
For discrete distributions, if there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can become a serious problem!
While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

$$p(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x - \theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x - \theta_2)^2\right]$$

cannot be uniquely identified if $P(\omega_1) = P(\omega_2)$ (we cannot recover a unique $\theta$ even from an infinite amount of data!).
$\theta = (\theta_1, \theta_2)$ and $\theta' = (\theta_2, \theta_1)$ are two possible vectors that can be interchanged without affecting $p(x \mid \theta)$.
Identifiability can be a problem; we always assume that the densities we are dealing with are identifiable!
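A two-line check of this label-swapping symmetry (a sketch assuming equal priors of 0.5, unit variances, and arbitrary illustrative values for the means and the query point):

```python
import numpy as np

def equal_prior_mixture(x, t1, t2):
    """Two-component normal mixture with equal priors P(w1) = P(w2) = 0.5 and unit variance."""
    return 0.5 / np.sqrt(2 * np.pi) * (np.exp(-0.5 * (x - t1)**2) + np.exp(-0.5 * (x - t2)**2))

x = 0.7
# Swapping (theta1, theta2) leaves the density unchanged, so theta cannot be recovered uniquely.
print(equal_prior_mixture(x, -1.0, 2.0), equal_prior_mixture(x, 2.0, -1.0))
```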

ML Estimates

Suppose that we have a set $D = \{x_1, \dots, x_n\}$ of n unlabeled samples drawn independently from the mixture density ($\theta$ is fixed but unknown!)

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j).$$

The maximum-likelihood estimate is

$$\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta), \qquad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta).$$

The gradient of the log-likelihood $l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$ is:

$$\nabla_{\theta_i}\, l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i).$$

Since the gradient must vanish at the value of $\theta$ that maximizes $l$, the ML estimate $\hat{\theta}$ must satisfy the conditions

$$\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0 \qquad (i = 1, \dots, c).$$

By including the prior probabilities as unknown variables, we can finally show that:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})
\qquad \text{and} \qquad
\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0,$$

where

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}.$$

This equation enables clustering.
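The posterior $\hat{P}(\omega_i \mid x_k, \hat{\theta})$ above is what groups samples into clusters. A minimal sketch of computing it for a 1-D Gaussian mixture (NumPy assumed; the function name, means, priors, and data points are illustrative placeholders):

```python
import numpy as np

def posteriors(x, means, priors):
    """P_hat(w_i | x_k) for a 1-D Gaussian mixture with unit-variance components.
    x: (n,) samples; means: (c,) component means; priors: (c,) mixing parameters."""
    x = np.asarray(x)[:, None]                                    # shape (n, 1)
    lik = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)    # (n, c) component densities
    joint = lik * priors                                          # p(x_k | w_i, theta_i) P(w_i)
    return joint / joint.sum(axis=1, keepdims=True)               # normalize over the components

# Illustrative call: two components, a handful of samples.
post = posteriors([-2.1, -1.8, 1.9, 2.3], means=np.array([-2.0, 2.0]), priors=np.array([0.5, 0.5]))
print(post)  # each row sums to 1; the largest entry indicates the most probable cluster
```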

Applications to Normal Mixtures

$$p(x \mid \omega_i, \theta_i) \sim N(\mu_i, \Sigma_i)$$

Three cases are distinguished, depending on which of $\mu_i$, $\Sigma_i$, $P(\omega_i)$, and c are unknown:
Case 1 (simplest case): only the mean vectors $\mu_i$ are unknown; $\Sigma_i$, $P(\omega_i)$, and c are known.
Case 2: $\mu_i$, $\Sigma_i$, and $P(\omega_i)$ are unknown; c is known.
Case 3: all parameters, including the number of classes c, are unknown.

Case 1: Unknown mean vectors

$\mu_i = \theta_i$, $i = 1, \dots, c$

$$\ln p(x \mid \omega_i, \mu_i) = -\ln\!\left[(2\pi)^{d/2} |\Sigma_i|^{1/2}\right] - \tfrac{1}{2}(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i)$$

The ML estimate $\hat{\mu}_i$ of $\mu_i$ is:

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})} \qquad (1)$$

This is a weighted average of the samples coming from the i-th class: $P(\omega_i \mid x_k, \hat{\mu})$ is the fraction of the sample $x_k$ that comes from the i-th class, and $\hat{\mu}_i$ is the average of the samples coming from the i-th class.

Unfortunately, equation (1) does not give $\hat{\mu}_i$ explicitly.
However, if we have some way of obtaining good initial estimates $\hat{\mu}_i(0)$ for the unknown means, equation (1) can be seen as an iterative process for improving the estimates:

$$\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))}$$

This is a gradient ascent for maximizing the log-likelihood function.

Example (in class): Consider the simple two-component one-dimensional normal mixture

$$p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x - \mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x - \mu_2)^2\right] \qquad (2 \text{ clusters!})$$

Let us draw 25 samples sequentially from this mixture (see the table of samples, p. 54, with the class $\omega$ of each point), with $\mu_1 = -2$ and $\mu_2 = 2$.
The log-likelihood function is:

$$l(\mu_1, \mu_2) = \sum_{k=1}^{25} \ln p(x_k \mid \mu_1, \mu_2)$$
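A minimal sketch of this fixed-point iteration on synthetic data (NumPy assumed; the samples are generated here rather than taken from the table referred to above, so the resulting estimates will not match the numbers quoted below exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 25 synthetic samples from the mixture (1/3) N(-2, 1) + (2/3) N(2, 1).
labels = rng.random(25) < 1.0 / 3.0
x = np.where(labels, rng.normal(-2.0, 1.0, 25), rng.normal(2.0, 1.0, 25))

mu = np.array([-0.1, 0.1])                   # crude initial estimates mu_hat(0)
priors = np.array([1.0 / 3.0, 2.0 / 3.0])    # known mixing parameters (Case 1)

for _ in range(50):
    # Posteriors P(w_i | x_k, mu_hat(j)) for unit-variance normal components.
    lik = np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    post = lik * priors
    post /= post.sum(axis=1, keepdims=True)
    # Equation (1), applied iteratively: weighted average of the samples.
    mu = (post * x[:, None]).sum(axis=0) / post.sum(axis=0)

print(mu)  # should approach the true means (-2, 2), up to sampling noise
```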

The maximum value of l occurs at $\hat{\mu}_1 = -2.130$ and $\hat{\mu}_2 = 1.668$ (which are not far from the true values $\mu_1 = -2$ and $\mu_2 = +2$).
There is another peak at $\hat{\mu}_1 = 2.085$ and $\hat{\mu}_2 = -1.257$ which has almost the same height, as can be seen from the plot of the log-likelihood surface $l(\mu_1, \mu_2)$.

This mixture of normal densities is identifiable.
When the mixture density is not identifiable, the ML solution is not unique.

Case 2: All parameters unknown

No constraints are placed on the covariance matrix. Let $p(x \mid \mu, \sigma^2)$ be the two-component normal mixture:

$$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\tfrac{1}{2}\!\left(\frac{x - \mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2} x^2\right]$$

Suppose $\mu = x_1$; therefore:

$$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2} x_1^2\right]$$

For the rest of the samples:

$$p(x_k \mid \mu, \sigma^2) \geq \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2} x_k^2\right], \qquad k = 2, \dots, n$$

Finally,

$$p(x_1, \dots, x_n \mid \mu, \sigma^2) \geq \left\{\frac{1}{\sigma} + \exp\!\left[-\tfrac{1}{2} x_1^2\right]\right\} \frac{1}{\left(2\sqrt{2\pi}\right)^{n}} \exp\!\left[-\tfrac{1}{2} \sum_{k=2}^{n} x_k^2\right],$$

and this lower bound $\to \infty$ as $\sigma \to 0$. The likelihood is therefore arbitrarily large, and the maximum-likelihood solution becomes singular.
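A small numerical sketch of this degeneracy (assuming NumPy and a few arbitrary illustrative samples): placing one component's mean on a data point and shrinking $\sigma$ makes the likelihood blow up.

```python
import numpy as np

def mixture_likelihood(data, mu, sigma):
    """Likelihood of the two-component mixture 0.5*N(mu, sigma^2) + 0.5*N(0, 1)."""
    comp1 = np.exp(-0.5 * ((data - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    comp2 = np.exp(-0.5 * data ** 2) / np.sqrt(2 * np.pi)
    return np.prod(0.5 * comp1 + 0.5 * comp2)

data = np.array([0.9, -0.3, 1.7, 0.2])   # arbitrary illustrative samples
for sigma in [1.0, 0.1, 0.01, 0.001]:
    # mu pinned to the first sample: the likelihood grows without bound as sigma -> 0.
    print(sigma, mixture_likelihood(data, mu=data[0], sigma=sigma))
```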

Adding an assumption

Consider the largest of the finite local maxima of the likelihood function and use ML estimation there. We obtain the following iterative scheme:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

$$\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\,(x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

where

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{|\hat{\Sigma}_i|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_j)^t \hat{\Sigma}_j^{-1}(x_k - \hat{\mu}_j)\right] \hat{P}(\omega_j)}$$

K-Means Clustering

Goal: find the c mean vectors $\mu_1, \mu_2, \dots, \mu_c$.
Replace the squared Mahalanobis distance $(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)$ by the squared Euclidean distance $\|x_k - \hat{\mu}_i\|^2$.
Find the mean $\hat{\mu}_m$ nearest to $x_k$ and approximate $\hat{P}(\omega_i \mid x_k, \hat{\theta})$ as:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$
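A minimal sketch of one pass of the iterative scheme above, i.e., the three re-estimation equations with the soft posterior (NumPy assumed; the function name and data layout are illustrative, and the $(2\pi)^{d/2}$ factor is omitted because it cancels in the posterior). Replacing the soft posterior with the hard 0/1 approximation just described yields the k-means algorithm that follows.

```python
import numpy as np

def ml_mixture_step(x, means, covs, priors):
    """One pass of the re-estimation equations for a Gaussian mixture.
    x: (n, d) data; means: (c, d); covs: (c, d, d); priors: (c,).
    Returns updated (means, covs, priors)."""
    n, d = x.shape
    c = len(priors)
    post = np.empty((n, c))
    for i in range(c):
        diff = x - means[i]
        inv = np.linalg.inv(covs[i])
        mahal = np.einsum('nd,de,ne->n', diff, inv, diff)      # squared Mahalanobis distances
        post[:, i] = priors[i] * np.exp(-0.5 * mahal) / np.sqrt(np.linalg.det(covs[i]))
    post /= post.sum(axis=1, keepdims=True)                    # P_hat(w_i | x_k, theta_hat)

    priors = post.mean(axis=0)                                 # P_hat(w_i)
    means = (post.T @ x) / post.sum(axis=0)[:, None]           # mu_hat_i: weighted averages
    covs = np.stack([
        ((x - means[i]).T * post[:, i]) @ (x - means[i]) / post[:, i].sum()
        for i in range(c)
    ])                                                         # Sigma_hat_i: weighted scatter
    return means, covs, priors
```

Iterating this step from reasonable initial estimates climbs the log-likelihood toward a local maximum.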

Use the iterative scheme to find $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_c$.
If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is (a runnable sketch follows below):

Begin
  initialize n, c, $\mu_1, \mu_2, \dots, \mu_c$ (randomly selected)
  do
    classify the n samples according to the nearest $\mu_i$
    recompute $\mu_i$
  until no change in $\mu_i$
  return $\mu_1, \mu_2, \dots, \mu_c$
End

Other Neural Networks: Competitive Learning Networks (winner-take-all)

(Figure: input units fully connected by weights w to the output units; the weights are updated only for the winner output unit.)

The output unit with the highest activation is selected: for input vector $\mathbf{x}$, the activation value of output unit j is $a_j = \sum_i w_{ij} x_i = \mathbf{w}_j^T \mathbf{x}$.
The weights are updated only for the winner output unit $j^*$:

$$\mathbf{w}_{j^*}(t+1) = \mathbf{w}_{j^*}(t) + \eta\,\big(\mathbf{x}(t) - \mathbf{w}_{j^*}(t)\big), \qquad \mathbf{w}_j(t+1) = \mathbf{w}_j(t) \ \ \text{for } j \neq j^*$$
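A runnable sketch of the k-means loop boxed above (NumPy assumed; initializing the means with c randomly selected samples is one common choice, not necessarily the one intended in the lecture):

```python
import numpy as np

def k_means(x, c, rng=np.random.default_rng(0)):
    """k-means: x is an (n, d) array of samples, c the desired number of clusters.
    Returns the c mean vectors as a (c, d) array."""
    # Initialize the means with c randomly selected samples.
    means = x[rng.choice(len(x), size=c, replace=False)].copy()
    while True:
        # Classify each sample according to the nearest mean (squared Euclidean distance).
        dists = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (n, c)
        labels = dists.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster lost all its samples.
        new_means = np.array([x[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
                              for i in range(c)])
        if np.allclose(new_means, means):   # until no change in the means
            return new_means
        means = new_means
```

For instance, on the synthetic 1-D samples generated in the earlier sketch, k_means(x.reshape(-1, 1), 2) should return two centers near -2 and 2.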

The weight vectors move towards those areas where most of the input appears.
The weight vectors become the cluster centers.
The weight update finds the cluster centers (a minimal sketch of this update rule is given below).

The following topics can be considered by the students for their oral presentations:
Kohonen Self-Organizing Networks
Learning Vector Quantization
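To make the winner-take-all rule above concrete, a minimal sketch (NumPy assumed; the learning rate, number of output units, and input stream are illustrative choices, not values from the lecture):

```python
import numpy as np

def competitive_learning(inputs, n_outputs, eta=0.1, rng=np.random.default_rng(0)):
    """Winner-take-all competitive learning.
    inputs: (n, d) stream of input vectors; n_outputs: number of output units.
    Returns the (n_outputs, d) weight vectors, which drift toward cluster centers."""
    d = inputs.shape[1]
    w = rng.normal(size=(n_outputs, d))            # one weight vector per output unit
    for x in inputs:
        a = w @ x                                  # activations a_j = w_j^T x
        winner = np.argmax(a)                      # the output with the highest activation wins
        w[winner] += eta * (x - w[winner])         # update only the winner's weights
    return w
```

In practice the inputs and weight vectors are often normalized so that the largest activation corresponds to the nearest weight vector; the winning weight vectors then converge to the cluster centers, as described above.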