ECE-175A: Elements of Machine Intelligence - I
Ken Kreutz-Delgado, Nuno Vasconcelos
ECE Department, UCSD, Winter 2011
The Course

The course will cover basic, but important, aspects of machine learning and pattern recognition. We will cover a lot of ground; at the end of the quarter you'll know how to implement many things that may seem very complicated today.
- Homework/computer assignments will count for 30% of the overall grade. The homework problems will be graded "A for effort."
- Exams: 1 mid-term, date TBA - 30%; 1 final - 40% (covers everything).
Resources

- Course web page is accessible from http://dsp.ucsd.edu/~kreutz. All materials, except homework and exam solutions, will be available there. Solutions will be available in my office pod.
- Course Instructor: Ken Kreutz-Delgado, kreutz@ece.ucsd.edu, EBU1-5605. Office hours: Wednesday, Noon-1pm.
- Administrative Assistant: Travis Spackman (tspackman@ece.ucsd.edu), EBU1-5600, may sometimes be involved in administrative issues.
- Tutor/Grader: Omar Nadeem, nadeem@ucsd.edu. Office hours: Mon 4-6pm, Jacobs Hall (EBU-1) 4506; Wed 2:30-4:30pm, Jacobs Hall (EBU-1) 5706.
Texts

- Required: Introduction to Machine Learning, 2e, Ethem Alpaydin, MIT Press, 2010.
- Suggested reference texts: Pattern Recognition and Machine Learning, C.M. Bishop, Springer, 2007; Pattern Classification, Duda, Hart, Stork, Wiley, 2001.
- Prerequisites you must know well: linear algebra, as in Linear Algebra, Strang, 1988; probability and conditional probability, as in Fundamentals of Applied Probability, Drake, McGraw-Hill, 1967.
Why Machine Learning?

Many processes in the world are ruled by deterministic equations, e.g. f = ma, V = IR, Maxwell's equations, and other physical laws, with acceptable levels of noise, error, and other variability. In such domains, we don't need statistical learning.
Learning is needed when there is a need for predictions about, or classification of, random variables Y:
- that represent events, situations, or objects in the world, and
- that may (or may not) depend on other factors (variables) X,
- in a way for which it is impossible or too difficult to derive an exact, deterministic behavioral equation, or
- in order to adapt to a constantly changing world.
Examples and Perspectives

- Data-mining viewpoint: large amounts of data that do not follow deterministic rules. E.g., given a history of thousands of customer records and some questions that I can ask you, how do I predict that you will pay on time? It is impossible to derive a theory for this; it must be learned. While many associate learning with data-mining, it is by no means the only important application or viewpoint.
- Signal processing viewpoint: signals combine in ways that depend on hidden structure (e.g. speech waveforms depend on language, grammar, etc.). Signals are usually subject to significant amounts of noise (which sometimes means "things we do not know how to model").
Examples (cont'd)

- Signal processing viewpoint, e.g. the Cocktail Party Problem: although there are all these people talking loudly at once, you can still understand what your friend is saying. How could you build a chip to separate the speakers (as well as your ear and brain can do)? Model the hidden dependence as a linear combination of independent sources + noise.
- There are many other similar examples in the areas of wireless, communications, signal restoration, etc.
Examples (cont'd)

- Perception/AI viewpoint: it is a complex world; one cannot model everything in detail. Rely on probabilistic models that explicitly account for the variability, and use the laws of probability to make inferences. E.g.:
  P(burglar | alarm, no earthquake) is high;
  P(burglar | alarm, earthquake) is low.
- There is a whole field that studies perception as Bayesian inference. In a sense, perception really is confirming what you already know: priors + observations = robust inference.
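The burglar-alarm comparison above can be computed directly with Bayes' rule. This is a minimal sketch; all of the probability values below are hypothetical numbers chosen only to reproduce the qualitative claim on the slide, not figures from the lecture.

```python
# "Explaining away" with Bayes' rule: an earthquake that can also trigger
# the alarm drives down the posterior probability of a burglar.
# All numbers are hypothetical, chosen for illustration.
p_burglar = 0.01          # prior P(burglar)
p_alarm_b = 0.95          # P(alarm | burglar), regardless of earthquake
p_alarm_nb_noq = 0.01     # P(alarm | no burglar, no earthquake)
p_alarm_nb_q = 0.90       # P(alarm | no burglar, earthquake)

def posterior(p_alarm_no_burglar):
    """P(burglar | alarm, earthquake state) via Bayes' rule."""
    num = p_alarm_b * p_burglar
    return num / (num + p_alarm_no_burglar * (1 - p_burglar))

post_no_quake = posterior(p_alarm_nb_noq)  # alarm observed, no earthquake
post_quake = posterior(p_alarm_nb_q)       # alarm observed, earthquake too
print(post_no_quake > post_quake)  # → True
```

With these numbers the posterior drops from roughly 0.49 (no earthquake) to about 0.01 (earthquake): the earthquake "explains away" the alarm.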
Examples (cont'd)

- Communications engineering viewpoint, detection problems: X -> channel -> Y. You observe Y and know something about the statistics of the channel. What was X? This is the canonical detection problem.
- For example, face detection in computer vision: I see pixel array Y. Is it a face?
What is Statistical Learning?

Goal: given a function y = f(x) and a collection of example data points, learn what the function f(.) is. This is called training.
Two major types of learning:
- Unsupervised: only X is known; usually referred to as clustering.
- Supervised: both X and Y are known during training, only X is known at test time; usually referred to as classification or regression.
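As a concrete (if crude) instance of "learning f from example (x, y) pairs," here is a one-nearest-neighbour learner. The toy data and the underlying rule (y = 1 for large x) are invented for illustration; the slides do not prescribe this method.

```python
# A minimal 1-nearest-neighbour "learner": memorise the training pairs
# (x, y), then predict f(x) for a new x as the label of the closest
# stored example. Toy data are made up for illustration.
train = [(1.0, 0), (2.0, 0), (4.0, 0), (6.0, 1), (8.0, 1), (9.0, 1)]

def predict(x):
    # find the training example whose x-value is nearest to the query
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(predict(3.0))  # → 0  (closest examples are labelled 0)
print(predict(7.0))  # → 1
```

Training here is just memorisation; more powerful learners (discussed later) instead fit a parametric model or a decision boundary.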
Supervised Learning

X can be anything, but the type of the known data Y dictates the type of supervised learning problem:
- Y in {0,1} is referred to as detection or binary classification;
- Y in {0, ..., M-1} is referred to as (M-ary) classification;
- Y continuous is referred to as regression.
The theories are quite similar, and the algorithms are similar most of the time. We will emphasize classification, but will talk about regression when particularly insightful.
Example

Classification of fish: fish roll down a conveyor belt, a camera takes a picture, and we must decide: is this a salmon or a sea bass?
Q1: What is X? I.e., what features do I use to distinguish between the two fish? This is somewhat of an art form. Frequently, the best approach is to ask domain experts. E.g., an expert says: use overall length and width of scales.
Q2: How to do Classification/Detection?

Two major types of classifiers:
- Discriminant: determine the decision boundary in feature space that best separates the classes;
- Generative: fit a probability model to each class and then compare the probabilities to find a decision rule.
A lot more on the intimate relationship between these two approaches later!
Caution

How do we know learning has worked? We care about generalization, i.e. accuracy outside the training set.
Models that are too powerful on the training set can lead to over-fitting. E.g., in regression one can always exactly fit n points with a polynomial of order n-1. Is this good? How likely is the error to be small outside the training set? There is a similar problem for classification.
Fundamental rule: only held-out test-set performance results matter!!!
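The "order n-1 polynomial through n points" remark can be checked directly with Lagrange interpolation. The data points below are arbitrary; the point is that training error is exactly zero while the fit between and beyond the points is not controlled.

```python
# Fit 5 points exactly with a degree-4 polynomial (Lagrange form):
# training error is zero by construction, which says nothing about
# error away from the training points. Data are made up.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.5, 1.5, 1.0]   # a noisy-looking target

def lagrange(x):
    # sum of basis polynomials, each equal to y_i at x_i and 0 at x_j
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Zero error on every training point...
print(all(abs(lagrange(xi) - yi) < 1e-9 for xi, yi in zip(xs, ys)))
# ...but nothing constrains the polynomial between or beyond them:
print(lagrange(4.5))
```

Evaluating just outside the data range already shows the polynomial swinging away from the training values, which is the over-fitting danger the slide warns about.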
Generalization

Good generalization requires controlling the trade-off between training and test error:
- training error large, test error large;
- training error smaller, test error smaller;
- training error smallest, test error largest.
This trade-off is known by many names. In the generative classification world it is usually due to the bias-variance trade-off of the class models.
Generative Model Learning

Each class is characterized by a probability density function (the class-conditional density), the so-called probabilistic generative model. E.g., a Gaussian. Training data are used to estimate the class pdf's. Overall, the process is referred to as density estimation.
A nonparametric approach would be to estimate the pdf's using histograms.
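For the Gaussian case, density estimation reduces to estimating a mean and variance per class. A minimal sketch, with made-up "fish length" measurements standing in for training data:

```python
import statistics

# Fit a Gaussian class-conditional density by maximum likelihood:
# the ML estimates are the sample mean and the variance with a 1/n
# (population) normaliser. The measurements are invented toy data.
salmon = [60.0, 62.0, 58.0, 61.0, 59.0]
seabass = [75.0, 78.0, 74.0, 77.0, 76.0]

def fit_gaussian(samples):
    mu = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu)   # divides by n, the ML choice
    return mu, var

mu_s, var_s = fit_gaussian(salmon)
print(mu_s, var_s)  # → 60.0 2.0
```

A histogram estimate would instead count samples per bin; the parametric Gaussian fit needs far less data but assumes the class really is Gaussian (the bias side of the bias-variance trade-off).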
Decision Rules

Given the class pdf's, Bayesian Decision Theory (BDT) provides us with optimal rules for classification. "Optimal" here might mean minimum probability of error, for example.
We will:
- study BDT in detail;
- establish connections to other decision principles (e.g. linear discriminants);
- show that Bayesian decisions are usually intuitive;
- derive optimal rules for a range of classifiers.
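The minimum-probability-of-error rule is simply: pick the class maximizing prior times class-conditional likelihood. A sketch with hypothetical Gaussian class models (the means, variances, and priors are invented, loosely continuing the fish example):

```python
import math

# Bayes (minimum probability of error) decision rule with Gaussian
# class-conditional densities. All parameters are hypothetical.
classes = {
    "salmon":  {"prior": 0.5, "mu": 60.0, "var": 2.0},
    "seabass": {"prior": 0.5, "mu": 76.0, "var": 2.0},
}

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def decide(x):
    # choose the class with the largest posterior (up to a common factor)
    return max(classes,
               key=lambda c: classes[c]["prior"]
               * gaussian_pdf(x, classes[c]["mu"], classes[c]["var"]))

print(decide(61.0))  # → salmon  (near the salmon mean)
print(decide(75.0))  # → seabass
```

With equal priors and equal variances this rule reduces to "pick the nearest mean," which is exactly the kind of intuitive behaviour the slide promises Bayesian decisions will have.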
Features and Dimensionality

For most of what we have seen so far, the theory is well understood, algorithms are available, and the limitations are characterized. Usually, though, finding good features is an art form.
We will survey traditional techniques:
- Bayesian Decision Theory (BDT);
- Linear Discriminant Analysis (LDA);
- Principal Component Analysis (PCA);
and some more recent methods:
- Independent Component Analysis (ICA);
- Support Vector Machines (SVM).
Discriminant Learning

Instead of learning models (pdf's) and deriving a decision boundary from the model, learn the boundary directly. There are many such methods. The simplest case is the so-called hyperplane classifier: simply find the hyperplane that best separates the classes, assuming linear separability of the features.
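A classic way to find such a hyperplane directly is the perceptron algorithm, shown here as a sketch (the slide does not name a specific algorithm, and the 2-D toy data are invented). It nudges the hyperplane (w, b) toward every misclassified point and, for linearly separable data, converges to a separating hyperplane.

```python
# Perceptron: learn a separating hyperplane w.x + b = 0 directly,
# without fitting any class density. Toy 2-D data, labels in {-1, +1}.
data = [((0.0, 0.0), -1), ((1.0, 0.0), -1),
        ((3.0, 3.0), +1), ((4.0, 2.0), +1)]

w, b = [0.0, 0.0], 0.0
for _ in range(100):                                  # passes over the data
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:      # misclassified point
            w[0] += y * x1                            # push hyperplane
            w[1] += y * x2                            # toward correctness
            b += y

separated = all(y * (w[0] * x1 + w[1] * x2 + b) > 0 for (x1, x2), y in data)
print(separated)  # → True
```

Note the perceptron returns *a* separating hyperplane, not necessarily the best one; the SVM on the following slides picks the separating hyperplane with the largest margin.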
Support Vector Machines

How do we do this? The most recently developed classifiers are based on the use of support vectors. One transforms the data into linearly separable features using kernel functions. The best performance is obtained by maximizing the margin: the distance between the decision hyperplane and the closest point on each side.
Support Vector Machines (cont'd)

For separable classes, the training error can be made zero by classifying each point correctly. This can be implemented by solving the optimization problem

  w* = arg max_w margin(w)   subject to: x_l correctly classified, for every training point l.

This is an optimization problem with n constraints (one per training point); not trivial, but solvable. The solution is the support-vector machine (the points on the margin are the "support vectors").
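The quantity being maximized is easy to state concretely: for a hyperplane w.x + b = 0, the margin is the distance to the closest training point, min_l |w.x_l + b| / ||w||. Solving the full optimization is beyond a short sketch, so the example below only *evaluates* the margin of a fixed, hand-picked hyperplane on made-up points.

```python
import math

# Margin of a given hyperplane w.x + b = 0 over a point set:
# min over points of |w.x + b| / ||w||. The SVM is the (w, b)
# maximising this. Hyperplane and points are hypothetical.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (4.0, 2.0)]
w, b = (1.0, 1.0), -3.5          # the line x1 + x2 = 3.5

def margin(points, w, b):
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x1 + w[1] * x2 + b) / norm for x1, x2 in points)

print(margin(points, w, b))      # distance to the closest point(s)
```

The points attaining that minimum distance are the support vectors: move any other point and the solution does not change; move a support vector and it does.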
Kernels and Linear Separability

The trick is to map the problem to a higher-dimensional space: a non-linear boundary in the original space becomes a hyperplane in the transformed space. This can be done efficiently by the introduction of a kernel function (a kernel-based feature transformation); the classification problem is mapped into a reproducing kernel Hilbert space.
Kernels are at the core of the success of SVM classification. Most classical linear techniques (e.g. PCA, LDA, ICA, etc.) can be kernelized with significant improvement.
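The efficiency claim rests on a simple identity: a kernel evaluates an inner product in the higher-dimensional space without ever constructing the map. For the polynomial kernel k(x, z) = (x.z)^2 in 2-D, the corresponding explicit feature map is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), which we can verify directly (the test vectors are arbitrary):

```python
import math

# Kernel trick sanity check: (x.z)^2 equals the inner product of the
# explicit quadratic features phi(x), phi(z), so a classifier can work
# in the 3-D feature space while only ever computing 2-D dot products.
def kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # explicit quadratic feature map, for comparison only
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = kernel(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(abs(lhs - rhs) < 1e-9)  # → True
```

For higher polynomial degrees or Gaussian kernels, phi lives in a very high (even infinite) dimensional space, yet the kernel evaluation stays cheap; that is why kernelizing PCA, LDA, etc. is practical.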
Unsupervised Learning

So far, we have talked about supervised learning, where we know the class of each point. In many problems this is not feasible (e.g. image segmentation).
Unsupervised Learning (cont'd)

In these problems we are given X, but not Y. The standard algorithms for this are iterative:
- start from a best guess;
- given the Y-estimates, fit the class models;
- given the class models, re-estimate Y.
The procedure usually converges to an optimal solution, although not necessarily the global optimum. Performance is worse than that of a supervised classifier, but this is the best we can do.
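The alternating scheme above, in its simplest form, is k-means clustering, sketched here in one dimension (the slide describes the general iteration, not this specific algorithm; the data and initial centres are made up):

```python
# 1-D k-means as an instance of the iterative scheme: alternate between
# (1) re-estimating the labels Y (assign each point to its nearest
# centre) and (2) refitting the class models (each centre becomes the
# mean of its assigned points). Data and initial guess are invented.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centres = [0.0, 10.0]                      # initial best guess

for _ in range(20):
    clusters = [[], []]
    for x in data:                         # step 1: re-estimate labels
        nearest = min((0, 1), key=lambda k: abs(x - centres[k]))
        clusters[nearest].append(x)
    centres = [sum(c) / len(c) if c else centres[k]   # step 2: refit
               for k, c in enumerate(clusters)]

print(centres)   # two centres, one near each clump of data
```

Each iteration can only decrease the total within-cluster distortion, so the procedure converges; but, as the slide notes, only to a local optimum, which is why the initial guess matters.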
Reasons to Take the Course

- To learn about classification and statistical learning: there is a tremendous amount of theory, but things invariably go wrong: too little data, noise, too many dimensions, training sets that do not reflect all possible variability, etc.
- To learn that good learning solutions require: knowledge of the domain (e.g. "these are the features to use"); knowledge of the available techniques and their limitations; etc. In the absence of either of these, you will fail!
- To learn skills that are highly valued in the marketplace!