UVA CS 6316 Fall 2015 Graduate: Machine Learning. Lecture 18: Neural Network / Deep Learning


Dr. Yanjun Qi, University of Virginia, Department of Computer Science

Where are we? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

A study comparing classifiers: 11 binary classification problems / 8 metrics. Proceedings of the 23rd International Conference on Machine Learning (ICML '06).

Today
- Basic Neural Network (NN)
  - single neuron, e.g. logistic regression unit
  - multilayer perceptron (MLP)
  - for multi-class classification, softmax layer
  - More about training NN
- Deep CNN, Deep learning
  - History
  - Why is this a breakthrough?
  - Recent applications

Logistic Regression
- Task: classification
- Representation: Log-odds(Y) = linear function of X's
- Score function: EPE, with conditional log-likelihood
- Search/Optimization: iterative (Newton) method
- Models, Parameters: logistic weights

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\alpha + \beta x)}}$$

Using Logistic Function to Transfer

[Figure: S-shaped curve of the probability of disease P(Y=1 | X) against x, on a vertical axis from 0 to 1.]

$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Logistic regression

Logistic regression could be illustrated as a module. On input x, it outputs ŷ, where

$$\hat{y} = \frac{1}{1 + e^{-z}}, \qquad z = w^T x + b$$

Draw a logistic regression unit as: the inputs x_1, x_2, x_3 and a +1 bias feed a summing function Σ that computes z = w^T x + b; the activation function 1 / (1 + e^{-z}) then gives the output ŷ = p(y=1 | X, Θ).
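The logistic regression unit described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `sigmoid` and `logistic_unit`, and the weights and input, are assumptions chosen for the example and do not come from the lecture.

```python
import math

def sigmoid(z):
    """Logistic activation 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(x, w, b):
    """Summing function z = w^T x + b followed by the logistic activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights, bias and input, chosen only for illustration.
w = [0.5, -0.25, 0.1]
b = 0.0
x = [1.0, 2.0, 0.0]

# Here z = 0.5 - 0.5 + 0.0 = 0, so the unit outputs p(y=1 | x) = 0.5.
y_hat = logistic_unit(x, w, b)
print(y_hat)
```

The output is always strictly between 0 and 1, which is what lets it be read as a probability.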

1 neuron example

[Diagram: inputs x_1, x_2, x_3 with weights w_1, w_2, w_3 and a +1 bias with weight b feed a summing function Σ, followed by an activation function that produces the output.]

Transfer / Activation functions

Common ones include:
- Threshold / step function: f(v) = 1 if v > c, else -1
- Sigmoid (S-shaped function), e.g. the logistic function: f(v) = 1 / (1 + e^{-v}), range (0, 1)
- Hyperbolic tangent (tanh): f(v) = (e^v - e^{-v}) / (e^v + e^{-v}), range (-1, 1)

[Figure: the sigmoid function a = 1 / (1 + e^{-z}) and the step function.]

Desirable properties:
- Monotonic, nonlinear, bounded
- Easily calculated derivative
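As a rough sketch of the activation functions listed above, the following plain-Python snippet implements each one together with the derivative of the two smooth ones. The function names and the default threshold `c=0.0` are assumptions for illustration; the derivative formulas f'(v) = f(v)(1 - f(v)) for the logistic function and f'(v) = 1 - f(v)^2 for tanh are the standard closed forms that make these derivatives "easily calculated".

```python
import math

def step(v, c=0.0):
    """Threshold function: +1 if v > c, else -1 (c = 0 is an assumed default)."""
    return 1.0 if v > c else -1.0

def logistic(v):
    """Logistic sigmoid, values strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-v))

def logistic_grad(v):
    """Derivative of the logistic function, written in terms of its own output."""
    s = logistic(v)
    return s * (1.0 - s)

def tanh(v):
    """Hyperbolic tangent, values strictly between -1 and 1."""
    return (math.exp(v) - math.exp(-v)) / (math.exp(v) + math.exp(-v))

def tanh_grad(v):
    """Derivative of tanh, written in terms of its own output."""
    t = tanh(v)
    return 1.0 - t * t

for v in (-2.0, 0.0, 2.0):
    print(v, step(v), logistic(v), tanh(v))
```

Because both derivatives reuse the forward value, a network can compute them during training at almost no extra cost.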

Perceptron: Another 1-Neuron Unit (special form of single-layer feed-forward)

The perceptron, first proposed by Rosenblatt (1958), is a simple neuron that is used to classify its input into one of two categories. A perceptron uses a step function that returns +1 if the weighted sum of its inputs is larger than or equal to 0, and -1 otherwise:

$$\varphi(v) = \begin{cases} +1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$$

[Diagram: inputs x_1, ..., x_n with weights w_1, ..., w_n and a bias b feed a summing function Σ that produces v; the activation (step) function φ(v) produces the output y.]
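A minimal sketch of the perceptron's decision rule, assuming hand-picked weights rather than learned ones. The weights below are hypothetical and were chosen so that the unit behaves like a logical AND on inputs in {0, 1}; they are not taken from the lecture.

```python
def perceptron(x, w, b):
    """Weighted sum v = w . x + b, then the step function: +1 if v >= 0, else -1."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if v >= 0 else -1

# Hypothetical weights and bias implementing logical AND.
w = [1.0, 1.0]
b = -1.5

outputs = [perceptron([x1, x2], w, b) for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)  # [-1, -1, -1, 1]
```

Only the input (1, 1) gives a non-negative weighted sum (0.5), so it is the only one mapped to +1.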

Single Layer Feed-forward, i.e. one layer of 4 output neurons

[Diagram: an input layer of source nodes fully connected to an output layer of four neurons, which produces the output vector y.]

Multi-Layer Perceptron (MLP)

String a lot of logistic units together. Example: 3 layer network:

[Diagram: inputs x_1, x_2, x_3 (Layer 1, input) feed hidden units a_1, a_2, a_3 (Layer 2, hidden), which feed the output y (Layer 3, output).]
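The 3 layer example above, three inputs, three hidden logistic units and one output unit, can be sketched as a forward pass in plain Python. The helper names `layer` and `mlp_forward` and all parameter values are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer of logistic units.

    weights[j] holds the incoming weights of unit j.
    """
    return [sigmoid(sum(w * x for w, x in zip(w_j, inputs)) + b_j)
            for w_j, b_j in zip(weights, biases)]

def mlp_forward(x, W1, b1, W2, b2):
    """Input layer -> hidden layer (a1, a2, a3) -> single output y."""
    a = layer(x, W1, b1)
    y = layer(a, W2, b2)[0]
    return a, y

# Hypothetical parameters, chosen only for illustration.
W1 = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1], [0.5, -0.5, 0.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.0]

a, y = mlp_forward([1.0, 0.0, -1.0], W1, b1, W2, b2)
print(a, y)
```

Each hidden unit is itself a logistic regression unit, so stacking layers amounts to feeding the outputs of one set of units into the next.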

Multi-Layer Perceptron (MLP)

Example: 4 layer network with 2 output units:

[Diagram: inputs x_1, x_2, x_3 (Layer 1, input), two hidden layers (Layer 2 and Layer 3), and an output layer (Layer 4) producing the output vector y.]

Types of Neural Networks (according to different attributes)
- Connection type, e.g.
  - Static (feed-forward)
  - Dynamic (feedback)
- Topology, e.g.
  - Single layer
  - Multilayer
  - Recurrent
  - Recursive
  - Self-organized
- Learning methods, e.g.
  - Supervised
  - Unsupervised

You can go very crazy with designing such neural networks. (Prof. Nando de Freitas' slide)

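When for regression

Training NN in order to minimize the network total sum squared error (SSE).

[Diagram: weights v_ij connect input i to hidden unit j, weights w_jk connect hidden unit j to output k, and the output error is (y_k - ŷ_k).]

When for classification (e.g. 1 neuron for binary output layer)

ŷ = p(y=1 | X, Θ)

For Bernoulli distribution, with p = p(y=1 | x):

$$p(y \mid x) = p^{y} (1 - p)^{1 - y}$$

$$E(\theta) = \mathrm{Loss}(\theta) = -\sum_{i=1}^{N} \log \Pr(y = y_i \mid X = x_i) = -\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Cross-entropy loss function, also named deviance.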
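The two training objectives above can be written down directly. The sketch below is a plain-Python illustration under assumed names (`sum_squared_error`, `binary_cross_entropy`) and made-up labels and predictions; it assumes every prediction lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def sum_squared_error(y_true, y_pred):
    """Regression objective: sum over examples of (y - y_hat)^2."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))

def binary_cross_entropy(y_true, y_pred):
    """Negative Bernoulli log-likelihood for labels in {0, 1}.

    Assumes each prediction is strictly between 0 and 1.
    """
    return -sum(y * math.log(yh) + (1 - y) * math.log(1 - yh)
                for y, yh in zip(y_true, y_pred))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.6]

sse = sum_squared_error(y_true, y_pred)
bce = binary_cross_entropy(y_true, y_pred)
print(sse, bce)
```

For these values the cross-entropy equals -log(0.9 * 0.8 * 0.6), the negative log of the probability the model assigns to the observed labels.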

When for multi-class classification (last output layer: softmax layer)

[Diagram: output units y_1, y_2, y_3 computed from the inputs z_1, z_2, z_3.]

When multi-class output, the last layer is a softmax output layer, i.e. a multinomial logistic regression unit.

Review: Multi-class variable representation

Multi-class variable: an indicator basis vector representation. If the output variable G has K classes, there will be K indicator variables y_i, with one fitted function per class, e.g. f_1(x), f_2(x), f_3(x), f_4(x).

How to classify to multi-class?

Strategy I: learn K different regression functions, then max.

Strategy II: Use softmax layer function for multi-class classification

[Diagram: softmax output units y_1, y_2, y_3 computed from z_1, z_2, z_3.]

$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$$

Use softmax layer function for multi-class classification

The natural cost function is the negative log prob of the right answer, i.e. the cross-entropy loss function:

$$E_x = -\sum_{j=1}^{K} y_j^{\text{true}} \ln y_j = -\sum_{j=1}^{K} y_j^{\text{true}} \ln p(y_j = 1 \mid x)$$

$$\frac{\partial E}{\partial z_i} = \sum_j \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - y_i^{\text{true}}$$

The error is calculated from output vs. true. The steepness of function E exactly balances the flatness of the softmax.
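The gradient identity for softmax with cross-entropy (output minus true indicator) can be checked numerically. The following plain-Python sketch compares it with central finite differences; the function names, the logits `z` and the one-hot `target` are assumptions for illustration. Subtracting the maximum logit before exponentiating is a common numerical-stability habit that does not change the result.

```python
import math

def softmax(z):
    m = max(z)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Negative log probability of the right answer, for a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(target, softmax(z)))

def grad_wrt_logits(z, target):
    """Analytic gradient dE/dz_i = y_i - true y_i."""
    return [p - t for p, t in zip(softmax(z), target)]

# Hypothetical logits and one-hot target.
z = [2.0, 1.0, 0.1]
target = [1.0, 0.0, 0.0]

analytic = grad_wrt_logits(z, target)

eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_plus[i] += eps
    z_minus = list(z)
    z_minus[i] -= eps
    numeric.append((cross_entropy(z_plus, target) - cross_entropy(z_minus, target)) / (2 * eps))

print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places, which is the practical meaning of the steepness of E balancing the flatness of the softmax.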

A special case of softmax for two classes

$$y_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_0}} = \frac{1}{1 + e^{-(z_1 - z_0)}}$$

So the logistic is just a special case that avoids using redundant parameters:
- Adding the same constant to both z_1 and z_0 has no effect.
- The over-parameterization of the softmax is because the probabilities must add to 1.
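A quick numerical check of the two-class special case, as a sketch with assumed function names and arbitrary values for z_1, z_0 and the shift c:

```python
import math

def softmax2(z1, z0):
    """Probability of class 1 under a two-class softmax."""
    return math.exp(z1) / (math.exp(z1) + math.exp(z0))

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

# Arbitrary illustrative values.
z1, z0, c = 1.3, -0.4, 5.0

p_softmax = softmax2(z1, z0)
p_logistic = logistic(z1 - z0)        # depends only on the difference z1 - z0
p_shifted = softmax2(z1 + c, z0 + c)  # the same constant added to both inputs

print(p_softmax, p_logistic, p_shifted)
```

All three values coincide up to floating-point rounding, so only the difference z_1 - z_0 matters and one of the two inputs is redundant.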

Backpropagation
- Using backward recurrence to jointly optimize all parameters
- Requires all activation functions to be differentiable
- Enables flexible design in deep model architecture
- Gradient descent is used to (locally) minimize the objective:

$$W^{k+1} = W^{k} - \eta \frac{\partial E}{\partial W^{k}}$$

Y. LeCun et al., Efficient BackProp. Olivier Bousquet and Ulrike von Luxburg, Stochastic Learning.

Training a neural network

Given training set (x_1, y_1), (x_2, y_2), (x_3, y_3), ...
Adjust parameters (for every node) to make the predicted output close to the true label.
(Use gradient descent. Backpropagation algorithm. Susceptible to local optima.)

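Review: Linear Regression with Stochastic GD

We have the following descent rule:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( x_i^T \theta - y_i \right)^2$$

$$\theta_j^{t+1} = \theta_j^{t} - \alpha \left. \frac{\partial J(\theta)}{\partial \theta_j} \right|_{t}$$

How do we pick α?
1. Tuning set, or
2. Cross validation, or
3. Small for slow, conservative learning

Review: Stochastic GD

For LR (linear regression), we have the following descent rule:

$$\theta_j^{t+1} = \theta_j^{t} - \alpha \left. \frac{\partial J(\theta)}{\partial \theta_j} \right|_{t}$$

For a neural network, we have the delta rule:

$$\Delta w^{t} = -\eta \frac{\partial E}{\partial W^{t}}, \qquad W^{t+1} = W^{t} - \eta \frac{\partial E}{\partial W^{t}} = W^{t} + \Delta w^{t}$$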
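The stochastic descent rule for linear regression can be sketched as follows. This is a plain-Python illustration, not the course's reference code: the function name, the step size `alpha=0.1`, the number of epochs and the noise-free toy data generated from y = 1 + 2x are all assumptions. Each update uses the gradient of a single example's squared error, (x_i^T θ - y_i) x_i.

```python
import random

def sgd_linear_regression(xs, ys, alpha=0.1, epochs=500, seed=0):
    """Stochastic gradient descent on J = 0.5 * sum_i (x_i^T theta - y_i)^2."""
    rng = random.Random(seed)
    theta = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order each epoch
        for i in order:
            err = sum(t * x for t, x in zip(theta, xs[i])) - ys[i]
            # theta_j <- theta_j - alpha * (x_i^T theta - y_i) * x_ij
            theta = [t - alpha * err * x for t, x in zip(theta, xs[i])]
    return theta

# Toy, noise-free data from y = 1 + 2x; the first feature is a constant 1 (intercept).
xs = [[1.0, k / 10.0] for k in range(10)]
ys = [1.0 + 2.0 * x[1] for x in xs]

theta = sgd_linear_regression(xs, ys)
print(theta)
```

On this toy problem the estimate approaches the generating parameters (1, 2); with noisy data a constant step size would instead leave the estimate fluctuating around the least-squares solution, which is one reason α is usually tuned on held-out data.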

Backpropagation: back-propagation training algorithm
- Network activation: forward step
- Error propagation: backward step

For instance, for regression

For a sigmoid unit o, its derivative is o'(h) = o(h) * (1 - o(h)).

[Diagram: inputs x_1, x_2 and a +1 bias feed two hidden sigmoid units (weights w_1 to w_4, biases b_1, b_2, summed inputs h_1, h_2, outputs o_1, o_2); these and a +1 bias feed the output unit (weights w_5, w_6, bias b_3) that produces ŷ, which is compared with y through E(y, ŷ). The network is viewed as T = 4 stacked functions f_1, ..., f_4 with parameters Θ_1, ..., Θ_4.]

[Figures: forward step and error propagation (backward step) through T stacked layers. Source: Dr. Qi's CIKM 2012 paper/talk.]

Backpropagation
1. Initialize network with random weights
2. For all training cases (examples):
   a. present training inputs to the network and calculate the output of each layer, and of the final layer
   b. for all layers (starting with the output layer, back to the input layer):
      i. compare network output with correct output (error function)
      ii. adapt weights in current layer:

$$W^{t+1} = W^{t} - \eta \frac{\partial E}{\partial W^{t}}$$

Backpropagation Algorithm. Main idea: error in hidden layers

The ideas of the algorithm can be summarized as follows:
1. Compute the error term for the output units using the observed error.
2. From the output layer, repeat
   - propagating the error term back to the previous layer and
   - updating the weights between the two layers
   until the earliest hidden layer is reached.
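The steps above can be sketched for a small regression network with one hidden layer of sigmoid units and a linear output, trained on E = 0.5 (y - ŷ)^2. Everything named here (the parameter dictionary layout, `forward`, `backward`, `sgd_step`, `train`, the learning rate, the toy data) is an assumption for illustration. One simplification relative to the listed steps: all gradients are computed first and the weights are then updated together.

```python
import math
import random

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def forward(x, p):
    """Forward step: hidden sigmoid outputs o and a linear output y_hat."""
    o = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y_hat = sum(w * oj for w, oj in zip(p["W2"], o)) + p["b2"]
    return o, y_hat

def loss(x, y, p):
    """Squared error E = 0.5 * (y - y_hat)^2 for one example."""
    return 0.5 * (y - forward(x, p)[1]) ** 2

def backward(x, y, p):
    """Backward step: gradient of E with respect to every parameter."""
    o, y_hat = forward(x, p)
    delta_out = y_hat - y  # error term of the output unit, dE/dy_hat
    # Propagate the error term back through W2 and the sigmoid derivative o * (1 - o).
    delta_hid = [delta_out * w * oj * (1.0 - oj) for w, oj in zip(p["W2"], o)]
    return {"W2": [delta_out * oj for oj in o],
            "b2": delta_out,
            "W1": [[d * xi for xi in x] for d in delta_hid],
            "b1": delta_hid}

def sgd_step(x, y, p, eta):
    """Update rule W <- W - eta * dE/dW applied to all layers."""
    g = backward(x, y, p)
    p["W2"] = [w - eta * gw for w, gw in zip(p["W2"], g["W2"])]
    p["b2"] -= eta * g["b2"]
    p["W1"] = [[w - eta * gw for w, gw in zip(row, grow)]
               for row, grow in zip(p["W1"], g["W1"])]
    p["b1"] = [b - eta * gb for b, gb in zip(p["b1"], g["b1"])]

def train(data, hidden=2, eta=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: initialize the network with random weights.
    p = {"W1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)],
         "b1": [0.0] * hidden,
         "W2": [rng.uniform(-1, 1) for _ in range(hidden)],
         "b2": 0.0}
    # Step 2: for every training case, run the forward and backward steps.
    for _ in range(epochs):
        for x, y in data:
            sgd_step(x, y, p, eta)
    return p

# Hypothetical toy regression data: y = x1 + x2.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 2.0)]
params = train(data)
print(sum(loss(x, y, params) for x, y in data))
```

Since the objective is non-convex, the final error depends on the random initialization; a different seed may end at a different local optimum.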

Many classification models invented since the late 80's
- Neural networks
- Boosting
- Support Vector Machine
- Maximum Entropy
- Random Forest

Deep Learning in the 90's
- Prof. Yann LeCun invented Convolutional Neural Networks
- First NN successfully trained with many layers
- LeNet: early success at OCR

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11).

Between ~2000 to ~2011: Machine Learning Field Interest

Learning with structures:
- Kernel learning
- Transfer learning
- Semi-supervised
- Manifold learning
- Sparse learning
- Structured input-output learning
- Graphical model

Winter of Neural Networks: since the 90's to ~2010
- Non-convex
- Need a lot of tricks to play with
  - How many layers?
  - How many hidden units per layer?
  - What topology among layers?
- Hard to perform theoretical analysis

Large-Scale Visual Recognition Challenge

[Figure: bar chart of error rate (0% to 30%) for the entries ISI, OXFORD_VGG, XRCE/INRIA, University of Amsterdam, LEAR-XRCE and SuperVision; annotation: 10% improvement with deep CNN.]

Speech Recognition

[Figure: word error rate by year, on a scale from 50.0% down to 6.3%, for read and conversational speech.]

HMM-GMM hasn't been able to improve for 10 years!

Source: Huang et al., Communications of the ACM.

WHY BREAKTHROUGH?

How can we build a more intelligent computer / machine?

Able to
- perceive the world
- understand the world

This needs
- Basic speech capabilities
- Basic vision capabilities
- Language understanding
- User behavior / emotion understanding
- Able to think??

Plenty of Data
- Text: trillions of words of English + other languages
- Visual: billions of images and videos
- Audio: thousands of hours of speech per day
- User activity: queries, user page clicks, map requests, etc.
- Knowledge graph: billions of labeled relational triplets

(Dr. Jeff Dean's talk)

Deep Learning Way: Learning features / Representation from data

Feature Engineering
- Most critical for accuracy
- Accounts for most of the computation for testing
- Most time-consuming in the development cycle
- Often hand-crafted and task dependent in practice

Feature Learning
- Easily adaptable to new similar tasks
- Layerwise representation
- Layer-by-layer unsupervised training
- Layer-by-layer supervised training

DESIGN ISSUES for Deep NN
- Data representation
- Network topology
- Network parameters
- Training
  - Scaling up with graphics processors
  - Scaling up with asynchronous SGD
- Validation

Application I: Object Recognition / Image Labeling

Deep Convolutional Neural Network (CNN) won (as best systems) on the very large-scale ImageNet competition 2012 / 2013 / 2014 (training on 1.2 million images [X] vs. 1000 different word labels [Y]). 89%, 2013.

- 2013, Google acquired the Deep Neural Networks company headed by UToronto "Deep Learning" Professor Hinton
- 2013, Facebook built a new Artificial Intelligence Lab headed by NYU "Deep Learning" Professor LeCun

[Figure. Source: Olivier Grisel's talk.]

[Figures. Sources: Dr. Jeff Dean's talk; Olivier Grisel's talk.]

Application II: Semantic Understanding

Word embedding (e.g. Google word2vector / skip-gram)
- Learn to embed each word into a vector of real values
- Semantically similar words have closer embedding representations

Progress in 2013/14
- Can uncover semantic / syntactic word relationships

(Dr. Li Deng's talk; Dr. Jeff Dean's talk)
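To make "closer embedding representations" concrete, one common way to compare two word vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned from large text corpora and have far more dimensions, and neither the vectors nor the choice of cosine similarity come from the lecture.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (NOT learned embeddings), chosen only to illustrate the idea.
embedding = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

sim_related = cosine_similarity(embedding["king"], embedding["queen"])
sim_unrelated = cosine_similarity(embedding["king"], embedding["apple"])
print(sim_related, sim_unrelated)
```

With these toy vectors the related pair scores much higher than the unrelated pair, which is the behavior a trained embedding is meant to exhibit for semantically similar words.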

Application III: Deep Learning to Play Game

DeepMind: learning to play and win dozens of Atari games
- a new deep reinforcement learning algorithm

(Olivier Grisel's talk)

Application IV: Deep Learning to Execute and Program

Google Brain & NYU, October 2014
- RNN trained to map character representations of programs to outputs
- Can learn to emulate a simplistic Python interpreter limited to one-pass programs with O(n) complexity

(Olivier Grisel's talk)

Summary
- Basic Neural Network (NN)
  - single neuron, e.g. logistic regression unit
  - multilayer perceptron (MLP)
  - for multi-class classification, softmax layer
  - More about training NN
- Deep CNN, Deep learning
  - History
  - Why is this a breakthrough?
  - Recent applications

References
- Dr. Yann LeCun's deep learning tutorials
- Dr. Li Deng's ICML 2014 deep learning tutorial
- Dr. Kai Yu's deep learning tutorial
- Dr. Rob Fergus' deep learning tutorial
- Prof. Nando de Freitas' slides
- Olivier Grisel's talk at Paris Data Geeks / Open World Forum
- Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. No. 1. New York: Springer.
