Artificial Neural Networks

1 Artificial Neural Networks. Stephan Dreiseitl, University of Applied Sciences Upper Austria at Hagenberg. Harvard-MIT Division of Health Sciences and Technology, HST.951J: Medical Decision Support

2 Knowledge: textbook, verbal, rules, rule-based systems; experience, non-verbal, patterns, pattern recognition

3 A real-life situation

4 and its abstraction (f, 30,1,0,67.8,12.2, ) (m, 52,1,1,57.4,8.9, ) (m, 28, 1,1,51.1,19.2, ) (f, 46, 1,1,16.3,9.5.2, ) (m, 65,1,0,56.1,17.4, ) (m, 38, 1,0,22.8,19.2, ) Model(p)

5 Another real-life situation (figure labels: benign lesion, malignant lesion)

6 Example: Logistic regression $y = \frac{1}{1 + e^{-(b_1 x_1 + b_2 x_2 + b_0)}}$
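The logistic regression model on this slide can be evaluated directly. The following is a minimal sketch; the function name `logistic` and the coefficient values are hypothetical and chosen only so that the result is easy to check by hand.

```python
import math

def logistic(x1, x2, b0, b1, b2):
    """Logistic regression output y = 1 / (1 + exp(-(b1*x1 + b2*x2 + b0)))."""
    z = b1 * x1 + b2 * x2 + b0
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only.
y = logistic(x1=1.0, x2=2.0, b0=-1.0, b1=0.5, b2=0.25)
print(y)  # z = 0.5 + 0.5 - 1.0 = 0, so y = 0.5
```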

7 So why use ANNs? The human brain is good at pattern recognition. Mimic structure and processing of the brain: parallel processing, distributed representation. Expect: fault tolerance, good generalization capability, more flexibility than logistic regression.

8 Overview: Motivation; Perceptrons; Multilayer perceptrons; Improving generalization; Bayesian perspective

9 Terminology (ANN term = statistics term): input = covariate; output = dependent variable; weights = parameters; learning = estimation

10 ANN topology

11 Artificial neurons

12 Activation functions

13 Hyperplanes: A vector $w = (w_1, \dots, w_n)$ defines a hyperplane. The hyperplane divides the n-space of points $x = (x_1, \dots, x_n)$ into: $w_1 x_1 + \dots + w_n x_n > 0$; $w_1 x_1 + \dots + w_n x_n = 0$ (the plane itself); $w_1 x_1 + \dots + w_n x_n < 0$. Abbreviation: $w \cdot x := w_1 x_1 + \dots + w_n x_n$

14 Linear separability: Hyperplane through the origin: $w \cdot x = 0$. A bias $w_0$ moves the hyperplane away from the origin: $w \cdot x + w_0 = 0$
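As a small illustration of how a hyperplane with bias splits the space, the sketch below reports on which side of $w \cdot x + w_0 = 0$ a point lies. The helper names `dot` and `side` and the example weights are assumptions made for this example, not part of the slides.

```python
def dot(w, x):
    """Inner product w . x = w_1 x_1 + ... + w_n x_n."""
    return sum(wi * xi for wi, xi in zip(w, x))

def side(w, w0, x):
    """Return +1, 0 or -1 depending on the sign of w . x + w0."""
    s = dot(w, x) + w0
    return (s > 0) - (s < 0)

# Hypothetical hyperplane in 2D: the line x1 + x2 = 1.
w, w0 = [1.0, 1.0], -1.0
print(side(w, w0, [2.0, 2.0]))  # 1  (positive side)
print(side(w, w0, [0.5, 0.5]))  # 0  (on the plane itself)
print(side(w, w0, [0.0, 0.0]))  # -1 (negative side)
```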

15 Linear separability: Convention: $w := (w_0, w)$, $x := (1, x)$. Class labels $t_i \in \{+1, -1\}$. Error measure $E = -\sum_{i\,\text{miscl.}} t_i (w \cdot x_i)$. How to minimize $E$?

16 Linear separability: Error measure $E = -\sum_{i\,\text{miscl.}} t_i (w \cdot x_i) \ge 0$. The hyperplane separates the half-spaces $\{x \mid w \cdot x > 0\}$ and $\{x \mid w \cdot x < 0\}$

17 Gradient descent: Simple function minimization algorithm. The gradient is the vector of partial derivatives. The negative gradient is the direction of steepest descent.
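A minimal sketch of gradient descent on a simple quadratic function may help make the idea concrete. The test function, the step size `eta` and the number of steps are assumptions chosen for illustration.

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - eta * grad(w)."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical test function f(w) = (w0 - 3)^2 + (w1 + 1)^2,
# whose gradient is (2 (w0 - 3), 2 (w1 + 1)) and whose minimum is at (3, -1).
def grad_f(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w_min = gradient_descent(grad_f, [0.0, 0.0])
print(w_min)  # close to [3, -1]
```

With this step size each iteration shrinks the distance to the minimum by a constant factor; a step size that is too large would make the iteration diverge.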

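18 Perceptron learning: Find the minimum of $E$ by iterating $w_{k+1} = w_k - \eta \operatorname{grad}_w E$, with $E = -\sum_{i\,\text{miscl.}} t_i (w \cdot x_i)$ and $\operatorname{grad}_w E = -\sum_{i\,\text{miscl.}} t_i x_i$. Online version: pick a misclassified $x_i$ and set $w_{k+1} = w_k + \eta\, t_i x_i$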

19 Perceptron learning: Update rule $w_{k+1} = w_k + \eta\, t_i x_i$. Theorem: perceptron learning converges for linearly separable sets.
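The online update rule can be sketched in a few lines. This is an illustrative sketch under stated assumptions: inputs already carry the leading 1 from the convention $x := (1, x)$, points lying exactly on the hyperplane are treated as misclassified, and the toy data set (a logical AND with labels +1 and -1) is hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_error(w, X, t):
    """E = - sum over misclassified i of t_i (w . x_i); zero if none are misclassified."""
    return -sum(ti * dot(w, x) for x, ti in zip(X, t) if ti * dot(w, x) < 0)

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Online perceptron learning: w <- w + eta * t_i * x_i for misclassified x_i."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        updates = 0
        for x, ti in zip(X, t):
            if ti * dot(w, x) <= 0:  # misclassified, or exactly on the plane
                w = [wj + eta * ti * xj for wj, xj in zip(w, x)]
                updates += 1
        if updates == 0:  # a full pass without updates: all points separated
            break
    return w

# Hypothetical linearly separable data (logical AND), with leading 1 for the bias.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
t = [-1, -1, -1, +1]
w = perceptron_train(X, t)
print(w, perceptron_error(w, X, t))
```

Because the data are linearly separable, the convergence theorem guarantees that the loop stops after finitely many updates; for non-separable data only the `max_epochs` limit ends it.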

20 From perceptrons to multilayer perceptrons: Why?

21 Multilayer perceptrons: Sigmoidal hidden layer. Can represent arbitrary decision regions. Can be trained similarly to perceptrons.
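To illustrate why a sigmoidal hidden layer helps, the sketch below hard-codes a tiny network with two hidden units that reproduces XOR, a decision region that no single hyperplane can represent. The weights are hand-picked assumptions for this example, not trained values and not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp(x1, x2):
    """Two sigmoidal hidden units feeding one sigmoidal output unit."""
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)    # roughly "x1 OR x2"
    h_and = sigmoid(20 * x1 + 20 * x2 - 30)   # roughly "x1 AND x2"
    return sigmoid(20 * h_or - 20 * h_and - 10)  # "OR but not AND" = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(mlp(x1, x2)))
```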

22 Decision theory: Pattern recognition is not deterministic and needs the language of probability theory. Given abstraction $x$: $x$, model, $C_1$ or $C_2$. Decide $C_1$ if $P(C_1 \mid x) > P(C_2 \mid x)$

23 Some background math: Have data set $D = \{(x_i, t_i)\}$ drawn from probability distribution $P(x, t)$. Model $P(x, t)$ given samples $D$ by an ANN with adjustable parameters $w$. Statistics analogy:

24 Some background math: Maximize the likelihood of data $D$. Likelihood $L = \prod_i p(x_i, t_i) = \prod_i p(t_i \mid x_i)\, p(x_i)$. Minimize $-\log L = -\sum_i \log p(t_i \mid x_i) - \sum_i \log p(x_i)$. Drop the second term: it does not depend on $w$. Two cases: regression and classification.

25 Likelihood for regression: For regression, targets $t$ are real values. Minimize $-\sum_i \log p(t_i \mid x_i)$. Assume the targets $t_i$ are the network outputs $y(x_i, w)$ plus Gaussian noise. Minimizing $-\log L$ is then equivalent to minimizing $\sum_i (y(x_i, w) - t_i)^2$ (sum-of-squares error).
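The step from the likelihood to the sum-of-squares error can be written out as follows. This is a sketch under the assumption of Gaussian noise with a fixed variance $\sigma^2$ and $N$ data points; the symbols $\sigma$ and $N$ are introduced here and do not appear on the slide.

```latex
p(t_i \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}
                  \exp\!\left(-\frac{(t_i - y(x_i, w))^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\sum_{i=1}^{N} \log p(t_i \mid x_i)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(y(x_i, w) - t_i\bigr)^2
    + \frac{N}{2} \log\bigl(2\pi\sigma^2\bigr)
```

The last term and the factor $1/(2\sigma^2)$ do not depend on $w$, so the minimizing weights are those that minimize the sum of squares.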

26 Likelihood for classification: For classification, targets $t$ are class labels, here $t_i \in \{0, 1\}$. Minimize $-\sum_i \log p(t_i \mid x_i)$. Assume the network outputs $y(x_i, w)$ are $P(C_1 \mid x_i)$. Minimizing $-\log L$ is then equivalent to minimizing $-\sum_i \left[ t_i \log y(x_i, w) + (1 - t_i) \log(1 - y(x_i, w)) \right]$ (cross-entropy error).
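The cross-entropy error is short to compute once network outputs are available. The sketch below assumes targets coded as 0 or 1 and outputs strictly between 0 and 1; the function name and the example values are hypothetical.

```python
import math

def cross_entropy(y, t):
    """E = - sum_i [ t_i log y_i + (1 - t_i) log(1 - y_i) ], with t_i in {0, 1}."""
    return -sum(ti * math.log(yi) + (1 - ti) * math.log(1 - yi)
                for yi, ti in zip(y, t))

targets = [1, 0, 1]
confident = cross_entropy([0.9, 0.1, 0.8], targets)   # outputs agree with targets
uncertain = cross_entropy([0.5, 0.5, 0.5], targets)   # uninformative outputs
print(confident, uncertain)  # the first value is the smaller one
```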

27 Backpropagation algorithm: Minimize the error function by gradient descent: $w_{k+1} = w_k - \eta \operatorname{grad}_w E$. Iterative gradient calculation by propagating error signals.

28 Backpropagation algorithm: Problem: how to set the learning rate $\eta$? Better: use more advanced minimization algorithms (second-order information).

29 Backpropagation algorithm: Classification: cross-entropy error. Regression: sum-of-squares error.
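A minimal sketch of backpropagation for one training pattern is given below. It assumes a network with one sigmoidal hidden layer, a single sigmoidal output and the cross-entropy error, so that the output error signal is simply $y - t$. All function names, the parameter layout and the numeric values are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return hidden activations h and network output y."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, y

def error(params, x, t):
    """Cross-entropy error for one pattern, t in {0, 1}."""
    _, y = forward(params, x)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def backprop(params, x, t):
    """Gradient of the error for one pattern, obtained by propagating error signals."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    d_out = y - t                                   # error signal at the output
    d_hid = [d_out * w * hj * (1 - hj) for w, hj in zip(W2, h)]  # propagated back
    grad_W1 = [[dj * xi for xi in x] for dj in d_hid]
    grad_W2 = [d_out * hj for hj in h]
    return grad_W1, list(d_hid), grad_W2, d_out

def gradient_step(params, grads, eta):
    """One gradient descent update w <- w - eta * grad E."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    W1 = [[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    b1 = [b - eta * g for b, g in zip(b1, gb1)]
    W2 = [w - eta * g for w, g in zip(W2, gW2)]
    return W1, b1, W2, b2 - eta * gb2

# Hypothetical starting weights and a single training pattern.
params = ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.1], [0.5, -0.5], 0.0)
x, t = [1.0, 2.0], 1
before = error(params, x, t)
params = gradient_step(params, backprop(params, x, t), eta=0.1)
after = error(params, x, t)
print(before, after)  # the error decreases after the step
```

For a regression network with a linear output and sum-of-squares error, the output error signal takes the same form up to a constant factor, which is one reason these pairings of output function and error function are convenient.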

30 ANN output for regression: Mean of $p(t \mid x)$

31 ANN output for classification: $P(t = 1 \mid x)$

32 Improving generalization: Problem: memorizing $(x, t)$ combinations ("overtraining")

33 Improving generalization: Need a test set to judge performance. Goal: represent the information in the data set, not the noise. How to improve generalization? Limit network topology; early stopping; weight decay.

34 Limit network topology: Idea: fewer weights mean less flexibility. Analogy to polynomial interpolation:

35 Limit network topology

36 Early stopping: Idea: stop training when the information (but not the noise) is modeled. Need a validation set to determine when to stop training.
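Early stopping can be sketched as a training loop that monitors the validation error. The `patience` criterion, the callback interface and the simulated validation curve below are assumptions for illustration; the slides do not specify a particular stopping rule.

```python
def early_stopping(train_step, validation_error, max_epochs=1000, patience=5):
    """Train until the validation error has not improved for `patience` epochs.

    Returns the (0-based) epoch with the lowest validation error and that error.
    In practice one would also save the weights from that epoch.
    """
    best_err, best_epoch, since_best = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_step()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_epoch, best_err

# Simulated run: validation error falls, then rises again as noise gets modeled.
state = {"steps": 0}

def fake_train_step():
    state["steps"] += 1

def fake_validation_error():
    return (state["steps"] - 10) ** 2 + 1  # minimum after the 10th training step

best_epoch, best_err = early_stopping(fake_train_step, fake_validation_error)
print(best_epoch, best_err, state["steps"])  # 9 1 15
```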

37 Early stopping

38 Weight decay: Idea: control the smoothness of the network output by controlling the size of the weights. Add the penalty term $\alpha \lVert w \rVert^2$ to the error function.
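The effect of the penalty on a gradient descent update can be sketched as follows, assuming the penalty is $\alpha$ times the squared norm of the weights, so that it contributes $2 \alpha w$ to the gradient. The function names and numeric values are hypothetical.

```python
def penalized_error(E, w, alpha):
    """Error function plus weight decay penalty alpha * ||w||^2."""
    return E + alpha * sum(wi * wi for wi in w)

def weight_decay_step(w, grad_E, eta, alpha):
    """Gradient step on the penalized error: w <- w - eta * (grad E + 2 alpha w)."""
    return [wi - eta * (gi + 2 * alpha * wi) for wi, gi in zip(w, grad_E)]

# With a zero data gradient, every weight simply shrinks by the factor (1 - 2 eta alpha).
w_new = weight_decay_step([1.0, -2.0], [0.0, 0.0], eta=0.1, alpha=0.5)
print(w_new)  # approximately [0.9, -1.8]
```

This shrinking toward zero at every step is where the name "weight decay" comes from.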

39 Weight decay

40 Bayesian perspective: Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution $w_{ML}$. This can lead to overtraining. Bayesian approach: consider the weight posterior distribution $p(w \mid D)$. Advantage: error bars for regression, averaged estimates for classification.

41 Bayesian perspective: Posterior = likelihood × prior, normalized: $p(w \mid D) = p(D \mid w)\, p(w) / p(D)$. Two approaches to approximating $p(w \mid D)$: sampling; Gaussian approximation.

42 Sampling from $p(w \mid D)$: prior × likelihood = posterior (up to normalization)

43 Gaussian approx. to $p(w \mid D)$: Find the maximum $w_{MAP}$ of $p(w \mid D)$. Approximate $p(w \mid D)$ by a Gaussian around $w_{MAP}$. Fit curvature:

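44 Gaussian approx. to $p(w \mid D)$: Maximizing $p(w \mid D)$ = minimizing $-\log p(w \mid D)$ = minimizing $-\log p(D \mid w) - \log p(w)$ (the constant $\log p(D)$ is dropped). Minimizing the first term alone finds the ML solution. The second term, for a zero-mean Gaussian prior $p(w)$, adds the term $\alpha \lVert w \rVert^2$. Therefore, adding weight decay amounts to finding the MAP solution!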
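The link between a Gaussian prior and weight decay can be written out in one line. This sketch assumes the prior is parameterized as $p(w) \propto \exp(-\alpha \lVert w \rVert^2)$, that is, a zero-mean Gaussian whose variance is $1/(2\alpha)$ per weight, and it writes $E(w)$ for the error function $-\log p(D \mid w)$; this parameterization is an assumption chosen to match the penalty term.

```latex
-\log p(w \mid D)
  = -\log p(D \mid w) - \log p(w) + \log p(D)
  = E(w) + \alpha \lVert w \rVert^2 + \text{const}
\quad\Longrightarrow\quad
w_{MAP} = \arg\min_w \left[ E(w) + \alpha \lVert w \rVert^2 \right]
```

A broad prior (small $\alpha$) gives a weak penalty and a solution close to $w_{ML}$; a narrow prior (large $\alpha$) pulls the weights more strongly toward zero.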

45 Bayesian example for regression

46 Bayesian example for classification

47 Summary: ANNs are inspired by the functionality of the brain. Nonlinear data model. Trained by minimizing an error function. Goal is to generalize well. Avoid overtraining. Distinguish ML and MAP solutions.

48 Pointers to the literature:
Lisboa PJ. A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Netw Jan;15(1):
Almeida JS. Predictive non-linear modeling of complex data by artificial neural networks. Curr Opin Biotechnol Feb;13(1):72-6.
Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med Aug;23(1):
Dayhoff JE, DeLeo JM. Artificial neural networks: opening the black box. Cancer Apr 15;91(8 Suppl):
Basheer IA, Hajmeer M. Artificial neural networks: fundamentals, computing, design, and application. J Microbiol Methods Dec 1;43(1):3-31.
Bishop CM. Neural Networks for Pattern Recognition. Oxford University Press 1995.
