Medical Image Recognition Linwei Wang



2 Content Classification methods in medical images: Statistical classification: Bayes method. Distance-based: nearest neighbor method. Region-based: support vector machine. AI: neural network. Dimensionality reduction: principal component analysis.

3 Typical Image Recognition Problems Segment the ROIs from an image and classify them (e.g., cell analysis). Diagnostic support: given medical data, classify the patient.

4 General Model Given data D, identify the classes I plus the neutral identification $i_0$ (the lack of a decision): $D \to I \cup \{i_0\}$. Feature extraction: $D \to X$; classification: $X \to \mathbb{R}^L$; decision: $\mathbb{R}^L \to I \cup \{i_0\}$. Pipeline: acquisition, processing, segmentation, features, classification.

5 Classification Feature vector $x = (x_1, \ldots, x_n)^T$. Classes: $\omega_1, \omega_2, \ldots, \omega_M$. Construct M functions $d_i: x \to \mathbb{R}$. Classify an element x as class i if $d_i(x) < d_j(x)$ for all $j \neq i$. The border between two classes is given by $d_i(x) = d_j(x)$.

6 Error Functions How can the functions d i be constructed? Statistical classification Nearest Neighbor classification Support Vector Machines Artificial Neural Networks

7 Bayes Statistical Classification Assume we can calculate the probability that x comes from class $\omega_j$: $P(\omega_j \mid x)$. If we classify x as class $\omega_j$ when it is $\omega_i$, we incur a loss L(i, j). The mean loss when classifying x as $\omega_j$ is $r_j(x) = \sum_{k=1}^{M} L(k, j)\, P(\omega_k \mid x)$. Discussion: why should L(i, j) and L(j, i) differ?

8 Bayesian Classification Assume clinical data x and two classes. $\omega_1$: impending heart attack, keep patient under surveillance. $\omega_2$: healthy, send patient home. What is the cost of misclassifying a patient as class 1 when he/she is of class 2 (L(2,1))? What is the cost of misclassifying a patient as class 2 when he/she is of class 1 (L(1,2))? Should L(1,2) and L(2,1) be the same?

9 Bayesian Classification Using Bayes' rule, $P(\omega_k \mid x) = \frac{p(x \mid \omega_k) P(\omega_k)}{p(x)}$, so we have $r_j(x) = \frac{1}{p(x)} \sum_{k=1}^{M} L(k, j)\, p(x \mid \omega_k) P(\omega_k)$.

10 Example Assume a medical test for a disease. The test is positive with probability 1 for sick patients. The test is positive with probability 0.01 for healthy patients. 0.2% of the population is sick. Classes: $\omega_1$: sick; $\omega_2$: healthy. Data: test results. What is the probability that a patient is sick given a positive test?

11 Example The test is positive with probability 1 for sick patients: $P(a \mid \omega_1) = 1$. The test is positive with probability 0.01 for healthy patients: $P(a \mid \omega_2) = 0.01$. 0.2% of the population is sick: $P(\omega_1) = 0.002$. What is the probability that a patient is sick given a positive test? $P(\omega_1 \mid a) = \frac{P(a \mid \omega_1)P(\omega_1)}{P(a)} = \frac{P(a \mid \omega_1)P(\omega_1)}{P(a \mid \omega_1)P(\omega_1) + P(a \mid \omega_2)P(\omega_2)} = \frac{1 \times 0.002}{1 \times 0.002 + 0.01 \times 0.998} \approx 0.17$
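
A quick Python check of this arithmetic (the probabilities are the ones given above):

```python
# Minimal check of the worked example above: P(sick | positive test).
p_pos_given_sick = 1.0       # P(a | w1)
p_pos_given_healthy = 0.01   # P(a | w2)
p_sick = 0.002               # P(w1), 0.2% of the population

p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(f"P(w1 | a) = {p_sick_given_pos:.3f}")   # ~0.167
```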

12 Bayesian Classification Classify x as $\omega_j$ if $r_i(x) > r_j(x)$ for all $i \neq j$. Without any prior knowledge about the application, it is common to choose the 0-1 loss $L(i, j) = 0$ if $i = j$ and $1$ if $i \neq j$, for which one can use as decision function (MAP classification) $d_j(x) = P(\omega_j \mid x) \propto p(x \mid \omega_j) P(\omega_j)$.

13 Example Assume 3 classes of 2x2 binary images: $\omega_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, $p_1 = 1/4$; $\omega_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $p_2 = 1/4$; $\omega_3 = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$, $p_3 = 1/2$. Assume there is a 10% chance of error in each pixel and that errors are independent. What is the optimal classification of $x = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$?

14 Solution With the templates and priors above and $x = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$: $P(\omega_1 \mid x) \propto p(x \mid \omega_1)p(\omega_1) = 0.25 \times 0.1 \times 0.9^3 \approx 0.018$; $P(\omega_2 \mid x) \propto p(x \mid \omega_2)p(\omega_2) = 0.25 \times 0.1^3 \times 0.9 \approx 0.0002$; $P(\omega_3 \mid x) \propto p(x \mid \omega_3)p(\omega_3) = 0.5 \times 0.1^2 \times 0.9^2 \approx 0.004$. The first image ($\omega_1$) is the most probable.
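
A minimal numpy sketch reproducing this MAP computation (templates, priors, and the 10% pixel-error rate are taken from the example):

```python
import numpy as np

# MAP classification of a noisy 2x2 binary image against three templates,
# with 10% independent pixel-error probability and priors 1/4, 1/4, 1/2.
templates = [np.array([[0, 1], [1, 0]]),
             np.array([[1, 0], [0, 1]]),
             np.array([[1, 1], [1, 0]])]
priors = [0.25, 0.25, 0.5]
eps = 0.1                               # per-pixel error probability
x = np.array([[0, 1], [1, 1]])          # observed image

for k, (t, p) in enumerate(zip(templates, priors), start=1):
    n_err = np.sum(t != x)              # number of flipped pixels
    likelihood = (eps ** n_err) * ((1 - eps) ** (4 - n_err))
    print(f"class {k}: p(x|w{k}) P(w{k}) = {likelihood * p:.5f}")
# class 1 gives the largest value (~0.018), so x is assigned to w1
```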

15 Probability Distribution In practice, it is common to model the features of each class as a Gaussian distribution, $p(x \mid \omega_i) = N(m_i, \Sigma_i)$. Training: estimate the parameters from training data. Classification: assuming the model is correct and the estimates are exact, use Bayes' formula to estimate $P(\omega_j \mid x)$ and classify according to minimum loss or MAP.

16 Density Estimation Nonparametric Methods From training data, estimate the density for each class: $\{x_1, x_2, \ldots, x_n\} \to p(x)$. The probability that a vector x, drawn from p(x), falls into region R of the sample space is $P = \int_R p(x')\,dx'$. When n vectors are observed, the probability that k of them fall into R is $P(k) = \binom{n}{k} P^k (1 - P)^{n-k}$.

17 Density Estimation According to the properties of the binomial distribution, $E[k/n] = P$ and $\mathrm{Var}[k/n] = \frac{P(1 - P)}{n}$. As n increases, k/n becomes a good estimator of P. If the region R is so small that p does not vary significantly within it (V is its volume): $\frac{k}{n} \approx P = \int_R p(x')\,dx' \approx p(x)\,V$, so $p(x) \approx \frac{k}{nV}$.

18 Density Estimation How to obtain the regions R? Either specify $V_n$ as a function of n, e.g., $V_n = 1/\sqrt{n}$ (Parzen window), or specify $k_n$ as a function of n, e.g., $k_n = \sqrt{n}$, and use the $V_n$ such that $k_n$ samples are contained in the neighborhood ($k_n$-nearest-neighbor estimation).

19 Density Estimation

20 Histogram If there is a large amount of training data, one may estimate the density function using histograms. The histogram is close to, but not a true, density estimate. It partitions the sample space into bins and approximates the density at the center of each bin. For bin $b_j$, the density is defined as $\hat{p}(x \mid \omega_i) = \frac{k_j / n_i}{V}$ for $x \in b_j$, where $k_j$ is the number of samples in the bin, $n_i$ the number of training samples of class $\omega_i$, and V the bin volume. Within each bin, the density is assumed to be constant.
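
As an illustration of the $k_j/(nV)$ idea, here is a small sketch of 1-D histogram density estimation; the Gaussian sample data and the number of bins are arbitrary choices for the example:

```python
import numpy as np

# Histogram density estimate in 1-D: density per bin = count / (n * bin width).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

bins = np.linspace(-4, 4, 33)                   # 32 equal-width bins
counts, edges = np.histogram(samples, bins=bins)
bin_width = edges[1] - edges[0]                 # V, the bin "volume" in 1-D
density = counts / (len(samples) * bin_width)   # k_j / (n * V) for each bin

centers = 0.5 * (edges[:-1] + edges[1:])
print(density[np.argmin(np.abs(centers))])      # estimate near x = 0, ~0.4
```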

21 Parzen Window Define a window function $\varphi(u) = 1$ if $|u_j| \le 1/2$ for $j = 1, \ldots, d$, and 0 otherwise, which is a unit hypercube centered at the origin. The volume is $V_n = h_n^d$ for a hypercube with edge length $h_n$. $\varphi((x - x_i)/h_n) = 1$ if $x_i$ falls within the hypercube of volume $V_n$ centered at x, and 0 otherwise.

22 Parzen Window The number of samples in the hypercube is $k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$. The estimate of p(x) is $p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$.
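
A minimal 1-D sketch of this Parzen-window estimate (hypercube window); the sample data and the window width h are arbitrary choices for illustration:

```python
import numpy as np

# 1-D Parzen-window density estimate with a unit hypercube window.
def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h in 1-D."""
    u = (x - samples) / h
    phi = (np.abs(u) <= 0.5).astype(float)        # unit hypercube window
    return phi.sum() / (len(samples) * h)

rng = np.random.default_rng(1)
samples = rng.normal(size=500)                    # 1-D training samples
# A common choice shrinks the window as h_n = h_1 / sqrt(n); here h is fixed at 0.5.
print(parzen_estimate(0.0, samples, h=0.5))       # density estimate near 0, ~0.4
```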

23 Parzen Window The window function can be general

24 Parzen Window

25 Parzen Window p(x): standard normal; $h_n = h_1 / \sqrt{n}$.

26 Parzen Window p(x): bimodal distribution (mixture of uniform and triangle densities).

27 Parzen Window p(x): standard normal.

28 Bayesian Classification Estimate the density for each class using a parametric or non-parametric method. Construct a Bayes classifier using the densities. Classify the test object based on the posterior probability and the loss function.

29 Nearest Neighbor Classification Distance-based method. Save the entire training set $T = \{(x_1, \omega_1), \ldots, (x_N, \omega_N)\}$. During classification, we compare the distance between the feature vector x and the ones in the training set, and put $k = \arg\min_i d(x_i, x)$, $\omega = \omega_k$.

30 K-Nearest Neighbor Classification A generalization of the NN method. Find the k nearest neighbors of the vector x. Classify it to the class with the most representatives among these k neighbors, if that number is above a threshold l; otherwise classify it as unknown ($i_0$). What is the price paid for being algorithmically simple compared to the Bayes method?
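
A minimal sketch of k-NN classification with a simple rejection rule, assuming Euclidean distance; the toy data, k, and vote threshold are arbitrary:

```python
import numpy as np

# k-NN classification; returns None (the "unknown" decision i_0) when the
# majority among the k neighbours is not large enough.
def knn_classify(x, train_x, train_y, k=3, min_votes=2):
    dists = np.linalg.norm(train_x - x, axis=1)      # d(x_i, x) for all i
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest
    labels, votes = np.unique(train_y[nearest], return_counts=True)
    best = np.argmax(votes)
    return labels[best] if votes[best] >= min_votes else None

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), train_x, train_y, k=3))   # -> 0
```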

31 Support Vector Machine Region-based method. Assume we want to classify features into two classes, $\omega = 1$ or $-1$. Training dataset $T = \{(x_1, \omega_1), \ldots, (x_N, \omega_N)\}$, $x_i \in \mathbb{R}^d$, $\omega_i \in \{1, -1\}$. Want to learn a classifier $\omega = f(x, \alpha)$: learning structures from data.

32 Support Vector Machine Assume we are choosing from the set of hyperplanes; then we have a vector w and a scalar b such that $f(x, \{w, b\}) = \mathrm{sign}(w^T x + b)$ and $\omega_i (w^T x_i + b) \ge 0$ for all i.


34 Error Functions We can learn $f(x, \alpha)$ by minimizing errors on the training data: $R_{\mathrm{training}}(\alpha) = \frac{1}{m} \sum_{i=1}^{m} l(f(x_i, \alpha), \omega_i)$ = training error, where l is the loss function: $l(\omega, \omega') = 1$ if $\omega \neq \omega'$ and 0 otherwise. Based on training data alone, generalization is not guaranteed; a restriction must be placed on the functions that we allow.

35 VC Dimension Vapnik & Chervonenkis: an upper bound on the true error is given by the training error plus an additional term, $R(\alpha) < R_{\mathrm{training}}(\alpha) + \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log\frac{\eta}{4}}{m}}$, where h is the VC dimension of the set of functions parameterized by $\alpha$.

36 VC Dimension The VC dimension of a set of functions is a measure of their capacity. If you can describe a lot of different phenomena with a set of functions, then the value of h is large. The VC dimension is the maximum number of points that can be separated in all possible ways by that set of functions. For hyperplanes in $\mathbb{R}^n$, the VC dimension is n+1.

37 VC Dimension For hyperplanes in $\mathbb{R}^n$, the VC dimension is n+1.

38 VC Dimension Simplification of the bound: test error ≤ training error + complexity of the set of functions. The complexity term is often called a regularizer. If a high-capacity function set is used (it explains a lot), a low training error is obtained, but there might be a problem of overfitting. If a very simple set of functions is used, the training error won't be low.

39 Capacity of Functions (images from B. Schoelkopf)

40 Capacity of Functions

41 Capacity of Hyperplanes Vapnik & Chervonenkis showed the following. Consider hyperplanes $(w^T x) = 0$ where w is normalized with respect to a set of points $X^*$ such that $\min_i |w^T x_i| = 1$. The set of decision functions $f_w(x) = \mathrm{sign}(w^T x)$ defined on $X^*$ such that $\|w\| \le A$ has a VC dimension satisfying $h \le R^2 A^2$, where R is the radius of the smallest sphere around the origin containing $X^*$. Minimizing $\|w\|^2$ gives low capacity; minimizing $\|w\|^2$ is equivalent to a large-margin classifier.

42 Optimal Separating Hyperplane

43 Optimal Separating Hyperplane To obtain a large margin classifier

44 Optimal Separating Hyperplane The distance from a point $x_i$ to the hyperplane (w, b) is given by $\frac{|w^T x_i + b|}{\|w\|}$. w is normal to the hyperplane; $\frac{|b|}{\|w\|}$ is the perpendicular distance from the hyperplane to the origin. Minimizing $\|w\|^2$ gives low capacity, which is equivalent to maximizing the minimal distance to the hyperplane: minimize $\frac{1}{m}\sum_{i=1}^{m} l(w^T x_i + b, \omega_i) + \|w\|^2$ subject to $\min_i |w^T x_i + b| = 1$.

45 Optimal Separating Hyperplane Assume we can separate the data perfectly: $\min_{w, b} w^T w$ subject to $\omega_i (w^T x_i + b) \ge 1, \forall i$ (equivalent to $w^T x_i + b \ge 1$ if $\omega_i = +1$ and $w^T x_i + b \le -1$ if $\omega_i = -1$). Convex optimization, no local minima; solved by quadratic programming.

46 SVM: Non-Separable Case To deal with non-separable cases we need to relax the constraints, but only when necessary. Introduce positive slack variables $\xi_i$, $i = 1, \ldots, l$, in the constraints: $w^T x_i + b \ge 1 - \xi_i$ if $\omega_i = +1$; $w^T x_i + b \le -1 + \xi_i$ if $\omega_i = -1$; $\xi_i \ge 0, \forall i$. For an error to occur, $\xi_i$ must exceed 1, so $\sum_i \xi_i$ is an upper bound on the number of training errors.

47 SVM: Non-Separable Case To deal with non-separable cases: $\min_{w, b}\; w^T w + C \sum_{i=1}^{m} \xi_i$ subject to $\omega_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. This is the same as the original objective $\frac{1}{m}\sum_{i=1}^{m} l(w^T x_i + b, \omega_i) + \|w\|^2$, where l is not the 0-1 loss but the hinge loss. Still a convex optimization problem.

48 SVM: Non-Separable Case

49 Support Vector Machine After finding the optimal hyperplane (w, b), new data are classified as $\omega = 1$ if $w^T x + b > 0$, and $\omega = -1$ otherwise.

50 SVM - Summary Decision function $f(x) = w^T x + b$. Primal formulation: $\frac{1}{2}\|w\|^2 + C \sum_i H(\omega_i f(x_i))$, with the hinge loss $H(z) = \max(0, 1 - z)$ used to count the errors. The first term enforces maximum margin (minimum capacity); the second minimizes the training error.
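
The slides solve this primal problem by quadratic programming; purely as an illustration of the same objective (½‖w‖² plus C times the summed hinge loss), here is a plain subgradient-descent sketch on the toy points of the linear example that follows:

```python
import numpy as np

# Illustrative only: minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
# by plain subgradient descent on (w, b).
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                      # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],
              [1, 0], [0, 1], [0, -1], [-1, 0]], dtype=float)
y = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
w, b = train_linear_svm(X, y)
print(w, b)   # hovers roughly around w ~ (1, 0), b ~ -2 for this data
```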

51 SVM Nonlinear Cases Map the data into a feature space H that includes nonlinear features: $x \to \Phi(x)$, $\Phi: \mathbb{R}^d \to H$. Then construct a hyperplane in that space: $f(x) = w^T \Phi(x) + b$. Example: polynomial mapping.

52 SVM Nonlinear Cases Choosing a good mapping $\Phi$ can improve the results (prior knowledge + the right complexity of the function class). The dimensionality of $\Phi(x)$ can be very large, making w hard to represent explicitly in memory and hard for the QP to solve. If $w = \sum_{i=1}^{N_s} \alpha_i \Phi(s_i)$ for some variables $\alpha_i$ and support vectors $s_i$, then $f(x) = \sum_{i=1}^{N_s} \alpha_i \Phi(s_i) \cdot \Phi(x) + b$, and we can directly optimize the $\alpha_i$ through a kernel function K.

53 Simple Examples Linear example: the case when $\Phi$ is trivial. Suppose we have the following +1-labeled data points in $\mathbb{R}^2$: (3,1), (3,-1), (6,1), (6,-1); and the following -1-labeled data points in $\mathbb{R}^2$: (1,0), (0,1), (0,-1), (-1,0).

54 Examples Linear Cases

55 Examples Linear Cases Linearly separable. Three obvious support vectors: $s_1 = (1, 0)$, $s_2 = (3, 1)$, $s_3 = (3, -1)$. Augment the vectors with a 1 as bias input: $\tilde{s}_1 = (1, 0, 1)$, $\tilde{s}_2 = (3, 1, 1)$, $\tilde{s}_3 = (3, -1, 1)$.

56 Examples Linear Cases To find values of $\alpha_i$ such that $\alpha_1 \Phi(s_1) \cdot \Phi(s_1) + \alpha_2 \Phi(s_2) \cdot \Phi(s_1) + \alpha_3 \Phi(s_3) \cdot \Phi(s_1) = -1$, $\alpha_1 \Phi(s_1) \cdot \Phi(s_2) + \alpha_2 \Phi(s_2) \cdot \Phi(s_2) + \alpha_3 \Phi(s_3) \cdot \Phi(s_2) = +1$, $\alpha_1 \Phi(s_1) \cdot \Phi(s_3) + \alpha_2 \Phi(s_2) \cdot \Phi(s_3) + \alpha_3 \Phi(s_3) \cdot \Phi(s_3) = +1$. Let $\Phi(\cdot) = I$ (the identity): $\alpha_1 \tilde{s}_1 \cdot \tilde{s}_1 + \alpha_2 \tilde{s}_2 \cdot \tilde{s}_1 + \alpha_3 \tilde{s}_3 \cdot \tilde{s}_1 = -1$, $\alpha_1 \tilde{s}_1 \cdot \tilde{s}_2 + \alpha_2 \tilde{s}_2 \cdot \tilde{s}_2 + \alpha_3 \tilde{s}_3 \cdot \tilde{s}_2 = +1$, $\alpha_1 \tilde{s}_1 \cdot \tilde{s}_3 + \alpha_2 \tilde{s}_2 \cdot \tilde{s}_3 + \alpha_3 \tilde{s}_3 \cdot \tilde{s}_3 = +1$.

57 Examples Linear Cases $\alpha_1 = -3.5$, $\alpha_2 = 0.75$, $\alpha_3 = 0.75$. $\tilde{w} = \sum_i \alpha_i \tilde{s}_i = -3.5\,(1, 0, 1) + 0.75\,(3, 1, 1) + 0.75\,(3, -1, 1) = (1, 0, -2)$, i.e., the hyperplane $y = w^T x + b$ with $w = (1, 0)^T$ and $b = -2$.
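
A short numpy check of this worked example: build the Gram matrix of the augmented support vectors, solve the 3x3 system, and recover w and b:

```python
import numpy as np

# Reproduces the worked linear example above.
s = np.array([[1, 0, 1],      # s~1
              [3, 1, 1],      # s~2
              [3, -1, 1]],    # s~3
             dtype=float)
targets = np.array([-1.0, 1.0, 1.0])

G = s @ s.T                   # Gram matrix of dot products s~i . s~j
alphas = np.linalg.solve(G, targets)
w_tilde = alphas @ s          # sum_i alpha_i * s~i
print(alphas)                 # [-3.5, 0.75, 0.75]
print(w_tilde)                # [ 1.  0. -2.]  ->  w = (1, 0), b = -2
```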

58 Examples Nonlinear Example Suppose we have the following +1-labeled data points in $\mathbb{R}^2$: (2,2), (2,-2), (-2,-2), (-2,2); and the following -1-labeled data points in $\mathbb{R}^2$: (1,1), (1,-1), (-1,-1), (-1,1).

59 Examples Nonlinear Cases

60 SVM Nonlinear Cases Define $\Phi(x_1, x_2) = \left(4 - x_2 + |x_1 - x_2|,\; 4 - x_1 + |x_1 - x_2|\right)$ if $\sqrt{x_1^2 + x_2^2} > 2$, and $\Phi(x_1, x_2) = (x_1, x_2)$ otherwise.

61 Examples Nonlinear Cases Two support vectors (in input space): $s_1 = (1, 1)$, $s_2 = (2, 2)$. Feature space: +: {(2,2), (6,2), (6,6), (2,6)}; -: {(1,1), (1,-1), (-1,-1), (-1,1)}.

62 Examples Nonlinear Cases To find values of $\alpha_i$ such that $\alpha_1 \Phi(s_1) \cdot \Phi(s_1) + \alpha_2 \Phi(s_2) \cdot \Phi(s_1) = -1$ and $\alpha_1 \Phi(s_1) \cdot \Phi(s_2) + \alpha_2 \Phi(s_2) \cdot \Phi(s_2) = +1$. Substituting the augmented vectors: $\alpha_1 \tilde{s}_1 \cdot \tilde{s}_1 + \alpha_2 \tilde{s}_2 \cdot \tilde{s}_1 = -1$, $\alpha_1 \tilde{s}_1 \cdot \tilde{s}_2 + \alpha_2 \tilde{s}_2 \cdot \tilde{s}_2 = +1$.

63 Examples Nonlinear Cases $\alpha_1 = -7$, $\alpha_2 = 4$. $\tilde{w} = \sum_i \alpha_i \tilde{s}_i = -7\,(1, 1, 1) + 4\,(2, 2, 1) = (1, 1, -3)$, i.e., the hyperplane $y = w^T x + b$ with $w = (1, 1)^T$ and $b = -3$.

64 Example: Nonlinear Cases Now we use this SVM model to classify a given data point x: $f(x) = \sigma\!\left(\sum_{i=1}^{N_s} \alpha_i\, \tilde{\Phi}(s_i) \cdot \tilde{\Phi}(x)\right)$, where $\sigma$ returns the sign. Example: given $x = (4, 5)^T$, $f\!\begin{pmatrix}4\\5\end{pmatrix} = \sigma\!\left(-7\,\tilde{\Phi}\!\begin{pmatrix}1\\1\end{pmatrix} \cdot \tilde{\Phi}\!\begin{pmatrix}4\\5\end{pmatrix} + 4\,\tilde{\Phi}\!\begin{pmatrix}2\\2\end{pmatrix} \cdot \tilde{\Phi}\!\begin{pmatrix}4\\5\end{pmatrix}\right) = \sigma(-2)$.

65 Example: Nonlinear Cases Is it a reasonable classification? It does not generalize well. The likely culprit is the choice of $\Phi$. Consider an alternative mapping (HW): $\Phi(x_1, x_2) = \left(x_1,\; x_2,\; \frac{x_1^2 + x_2^2 - 5}{3}\right)^T$.

66 Nonlinear SVM The Kernel Trick Decision: $f(x) = \sum_{i=1}^{N_s} \alpha_i \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N_s} \alpha_i K(s_i, x) + b$. Optimization / learning: $\min\left\{\frac{1}{2}\left\|\sum_{i=1}^{N_s} \alpha_i \Phi(s_i)\right\|^2 + C \sum_i H(\omega_i f(x_i))\right\}$. The kernel function K is used to make the nonlinear feature map implicit. Polynomial kernel: $K(x, x') = (x \cdot x')^d$ or $K(x, x') = (x \cdot x' + 1)^d$. RBF kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$.
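
A small sketch of the kernelized decision function $f(x) = \sum_i \alpha_i K(s_i, x) + b$ with polynomial and RBF kernels; the support vectors, coefficients, and bias below are hypothetical values just to exercise the code:

```python
import numpy as np

# Kernelized decision function: K(s_i, x) replaces the explicit feature map Phi.
def poly_kernel(x, z, d=2):
    return np.dot(x, z) ** d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision(x, support_vectors, alphas, b, kernel):
    """f(x) = sum_i alpha_i * K(s_i, x) + b"""
    return sum(a * kernel(s, x) for a, s in zip(alphas, support_vectors)) + b

# Hypothetical support vectors and coefficients, not fitted to any data:
S = [np.array([1.0, 1.0]), np.array([2.0, 2.0])]
alphas, b = [-1.0, 0.5], 0.1
print(decision(np.array([1.5, 0.5]), S, alphas, b, rbf_kernel))
print(decision(np.array([1.5, 0.5]), S, alphas, b, poly_kernel))
```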

67 Polynomial SVMs The kernel $K(x, x') = (x \cdot x')^d$ gives the same result as the explicit mapping plus dot product we described earlier: $\Phi((x_1, x_2)) \cdot \Phi((x'_1, x'_2)) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \cdot (x_1'^2, \sqrt{2}\,x'_1 x'_2, x_2'^2) = x_1^2 x_1'^2 + 2 x_1 x'_1 x_2 x'_2 + x_2^2 x_2'^2$, and $K(x, x') = (x \cdot x')^2 = (x_1 x'_1 + x_2 x'_2)^2 = x_1^2 x_1'^2 + 2 x_1 x'_1 x_2 x'_2 + x_2^2 x_2'^2$. Even if d is large, the kernel still only requires n multiplications, whereas the explicit representation may not fit in memory. Decision: $f(x) = \sum_{i=1}^{N_s} \alpha_i (s_i \cdot x)^d + b$.

68 RBF-SVMs The RBF (radial basis function) kernel is one of the most popular kernel functions. It maps the data into an infinite-dimensional Hilbert space. It adds a bump around each data point: $K(x, x') = \exp(-\gamma\|x - x'\|^2)$. Decision: $f(x) = \sum_{i=1}^{m} \alpha_i \exp(-\gamma\|x_i - x\|^2) + b$. Using this, one can get state-of-the-art results.

69 Artificial Neural Network Build a classifier function d(x) using simple but general parts. Each part is a model of a neuron, having a number of parameters. By learning the parameters, one may build a large class of functions. Try to find the parameters that make the classification as good as possible.

70 What Does an ANN Rely On? Our crude way to simulate the brain electronically. Multiple inputs; weights can be negative to represent excitatory or inhibitory influences. Output: activation.

71 Artificial Neural Network Mathematically, an artificial neuron can be modeled as $d(x) = f(w^T x + w_0)$, where f is a non-linear function (transfer function), e.g. a threshold $f(y) = 0$ for $y < 0$ and $1$ for $y \ge 0$, a sigmoid $f(y) = \frac{1}{1 + e^{-cy}}$, or $f(y) = \frac{1}{\pi}\arctan(y) + \frac{1}{2}$.

72 Artificial Neural Network Put all the neurons into a network Arbitrary number of layers It is common to use two layers Arbitrary number of ANs in each layer Feed-forward ANN Signals travel one way only: from input to output Feedback ANN Signals travel in both directions through loops in the network Very powerful and can get extremely complicated

73 Feed-Forward Neural Network Common ANNs have 3 layers. Input layer: x. Hidden layer: z. Output layer: y. $z_i = f(w_i^T x + w_{i,0})$, $y_i = d_i(x) = f(v_i^T z + v_{i,0})$.
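
A minimal forward-pass sketch of this input-hidden-output network with a sigmoid transfer function; the layer sizes and random weights are placeholders:

```python
import numpy as np

# Forward pass: z_i = f(w_i^T x + w_{i,0}), y_i = f(v_i^T z + v_{i,0}).
def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(0)
n, k, m = 4, 3, 2                 # input, hidden, output sizes
W = rng.normal(size=(k, n))       # hidden weights w_i
w0 = rng.normal(size=k)           # hidden biases w_{i,0}
V = rng.normal(size=(m, k))       # output weights v_i
v0 = rng.normal(size=m)           # output biases v_{i,0}

x = rng.normal(size=n)
z = sigmoid(W @ x + w0)           # hidden layer
y = sigmoid(V @ z + v0)           # output layer
print(y)
```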

74 The Learning Process Each neural network possesses knowledge contained in the values of the connection weights. Modifying the knowledge stored in the network as a function of experience implies a learning rule for changing the values of the weights. Fixed networks: the weights cannot be changed and are fixed a priori according to the problem to be solved. Adaptive networks: able to change their weights. Supervised learning (e.g., LMS convergence): minimize over (v, w) the discrepancy between the network outputs $d_i(x_j)$ and the desired outputs $\omega_{i,j}$. Unsupervised learning (self-organisation).

75 A Non-Medical Example Assume we want an ANN to recognize hand-written digits on an array of 256 sensors: 256 input units, 10 output units, plus hidden units. To train: present an image and compare the actual activity of the 10 output units with the desired activity; calculate the error (square of the difference); change the weights to reduce the error. How to calculate the error derivative for each weight, in order to change the weight by an amount proportional to the rate at which the error changes as the weight is changed? The back-propagation algorithm.

76 Back-Propagation Algorithm To adjust the weights of each unit such that the error between the desired output and the actual output is reduced. Learning using the method of gradient descent: compute the gradient of the error function, i.e., the error derivatives with respect to the weights, EW (how the error changes as each weight is increased or decreased slightly). We must guarantee the continuity and differentiability of the error function. Activation function: e.g. the sigmoid function $f(x) = \frac{1}{1 + e^{-cx}}$, whose derivative (for $c = 1$) is $\frac{df(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = f(x)(1 - f(x))$.

77 Back-Propagation Algorithm Activation function: e.g. the sigmoid function $f(x) = \frac{1}{1 + \exp\!\left(-\left(\sum_{i=1}^{n} w_i x_i + w_0\right)\right)}$. It guarantees the continuity and differentiability of the error function and is valued between 0 and 1. Local minima could occur.

78 Back-Propagation Algorithm Find a local minimum of the error function $E = \frac{1}{2}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$. The ANN is initialized with randomly chosen weights. The gradient of the error function, $\nabla E = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_l}\right)$ (assuming l weights in the network), is computed recursively and used to correct the weights: $\Delta w_i = -\gamma \frac{\partial E}{\partial w_i}$, $i = 1, \ldots, l$, where $\gamma$ is the learning constant that defines the step length of each iteration in the negative gradient direction.

79 Back-Propagation Algorithm Now let's forget about training sets and learning. Our objective is to find a method for efficiently calculating the gradient of the network function with respect to the weights of the network. Because our network is a complex chain of function compositions (addition, weighted edges, nonlinear activations), we expect the chain rule of calculus to play a major role in finding the gradient. Let's start with a 1-D network.

80 B-Diagram Feed-forward step: information comes from the left, and each unit evaluates its function f in its right side and its derivative f' in its left side. Both results are stored in the unit; only the one on the right side is transmitted to the units connected to the right. Backpropagation step: run the whole network backwards, using the stored results (the derivatives of the functions). A single computing unit can be separated into an addition unit and an activation unit.

81 Three Basic Cases Function composition. Forward: as in the feed-forward step. Backward: the input from the right of the network is the constant 1; incoming information is multiplied by the value stored in the unit's left side; the result (the traversing value) is the derivative of the function composition. Any sequence of function compositions can be evaluated in this way and its derivative obtained in the backpropagation step: the network is used backwards with the input 1, and at each node the product with the value stored in the left side is computed.

82 Three Basic Cases Function addition Forward: Backward: All incoming edges to a unit fan out the traversing value at this node and distribute it to the connected units to the left When two right-to-left paths meet, the computed traversing values are added

83 Three Basic Cases Weighted edges Forward: Backward

84 Steps of the Backpropagation Algorithm Consider a network with a single real input x and network function F; the derivative F'(x) is computed in two phases. Feedforward: the input x is fed into the network; the primitive functions at the nodes and their derivatives are evaluated and stored at each node. Backpropagation: the constant 1 is fed into the output unit and the network is run backwards; incoming information to a node is added, and the result is multiplied by the value stored in the left part of the unit; the result is transmitted to the left of the unit. The result collected at the input unit is F'(x). One can prove that this works in arbitrary feed-forward networks with differentiable activation functions at the nodes.
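
A toy illustration of these two phases for a 1-D chain of function compositions (the specific functions f1, f2, f3 are arbitrary examples):

```python
import math

# Two phases for a 1-D chain F(x) = f3(f2(f1(x))): the feed-forward pass stores
# each node's output and derivative; the backward pass feeds in 1 and multiplies
# by the stored derivatives.
nodes = [
    (lambda u: u ** 2, lambda u: 2 * u),   # f1
    (math.sin,         math.cos),          # f2
    (math.exp,         math.exp),          # f3
]

def forward_backward(x):
    stored = []                       # (output, derivative) per node
    value = x
    for f, df in nodes:               # feed-forward: evaluate and store
        stored.append((f(value), df(value)))
        value = stored[-1][0]
    grad = 1.0                        # backpropagation: start with the constant 1
    for _, deriv in reversed(stored):
        grad *= deriv                 # multiply by each stored derivative
    return value, grad

val, grad = forward_backward(0.5)
# Check against the analytic chain rule: d/dx exp(sin(x^2)) = exp(sin(x^2)) * cos(x^2) * 2x
print(grad, math.exp(math.sin(0.25)) * math.cos(0.25) * 1.0)
```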

85 Steps of the Backpropagation Algorithm $F(x) = \varphi(s)$ with $s = w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)$, and $F'(x) = \varphi'(s)\,(w_1 F'_1(x) + w_2 F'_2(x) + \cdots + w_m F'_m(x))$.

86 Generalization to More Inputs The feed-forward step remains unchanged & all left side slots of the units are filled as usual In the backpropagation we can identify two subnetworks

87 Learning with Backpropagation The feed-forward step is computed in the usual way, but we also store the output of each unit in its right side. We then perform the backpropagation in the network. If we fix our attention on one of the weights, say $w_{ij}$, whose associated edge points from the i-th to the j-th node in the network, the weight can be treated as an input channel into the subnetwork made of all paths starting at $w_{ij}$ and ending in the single output unit of the network. The information fed into the subnetwork in the feed-forward step was $o_i w_{ij}$ ($o_i$ being the stored output of unit i). The backpropagation computes the gradient of the error E with respect to this input: $\frac{\partial E}{\partial w_{ij}} = o_i \frac{\partial E}{\partial (o_i w_{ij})}$, the usual result of backpropagation at one node with regard to one input.

88 Learning with Backpropagation The backpropagation is performed in the usual way. All subnetworks defined by each weight of the network can be handled simultaneously, but we additionally store at each node i: the output $o_i$ of the node in the feed-forward step, and the cumulative result of the backward computation up to this node (the backpropagated error $\delta_j$), so that $\frac{\partial E}{\partial w_{ij}} = o_i \delta_j$. Once all partial derivatives are computed, we can perform gradient descent by adding to each weight $\Delta w_{ij} = -\gamma\, o_i \delta_j$.

89 Layered Networks Notation: n input, k hidden, m output units; weight matrices $W_1, W_2$. The excitation (net input) of the j-th hidden unit: $\mathrm{net}_j^{(1)} = \sum_{i=1}^{n+1} w_{ij}^{(1)} \hat{o}_i$. The output of this unit: $o_j^{(1)} = s\!\left(\sum_{i=1}^{n+1} w_{ij}^{(1)} \hat{o}_i\right)$. In matrix form: $o^{(1)} = s(\hat{o} W_1)$, $o^{(2)} = s(\hat{o}^{(1)} W_2)$.

90 Layered Networks Let's consider a single input-output pair (o, t), i.e., one training example. Backpropagation proceeds in four steps: feed-forward computation, backpropagation to the output layer, backpropagation to the hidden layer, and weight updates. It stops when the value of the error function is sufficiently small. (Extended network for computing the error.)

91 Layered Networks Feed-forward computation: the vector o is presented to the network; the vectors $o^{(1)}$ and $o^{(2)}$ are computed and stored. The derivatives of the activation functions are also stored at each unit. Backpropagation to the output layer: we are interested in $\partial E / \partial w_{ij}^{(2)}$. (Extended network for computing the error.)

92 Layered Networks Backpropagation to the output layer: interested in $\partial E / \partial w_{ij}^{(2)}$. Backpropagated error: $\delta_j^{(2)} = o_j^{(2)}(1 - o_j^{(2)})(o_j^{(2)} - t_j)$. Partial derivative: $\partial E / \partial w_{ij}^{(2)} = o_i^{(1)} \delta_j^{(2)} = [o_j^{(2)}(1 - o_j^{(2)})(o_j^{(2)} - t_j)]\, o_i^{(1)}$.

93 Layered Networks Backpropagation to the hidden layer: interested in $\partial E / \partial w_{ij}^{(1)}$. Backpropagated error: $\delta_j^{(1)} = o_j^{(1)}(1 - o_j^{(1)}) \sum_{q=1}^{m} w_{jq}^{(2)} \delta_q^{(2)}$. Partial derivative: $\partial E / \partial w_{ij}^{(1)} = o_i \delta_j^{(1)}$.

94 Layered Networks Weight updates. Hidden-output layer: $\Delta w_{ij}^{(2)} = -\gamma\, o_i^{(1)} \delta_j^{(2)}$, $i = 1, \ldots, k+1$; $j = 1, \ldots, m$. Input-hidden layer: $\Delta w_{ij}^{(1)} = -\gamma\, o_i \delta_j^{(1)}$, $i = 1, \ldots, n+1$; $j = 1, \ldots, k$, with $o_{n+1} = o_{k+1}^{(1)} = 1$. Make the corrections to the weights only after the backpropagated error has been computed for all units in the network! Otherwise the corrections become intertwined with the backpropagation, and the computed corrections no longer correspond to the negative gradient direction.

95 More than One Training Set If we have p training pairs: Batch / offline updates compute $\Delta_1 w_{ij}^{(1)}, \Delta_2 w_{ij}^{(1)}, \ldots, \Delta_p w_{ij}^{(1)}$ and apply the necessary update $\Delta w_{ij}^{(1)} = \Delta_1 w_{ij}^{(1)} + \Delta_2 w_{ij}^{(1)} + \cdots + \Delta_p w_{ij}^{(1)}$. Online / sequential updates: the corrections do not exactly follow the negative gradient direction, but if the training pairs are selected randomly, the search direction oscillates around the exact gradient direction and, on average, the algorithm implements a form of descent in the error function. Adding some noise to the gradient can help to avoid falling into shallow local minima, and it is very expensive to compute the exact gradient direction when the training set is large.

96 Backpropagation in Matrix Forms Input-output: $o^{(1)} = s(\hat{o} W_1)$, $o^{(2)} = s(\hat{o}^{(1)} W_2)$. The derivatives stored in the feed-forward step form the diagonal matrices $D_2 = \mathrm{diag}\!\left(o_1^{(2)}(1 - o_1^{(2)}), \ldots, o_m^{(2)}(1 - o_m^{(2)})\right)$ and $D_1 = \mathrm{diag}\!\left(o_1^{(1)}(1 - o_1^{(1)}), \ldots, o_k^{(1)}(1 - o_k^{(1)})\right)$. The stored derivatives of the quadratic error: $e = \left(o_1^{(2)} - t_1,\; o_2^{(2)} - t_2,\; \ldots,\; o_m^{(2)} - t_m\right)^T$.

97 Backpropagation in Matrix Forms The m-dimensional vector of the backpropagated error up to the output units: $\delta^{(2)} = D_2 e$. The k-dimensional vector of the backpropagated error up to the hidden layer: $\delta^{(1)} = D_1 W_2 \delta^{(2)}$. The corrections for the two weight matrices: $\Delta W_2^T = -\gamma\, \delta^{(2)} \hat{o}^{(1)}$, $\Delta W_1^T = -\gamma\, \delta^{(1)} \hat{o}$. We can generalize this to l layers: $\delta^{(l)} = D_l e$ and $\delta^{(i)} = D_i W_{i+1} \delta^{(i+1)}$ for $i = 1, \ldots, l-1$, i.e., $\delta^{(i)} = D_i W_{i+1} \cdots W_{l-1} D_{l-1} W_l D_l e$.
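
A minimal numpy sketch of these matrix-form updates for a single input-output pair; the layer sizes, random initialization, and learning constant are arbitrary choices:

```python
import numpy as np

# Matrix-form backpropagation for one pair (o, t): delta2 = D2 e,
# delta1 = D1 W2 delta2, followed by the weight corrections.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, k, m = 3, 4, 2                             # input, hidden, output units
W1 = rng.normal(scale=0.5, size=(n + 1, k))   # includes the bias row
W2 = rng.normal(scale=0.5, size=(k + 1, m))
o = rng.normal(size=n)                        # one training input
t = np.array([1.0, 0.0])                      # its target
gamma = 0.5                                   # learning constant

for _ in range(2000):
    o_hat = np.append(o, 1.0)                 # augment input with bias
    o1 = sigmoid(o_hat @ W1)                  # hidden layer output
    o1_hat = np.append(o1, 1.0)
    o2 = sigmoid(o1_hat @ W2)                 # network output

    e = o2 - t                                # stored error derivatives
    D2 = np.diag(o2 * (1 - o2))
    D1 = np.diag(o1 * (1 - o1))
    delta2 = D2 @ e
    delta1 = D1 @ (W2[:-1, :] @ delta2)       # drop the bias row of W2

    W2 -= gamma * np.outer(o1_hat, delta2)
    W1 -= gamma * np.outer(o_hat, delta1)

print(sigmoid(np.append(sigmoid(np.append(o, 1.0) @ W1), 1.0) @ W2))  # ~ t
```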

98 Back-Propagation Summary First, compute EA, the rate at which the error changes as the activity level of a unit is changed. Output layer: the difference between the actual and desired outputs. Hidden layer: identify all the weights between that hidden unit and the output units it connects to, multiply those weights by the EAs of those output units, and add the products. Other layers: computed in a similar fashion, from layer to layer, in a direction opposite to the way activities propagate through the network (hence the name back-propagation). Second, EW for each connection of the unit is the product of the EA and the activity through the incoming connection.

99 ANN in Medicine Modeling parts of the human body and recognizing diseases from various data (such as images). No need to provide a specific algorithm on how to identify the disease: the network learns by example, so the details of how to recognize the disease are not needed. Need: a set of samples that is representative of all the variations of the disease, selected very carefully if the network is to perform reliably and efficiently.

100 Diagnosis of Interstitial Lung Diseases by ANN

101 ANN Inputs: Clinical Parameters Age Sex Duration of symptoms Severity of symptoms Temperature Immune status Underlying malignancy Smoking Dust exposure Drug treatment

102 ANN Inputs: Radiological Findings Location of infiltrate Details of infiltrate Homogeneity Fineness / coarseness Nodularity Septal lines Honeycombing Loss of lung volume Related thoracic abnormality Lymphadenopathy Pleural effusion Heart size

103 ANN Outputs: Interstitial Lung Diseases Sarcoidosis Miliary tuberculosis Lymphangitic carcinomatosis Interstitial pulmonary edema Silicosis Scleroderma Pneumocystis carinii pneumonia (PCP) Eosinophilic granuloma (EG) Idiopathic pulmonary fibrosis (IPF) Viral pneumonia Pulmonary drug toxicity

104 Case I: Idiopathic Pulmonary Fibrosis ANN output (IPF)

105 Case II: Miliary Tuberculosis ANN output

106 Comparison of Diagnosis

107 ANN in Medicine It can mimic the relationships among physiological activities at different physical activity levels. It can be adapted to an individual without the supervision of an expert. Potentially harmful medical conditions can be detected at an early stage. The ability to provide sensor/data fusion: combining values from several different sources and learning complex relationships between these individual values that would otherwise be lost if the values were analyzed individually.

108 Reduction: Feature Selection In medical images, there is often too much information For practical reasons one has to select a subset of these features Most commonly used methods Singular value decomposition (SVD) Intuition and knowledge of the problem

109 Invariants Sometimes there are great variations in the feature vectors x that depend on object position, lighting, etc. Definition of invariants: let $x_1, x_2, \ldots, x_n$ be a set of points and G a class of transformations. A function $f: (x_1, x_2, \ldots, x_n) \to \mathbb{R}$ is an invariant under the transformation group G if $f(x_1, x_2, \ldots, x_n) = f(T(x_1), T(x_2), \ldots, T(x_n))$ for every transformation T in G.

110 Invariants For example: G = Euclidean transformations (rotation + translation) → Euclidean invariant. G = similarity transformations (rotation + translation + global scale) → similarity invariant. G = affine transformations → affine invariant. G = projective transformations → projective invariant.

111 Principal Component Analysis Assume we have a number of data points (as samples of a random variable) $(x_1, x_2, \ldots, x_n)$, $x_i \in \mathbb{R}^d$. Statistics: mean $\mu = \frac{1}{n}\sum_i x_i$; covariance matrix $P = \frac{1}{n-1}\sum_i (x_i - \mu)(x_i - \mu)^T$.

112 PCA Assume we want to find a feature v(x) that projects x onto one dimension, e.g., $v(x) = v^T(x - \mu)$. A good feature should describe as much of the variance as possible. The variance of v(x) is given by $V(v) = v^T P v$. Maximizing it under the constraint $v^T v = 1$ is the same thing as finding the eigenvector with the largest eigenvalue.

113 PCA Additional features are obtained by taking the eigenvector with the second-largest eigenvalue, etc. If we want to project the features x of dimension n onto a subspace of dimension k (k < n) so that as much of the variance as possible is kept, we should choose $v = (v_1^T x, v_2^T x, \ldots, v_k^T x)^T$, where the $v_i$ are the eigenvectors of the k largest eigenvalues of the covariance matrix of x. The features also become uncorrelated.
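
A minimal numpy sketch of PCA as described above: eigendecomposition of the covariance matrix and projection onto the k leading eigenvectors (the synthetic data are just for illustration):

```python
import numpy as np

# Project d-dimensional data onto the k eigenvectors of the covariance matrix
# with the largest eigenvalues.
def pca_project(X, k):
    mu = X.mean(axis=0)
    P = np.cov(X, rowvar=False)               # (1/(n-1)) sum (x-mu)(x-mu)^T
    eigvals, eigvecs = np.linalg.eigh(P)      # returned in ascending order
    V = eigvecs[:, ::-1][:, :k]               # k leading eigenvectors as columns
    return (X - mu) @ V, V, eigvals[::-1][:k]

rng = np.random.default_rng(0)
# Correlated 3-D data whose variance lies mostly along one direction
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.1 * rng.normal(size=(200, 3))
Y, V, lam = pca_project(X, k=2)
print(lam)          # the first eigenvalue dominates
print(Y.shape)      # (200, 2)
```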


More information

y(x n, w) t n 2. (1)

y(x n, w) t n 2. (1) Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 29 April, SoSe 2015 Support Vector Machines (SVMs) 1. One of

More information

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask Machine Learning and Data Mining Support Vector Machines Kalev Kask Linear classifiers Which decision boundary is better? Both have zero training error (perfect training accuracy) But, one of them seems

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Kernel Methods. Charles Elkan October 17, 2007

Kernel Methods. Charles Elkan October 17, 2007 Kernel Methods Charles Elkan elkan@cs.ucsd.edu October 17, 2007 Remember the xor example of a classification problem that is not linearly separable. If we map every example into a new representation, then

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information