Deep Learning, Data Irregularities and Beyond


1 Deep Learning, Data Irregularities and Beyond Dr. Swagatam Das Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, India.

2 Road Map Basic concepts Perspectives in Deep Learning Data Irregularities: what does that mean?!! Imbalanced Classification Problems Small Disjuncts, Class Skew, and Feature-based Irregularity Problems A new dissimilarity measure for missing features Miles to go before we sleep...

3 Classifier: The Supervised Learner. Data: a set of data records (also called examples, instances or cases) described by k attributes A_1, A_2, ..., A_k, and a class: each example is labelled with a pre-defined class (labels act like teachers!). Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

4 An example: data (loan application) Approved or not 4

5 An example: the learning task. Learn a classification model from the data. Use the model to classify future loan applications into Yes (approved) and No (not approved). What is the class for the following case/instance?

6 6

7 7

8 Deep Learning: Acknowledging some resources used here. Neural Networks and Deep Learning, written by Michael Nielsen. Deep Learning, written by Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning Tutorial, Hung-yi Lee, NTU.

9 Deep Learning, some definitions: algorithms that use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation, each successive layer using the output of the previous layer as input; that learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) manners; that learn multiple levels of representations corresponding to different levels of abstraction, the levels forming a hierarchy of concepts; and that use some form of gradient descent for training via backpropagation.

10 Developments in Neural Learning Systems 10

11 Deep Learning - Outline Part I: Introduction of Deep Learning Part II: Why Deep? Part III: Tips for Training Deep Neural Network Part IV: Neural Network with Memory

12 Part I: Introduction of Deep Learning What people already knew in 1980s

13 Example Application: Handwriting Digit Recognition. An image of a handwritten digit goes into the machine, which outputs the digit (here, 2).

14 Handwriting Digit Recognition. Input: x_1, x_2, ..., x_256, the 16 x 16 = 256 pixels (ink = 1, no ink = 0). Output: y_1, ..., y_10, where y_1 is the confidence that the digit is 1, y_2 that it is 2, ..., y_10 that it is 0; each dimension represents the confidence of one digit. If the image is a 2, y_2 should be the largest.

15 Example Application: Handwriting Digit Recognition. The machine is a function from the input pixels x_1, x_2, ..., x_256 to the outputs y_1, y_2, ..., y_10.

16 Element of a Neural Network: the neuron. Inputs a_1, a_2, ..., a_K with weights w_1, w_2, ..., w_K and bias b give z = a_1 w_1 + a_2 w_2 + ... + a_K w_K + b; the output is a = σ(z), where σ is the activation function.

17 Neural Network: neurons arranged in layers. The input x_1, x_2, ..., x_N enters the input layer, passes through the hidden layers (Layer 1, Layer 2, ..., Layer L), and produces the output y_1, y_2, ..., y_M at the output layer. Deep means many hidden layers.

18 Example of Neural Network. Sigmoid function: σ(z) = 1 / (1 + e^{-z}).
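To make the neuron on slide 16 concrete, here is a minimal sketch of a single sigmoid neuron in Python; the inputs, weights, and bias are made-up numbers, not taken from the slides.

    import numpy as np

    def sigmoid(z):
        # sigma(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(a, w, b):
        # z = a_1 w_1 + ... + a_K w_K + b, output a = sigma(z)
        z = np.dot(w, a) + b
        return sigmoid(z)

    # made-up inputs, weights, and bias
    a = np.array([1.0, -1.0])
    w = np.array([1.0, -2.0])
    b = 1.0
    print(neuron(a, w, b))   # sigmoid(1*1 + (-1)*(-2) + 1) = sigmoid(4), about 0.98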

19 Example of Neural Network

20 Example of Neural Network: different parameters define different functions.

21 Matrix Operation: the outputs y_1, y_2 of a layer can be computed as a matrix-vector product followed by the element-wise activation function.

22 Neural Network: with weight matrices W^1, W^2, ..., W^L and bias vectors b^1, b^2, ..., b^L, the layer outputs are a^1 = σ(W^1 x + b^1), a^2 = σ(W^2 a^1 + b^2), ..., y = σ(W^L a^{L-1} + b^L).

23 Neural Network: the whole network is the function y = f(x) = σ(W^L ... σ(W^2 σ(W^1 x + b^1) + b^2) ... + b^L). Parallel computing techniques are used to speed up the matrix operations.
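A minimal sketch of this layer-by-layer matrix computation, with randomly initialized placeholder weights and biases and a sigmoid activation; real implementations run these matrix products in parallel, which is the speed-up referred to above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, weights, biases):
        # a^1 = sigma(W^1 x + b^1), a^l = sigma(W^l a^{l-1} + b^l), y = a^L
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
        return a

    rng = np.random.default_rng(0)
    sizes = [256, 500, 500, 10]      # e.g. 256 pixel inputs, two hidden layers, 10 outputs
    weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal(m) for m in sizes[1:]]

    x = rng.standard_normal(256)     # a placeholder "image"
    y = forward(x, weights, biases)
    print(y.shape)                   # (10,)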

24 Softmax: using a softmax layer as the output layer. With an ordinary layer, y_i = σ(z_i); in general the output of the network can be any value, which may not be easy to interpret.

25 Softmax: using a softmax layer as the output layer. Softmax layer: y_i = e^{z_i} / Σ_{j=1}^{3} e^{z_j}. For example, z_1 = 3, z_2 = 1, z_3 = -3 give e^{z_1} ≈ 20, e^{z_2} ≈ 2.7, e^{z_3} ≈ 0.05, so y_1 ≈ 0.88, y_2 ≈ 0.12, y_3 ≈ 0.
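A short sketch of the softmax computation, reproducing the z = (3, 1, -3) example above; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something from the slide.

    import numpy as np

    def softmax(z):
        # y_i = e^{z_i} / sum_j e^{z_j}
        e = np.exp(z - np.max(z))
        return e / e.sum()

    z = np.array([3.0, 1.0, -3.0])
    print(softmax(z))   # approximately [0.88, 0.12, 0.00]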

26 How to set the network parameters? Input: the 16 x 16 = 256 pixels x_1, ..., x_256 (ink = 1, no ink = 0), with a softmax output layer y_1, ..., y_10. How do we get the neural network to achieve this: when the input is an image of 1, y_1 has the maximum value; when the input is an image of 2, y_2 has the maximum value.

27 Training Data Preparing training data: images and their labels Using the training data to find the network parameters.

28 Cost. For an input x_1, ..., x_256 with target ŷ (e.g., the image of a 1 with target y_1 = 1 and all other entries 0), the cost measures how far the network output y is from the target; it can be the Euclidean distance or the cross entropy of the network output and the target.

29 Total Cost. For all R training examples x_1, x_2, x_3, ..., x_R passed through the network, the total cost is C = Σ_{r=1}^{R} C_r, the sum of the costs of the individual examples.

30 Gradient Descent: the error surface. Assume there are only two parameters w_1 and w_2 in a network; the colors represent the value of C.

31 Gradient Descent: eventually, we would reach a minimum.

32 Local Minima. Gradient descent never guarantees the global minimum; different starting points reach different minima, giving different results. Who is Afraid of Non-Convex Loss Functions? _lecun_wia/

33 Besides local minima (cost vs. parameter space): very slow at plateaus, stuck at saddle points, stuck at local minima.

34 In the physical world: momentum. How about putting this phenomenon into gradient descent?

35 Momentum. Still no guarantee of reaching the global minimum, but it gives some hope: even where the gradient = 0, the accumulated momentum can keep the parameters moving. Real movement = negative of gradient + momentum.
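A minimal sketch of gradient descent with momentum on a made-up one-parameter cost C(w) = (w - 3)^2; the learning rate and momentum coefficient are arbitrary choices, not values from the slides.

    def grad(w):
        # derivative of the toy cost C(w) = (w - 3)^2
        return 2.0 * (w - 3.0)

    w, movement = 0.0, 0.0
    eta, lam = 0.1, 0.9              # learning rate and momentum coefficient
    for step in range(200):
        movement = lam * movement - eta * grad(w)   # negative gradient + momentum
        w = w + movement
    print(w)                         # converges to the minimum at w = 3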

36 Mini-batch. Pick the 1st mini-batch (e.g., x_1, x_31, ...), update the parameters using only its cost; then pick the 2nd mini-batch (e.g., x_2, x_16, ...) and update again. The cost C is different each time we update the parameters!

37 Mini-batch. Original gradient descent versus gradient descent with mini-batches: the mini-batch trajectory is unstable. The colors represent the total cost C on all training data.

38 Mini-batch: faster and often better! Pick the 1st mini-batch, update, pick the 2nd mini-batch, update, and so on until all mini-batches have been picked; that is one epoch. Repeat the above process, as sketched below.
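A sketch of the mini-batch loop described above, on a made-up linear least-squares problem, just to show the structure of one epoch: shuffle, pick mini-batches, one parameter update per mini-batch.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))             # 100 training examples, 5 features
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # targets from a made-up linear model

    w = np.zeros(5)
    eta, batch_size = 0.1, 16

    for epoch in range(20):
        order = rng.permutation(len(X))           # shuffle, then pick mini-batches
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # gradient of the cost C on this mini-batch only (so C differs per update)
            g = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= eta * g                          # one update per mini-batch
        # all mini-batches picked once = one epoch; repeat
    print(np.round(w, 2))                         # close to [1, -2, 0.5, 0, 3]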

39 Backpropagation. A network can have millions of parameters; backpropagation is the way to compute the gradients efficiently (not covered today). Ref: 5_2/Lecture/DNN%20backprop.ecm.mp4/index.html. Many toolkits can compute the gradients automatically. Ref: ture/theano%20dnn.ecm.mp4/index.html

40 Part II: Why Deep?

41 Deeper is Better? Word error rate (%) versus network shape (layers x size), from Seide et al.: stacking layers of 2k units (1 x 2k, 2 x 2k, 3 x 2k, 4 x 2k, ...) steadily lowers the word error rate, reaching 17.8% at 4 x 2k. Not surprising: more parameters, better performance. For comparison, a single wide hidden layer of 16k units reaches only 22.1%. Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

42 Universality Theorem. Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: eplearning.com/chap4.html. Why a deep neural network rather than a fat neural network?

43 Fat + Short vs. Thin + Tall. With the same number of parameters, which one is better: a shallow (fat) network or a deep (tall) network over the same inputs x_1, x_2, ..., x_N?

44 Fat + Short vs. Thin + Tall. Comparing word error rates (%) at comparable parameter counts, the thin-and-tall networks (e.g., 4 x 2k at 17.8%) clearly beat the fat-and-short ones (e.g., a single 1 x 16k layer at 22.1%). Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

45 Why Deep? Deep = Modularization. Suppose we train four image classifiers directly: Classifier 1 for girls with long hair, Classifier 2 for boys with long hair, Classifier 3 for girls with short hair, Classifier 4 for boys with short hair. The boys-with-long-hair classifier is weak, because there are only a few such examples.

46 Why Deep? Deep = Modularization. Instead, first train basic classifiers for the attributes: boy or girl? long or short hair? Each of these basic classifiers can have sufficient training examples.

47 Why Deep? Deep = Modularization. The basic classifiers are shared by the following classifiers as modules, so Classifiers 1-4 (girls/boys with long/short hair) built on top of them can be trained with little data and still work fine.

48 Why Deep? Deep = Modularization. In a deep network the modularization is learned automatically from data: the first layer learns the most basic classifiers on x_1, x_2, ..., x_N, the second layer uses the 1st layer as modules to build classifiers, the next layer uses the 2nd layer as modules, and so on. Less training data? Deep learning also works on small data sets like TIMIT.

49 SVM: a hand-crafted kernel function followed by a simple classifier. Deep learning: the hidden layers act as a learnable kernel (feature transform) of x_1, x_2, ..., x_N, followed by a simple classifier producing y_1, y_2, ..., y_M.

50 Hard to get the power of Deep: before 2006, deeper usually did not imply better.

51 Part III: Tips for Training DNN

52 Recipe for Learning

53 Recipe for Learning. Don't forget to check the results on the training data first! If they are poor, modify the network or use a better optimization strategy; if training results are good but test results are poor, that is overfitting, so focus on preventing overfitting.

54 Recipe for Learning. Modify the network: new activation functions, for example ReLU or Maxout. Better optimization strategy: adaptive learning rates. Prevent overfitting: dropout (only use this approach when you have already obtained good results on the training data).

55 Convolutional Neural Networks (CNNs)

56 Smaller Network: CNN We know it is good to learn a small model. From this fully connected model, do we really need all the edges? Can some of these be shared?

57 Consider learning from an image: some patterns are much smaller than the whole image, so a neuron can look at a small region and be represented with fewer parameters (e.g., a beak detector).

58 The same pattern appears in different places (an upper-left beak detector, a middle beak detector, ...). Instead of training a lot of such small detectors, each of which must move around, they can be compressed to the same parameters.

59 A convolutional layer. A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation; a filter is, for example, a beak detector.

60 Convolution. The filters (Filter 1, Filter 2, ...) are the network parameters to be learned; each filter detects a small (3 x 3) pattern in the 6 x 6 image.

61 Convolution, stride = 1: place Filter 1 on the top-left 3 x 3 patch of the 6 x 6 image, take the dot product, then slide the filter one pixel at a time.

62 Convolution. With a larger stride (e.g., stride = 2), Filter 1 moves more than one pixel at a time across the 6 x 6 image.

63 Convolution, stride = 1: sliding Filter 1 over all positions of the 6 x 6 image produces a 4 x 4 grid of dot products.

64 Convolution, stride = 1, on the 6 x 6 image: repeat this for each filter. Each filter produces a 4 x 4 feature map, so two filters give two 4 x 4 images, forming a 2 x 4 x 4 matrix. A sketch of the operation follows.
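A minimal sketch of the convolution operation just described, using a made-up 6 x 6 binary image and a made-up 3 x 3 filter (the slide's actual numbers are not reproduced here).

    import numpy as np

    def conv2d(image, kernel, stride=1):
        # slide the filter over the image and take a dot product at each position
        kh, kw = kernel.shape
        oh = (image.shape[0] - kh) // stride + 1
        ow = (image.shape[1] - kw) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[i, j] = np.sum(patch * kernel)
        return out

    image = np.random.default_rng(0).integers(0, 2, size=(6, 6)).astype(float)
    filter1 = np.array([[1., -1., -1.],
                        [-1., 1., -1.],
                        [-1., -1., 1.]])            # a hypothetical 3 x 3 diagonal detector
    print(conv2d(image, filter1, stride=1).shape)   # (4, 4) feature map
    print(conv2d(image, filter1, stride=2).shape)   # (2, 2) with stride 2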

65 Color image: RGB has 3 channels, so each filter (Filter 1, Filter 2, ...) also has 3 channels.

66 Convolution vs. Fully Connected: flatten the 6 x 6 image into inputs x_1, x_2, ..., x_36; a convolutional layer is then a fully connected layer with most of the connections removed.

67 Filter 1 on the 6 x 6 image: each output value connects to only 9 of the 36 inputs, not fully connected; fewer parameters!

68 Filter 1 on the 6 x 6 image: neighbouring outputs connect to overlapping sets of 9 pixels (e.g., pixels 1-3, 7-9, 13-15 and pixels 2-4, 8-10, 14-16) and use shared weights, giving even fewer parameters.

69 The whole CNN: image → Convolution → Max Pooling → Convolution → Max Pooling (these stages can repeat many times) → Flattening → fully connected feedforward network → output (e.g., cat or dog).

70 Max Pooling: the feature maps produced by Filter 1 and Filter 2 are divided into small groups of values, and only the maximum of each group is kept.

71 Why Pooling? Subsampling pixels does not change the object (a subsampled bird is still a bird), so we can subsample the pixels to make the image smaller; fewer parameters are then needed to characterize the image.

72 A CNN compresses a fully connected network in two ways: reducing the number of connections, and sharing weights on the edges. Max pooling further reduces the complexity.

73 Max Pooling: applying convolution and then max pooling to the 6 x 6 image gives a new but smaller image (here 2 x 2 per filter); each filter becomes one channel. A pooling sketch follows.
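A sketch of 2 x 2 max pooling on a single feature map; the 2 x 2 window matches the 6 x 6 image → 4 x 4 feature map → 2 x 2 pooled output shapes above, while the feature-map values are made up.

    import numpy as np

    def max_pool(fmap, size=2):
        # keep the maximum value of each non-overlapping size x size region
        h, w = fmap.shape
        fmap = fmap[:h - h % size, :w - w % size]       # crop to a multiple of size
        return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

    fmap = np.arange(16, dtype=float).reshape(4, 4)     # a made-up 4 x 4 feature map
    print(max_pool(fmap))                               # its 2 x 2 pooled version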

74 The whole CNN: each Convolution + Max Pooling stage produces a new image, smaller than the original; the number of channels equals the number of filters. The stages can repeat many times.

75 The whole CNN: Convolution → Max Pooling (a new image) → Convolution → Max Pooling (a new image) → Flattening → fully connected feedforward network → output (e.g., cat or dog).

76 Flattening: the final feature maps are flattened into a single vector, which is fed into the fully connected feedforward network.

77 CNN in Keras: only the network structure and input format change (vector -> 3-D tensor). The input shape is (28, 28, 1): 28 x 28 pixels, with 1 channel for black/white (3 for RGB). The first convolution uses 25 3 x 3 filters, followed by max pooling, then another convolution and max pooling.

78 CNN in Keras: only the network structure and input format change (vector -> 3-D array). Input 1 x 28 x 28 → Convolution → 25 x 26 x 26 (each filter has 9 parameters) → Max Pooling → 25 x 13 x 13 → Convolution → 50 x 11 x 11 (each filter has 225 = 25 x 9 parameters) → Max Pooling → 50 x 5 x 5.

79 CNN in Keras: Input 1 x 28 x 28 → Convolution → 25 x 26 x 26 → Max Pooling → 25 x 13 x 13 → Convolution → 50 x 11 x 11 → Max Pooling → 50 x 5 x 5 → Flattening → a vector of 1250 values → fully connected feedforward network → output. A sketch of the model follows.
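A hedged sketch, in Keras, of the model these three slides describe; the filter counts, kernel sizes, pooling sizes, and input shape come from the slides, while the size of the dense layer and the activation choices are assumptions.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),           # 28 x 28 pixels, 1 channel (black/white)
        layers.Conv2D(25, (3, 3)),                # 25 3x3 filters -> 26 x 26 x 25
        layers.MaxPooling2D((2, 2)),              # -> 13 x 13 x 25
        layers.Conv2D(50, (3, 3)),                # 50 filters, 225 = 25 x 9 weights each -> 11 x 11 x 50
        layers.MaxPooling2D((2, 2)),              # -> 5 x 5 x 50
        layers.Flatten(),                         # -> vector of length 1250
        layers.Dense(100, activation="relu"),     # assumed hidden dense layer
        layers.Dense(10, activation="softmax"),   # 10 digit classes
    ])
    model.summary()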

80 AlphaGo's neural network: the input is a 19 x 19 matrix (black: 1, white: -1, none: 0) and the output is the next move (19 x 19 positions). A fully connected feedforward network can be used, but a CNN performs much better.

81 AlphaGo's policy network (a quotation from their Nature article describing its architecture is shown on the slide). Note: AlphaGo does not use max pooling.

82 CNN in speech recognition: the spectrogram (frequency vs. time) is treated as an image, and the filters move in the frequency direction.

83 CNN in text classification? Source of image: d?doi= &rep=rep1&type=pdf

84 Part III: Tips for Training DNN New Activation Function

85 ReLU: Rectified Linear Unit [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]. Reasons: 1. fast to compute; 2. biological reason; 3. equivalent to an infinite number of sigmoids with different biases; 4. resistant to the vanishing gradient problem.

86 Vanishing Gradient Problem. In a deep sigmoid network the layers near the input x_1, ..., x_N have smaller gradients, learn very slowly, and stay almost random, while the layers near the output y_1, ..., y_M have larger gradients, learn very fast, and converge, based on an almost random lower part!? In 2006, people used RBM pre-training; in 2015, people use ReLU.

87 Vanishing Gradient Problem. An intuitive way to see the gradient: perturb a weight near the input x_1, x_2, ..., x_N and watch the output. Because each sigmoid squashes a large input change into a small output change, the effect shrinks layer by layer, so the gradients near the input are smaller.

88 ReLU: neurons whose input z is negative output 0 and contribute nothing to the rest of the network.

89 ReLU: removing the zero-output neurons leaves a thinner, linear network from x_1, x_2 to y_1, y_2, which does not suffer from smaller gradients.

90 Maxout: a learnable activation function [Ian J. Goodfellow, ICML'13]; ReLU is a special case of Maxout. The linear pre-activations of each neuron's inputs are grouped, and each maxout unit outputs the max over its group; you can have more than 2 elements in a group.

91 Maxout: ReLU is a special case of Maxout, a learnable activation function [Ian J. Goodfellow, ICML'13]. The activation function in a maxout network can be any piecewise-linear convex function; how many pieces it has depends on how many elements are in a group (2 elements per group give 2 pieces, 3 elements give 3 pieces). A small sketch follows.
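A small sketch of the two activation functions: ReLU, and a single maxout unit that takes the max over a group of three linear pieces (the weights below are made up). With a group of two pieces, one of which is fixed at zero, the maxout unit reduces to ReLU.

    import numpy as np

    def relu(z):
        # max(0, z), element-wise
        return np.maximum(0.0, z)

    def maxout(x, W, b):
        # one maxout unit: max over a group of linear pieces w_g . x + b_g
        return np.max(W @ x + b)

    x = np.array([1.0, 2.0])
    W = np.array([[1.0, -1.0],       # a group of 3 linear pieces (made-up weights)
                  [0.5, 0.5],
                  [-2.0, 1.0]])
    b = np.array([0.0, 0.1, -0.5])
    print(relu(np.array([-2.0, 3.0])))   # [0., 3.]
    print(maxout(x, W, b))               # max(-1.0, 1.6, -0.5) = 1.6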

92 Part III: Tips for Training DNN Adaptive Learning Rate

93 Learning Rate. Set the learning rate η carefully: if the learning rate is too large, the cost may not decrease after each update.

94 Learning Rate. Set the learning rate η carefully: if the learning rate is too large, the cost may not decrease after each update; if the learning rate is too small, training will be too slow. Can we give different parameters different learning rates?

95 Adagrad. In the original gradient descent every parameter uses the same learning rate; in Adagrad each parameter w is considered separately, with a parameter-dependent learning rate: a constant η divided by the root of the summation of the squares of that parameter's previous derivatives.

96 Adagrad. Example: with successive derivatives g^0 = 0.1, g^1 = 0.2, ..., the accumulated sum of squares grows, so the effective learning rate keeps shrinking. Observations: 1. the learning rate gets smaller and smaller for all parameters; 2. smaller derivatives give a larger learning rate, and vice versa. Why?

97 Larger derivatives get a smaller learning rate and smaller derivatives get a larger learning rate: dividing by the accumulated gradient magnitude roughly balances the step sizes taken along steep and flat directions, which is why smaller derivatives lead to a larger learning rate, and vice versa. A sketch of the update follows.
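A sketch of the Adagrad update just described, on a made-up two-parameter quadratic cost that is steep in one coordinate and flat in the other; the constant η is an arbitrary choice. Dividing by the root of each parameter's accumulated squared derivatives makes both coordinates shrink at a similar rate despite the very different curvatures.

    import numpy as np

    def grad(w):
        # gradient of a toy cost C(w) = w_1^2 + 10 * w_2^2
        return np.array([2.0 * w[0], 20.0 * w[1]])

    w = np.array([1.0, 1.0])
    eta, eps = 0.5, 1e-8
    sum_sq = np.zeros_like(w)            # accumulated squared derivatives, per parameter
    for t in range(100):
        g = grad(w)
        sum_sq += g ** 2
        w -= eta / (np.sqrt(sum_sq) + eps) * g   # parameter-dependent learning rate
    print(np.round(w, 3))                # both coordinates approach 0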

98 Not the whole story: Adagrad [John Duchi, JMLR'11], RMSprop, Adadelta [Matthew D. Zeiler, arXiv'12], Adam [Diederik P. Kingma, ICLR'15], AdaSecant [Caglar Gulcehre, arXiv'14], "No more pesky learning rates" [Tom Schaul, arXiv'12].

99 Part III: Tips for Training DNN Dropout

100 Dropout. Training (per mini-batch): each time before computing the gradients, each neuron has a p% chance to drop out.

101 Dropout. Training (per mini-batch): each time before computing the gradients, each neuron has a p% chance to drop out, so the structure of the network changes and becomes thinner. Use the new, thinner network for training; for each mini-batch we resample the dropout neurons.

102 Dropout. Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1-p)%. A sketch follows.
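A minimal sketch of this train/test behaviour for one linear layer: at training time each incoming neuron is dropped with probability p, and at test time nothing is dropped but the weights are multiplied by (1 - p). Averaging many dropped-out passes approximately matches the scaled test-time pass.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.5                               # dropout rate

    def layer_train(a, W):
        # each incoming neuron has probability p of dropping out
        mask = (rng.random(a.shape) >= p).astype(float)
        return W @ (a * mask)

    def layer_test(a, W):
        # no dropout at test time; the weights are scaled by (1 - p)
        return ((1.0 - p) * W) @ a

    a = rng.standard_normal(100)          # activations from the previous layer
    W = rng.standard_normal((10, 100))    # weights of this layer
    avg_train = np.mean([layer_train(a, W) for _ in range(5000)], axis=0)
    print(np.allclose(avg_train, layer_test(a, W), atol=0.5))   # True (roughly equal)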

103 Dropout, an intuitive reason: "my partner can get in the way, so I have to do the work well myself." When people team up, if everyone expects their partner to do the work, nothing gets done in the end; but if you know your partner might drop out, you will do better yourself. At test time no one actually drops out, so the results end up good.

104 Dropout, an intuitive reason: why should the weights be multiplied by (1-p)% (p% being the dropout rate) when testing? Assume the dropout rate is 50%: at training time, on average only half of the inputs to a neuron are present, so the weights learned during training are tuned to roughly half the signal; at test time, with no dropout, multiplying the trained weights by (1-p)% keeps the two settings consistent.

105 Dropout is a kind of ensemble. Ensemble Training Set Set 1 Set 2 Set 3 Set 4 Network 1 Network 2 Network 3 Network 4 Train a bunch of networks with different structures

106 Dropout is a kind of ensemble. Ensemble Testing data x Network 1 Network 2 Network 3 Network 4 y 1 y 2 y 3 y 4 average

107 Dropout is a kind of ensemble. Training with dropout: with M neurons there are 2^M possible thinned networks; each mini-batch (mini-batch 1, 2, 3, 4, ...) trains one of these networks, and some parameters are shared among the networks.

108 Dropout is a kind of ensemble. Testing with dropout: for test data x, multiplying all the weights by (1-p)% approximates averaging the outputs y_1, y_2, y_3, ... of all the thinned networks into y.

109 More about dropout. More references for dropout: [Nitish Srivastava, JMLR'14] [Pierre Baldi, NIPS'13] [Geoffrey E. Hinton, arXiv'12]. Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]. DropConnect [Li Wan, ICML'13]: dropout deletes neurons, DropConnect deletes the connections between neurons. Annealed dropout [S. J. Rennie, SLT'14]: the dropout rate decreases over epochs. Standout [J. Ba, NIPS'13]: each neuron has a different dropout rate.

110 Part IV: Neural Network with Memory

111 Neural networks need memory. Named entity recognition: detecting named entities such as names of people, locations, organizations, etc. in a sentence; e.g., a DNN classifies the word "apple" as people / location / organization / none.

112 Neural networks need memory. Named entity recognition: in the sentence "the president of apple eats an apple" (x_1, ..., x_7), the first "apple" should be tagged ORG and the second NONE, but a DNN applied word by word produces the same output y for both occurrences; the DNN needs memory!

113 Recurrent Neural Network (RNN). The outputs a_1, a_2 of the hidden layer are copied and stored in the memory; the memory can be considered as another input alongside x_1, x_2 when producing y_1, y_2.

114 RNN. Unrolled over time: a_t = σ(W_i x_t + W_h a_{t-1}), y_t = W_o a_t. The same network (the same W_i, W_h, W_o) is used again and again, and the output y_i depends on x_1, x_2, ..., x_i. A forward-pass sketch follows.
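A minimal sketch of this recurrence; tanh is used as the activation and the weights are random placeholders. The same W_i, W_h, W_o are reused at every time step, and the hidden state a carries the memory.

    import numpy as np

    def rnn_forward(xs, Wi, Wh, Wo):
        # a_t = tanh(Wi x_t + Wh a_{t-1}), y_t = Wo a_t, same weights at every step
        a = np.zeros(Wh.shape[0])
        ys = []
        for x in xs:
            a = np.tanh(Wi @ x + Wh @ a)      # hidden state = the memory
            ys.append(Wo @ a)
        return ys

    rng = np.random.default_rng(0)
    d_in, d_hid, d_out = 4, 8, 3
    Wi = rng.standard_normal((d_hid, d_in))
    Wh = rng.standard_normal((d_hid, d_hid))
    Wo = rng.standard_normal((d_out, d_hid))

    xs = [rng.standard_normal(d_in) for _ in range(7)]   # e.g. a 7-word sentence
    ys = rnn_forward(xs, Wi, Wh, Wo)
    print(len(ys), ys[0].shape)                          # 7 outputs, each of size 3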

115 RNN: how to train? Each output y_1, y_2, y_3 is compared with its target, giving losses L_1, L_2, L_3; find the network parameters (W_i, W_h, W_o) that minimize the total cost Σ_t L_t using backpropagation through time (BPTT).

116 Of course it can be deep: several recurrent hidden layers can be stacked between the inputs x_t, x_{t+1}, x_{t+2} and the outputs y_t, y_{t+1}, y_{t+2}.

117 Bidirectional RNN: one RNN reads x_t, x_{t+1}, x_{t+2} forward and another reads the same sequence backward; each output y_t is produced from the hidden states of both.

118 Boosting: several weak classifiers are trained and combined into a strong one. Deep learning: the layer on x_1, x_2, ..., x_N acts like a weak classifier, which is boosted (refined) by each successive layer.

119 Deep Learning to Optimize!! The recent machine learning boom was caused by a move from hand-designed features to learned features; in spite of this, optimization algorithms are still designed by hand. Can we cast the design of an optimization algorithm as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way? A recent NIPS paper from Google DeepMind says yes, we can; they use an LSTM: Andrychowicz M. et al. (2016). Learning to learn by gradient descent by gradient descent. NIPS. The authors make a clever replacement: learn the updating function g with an LSTM. Can we learn the optimal adaptation rules for EAs, instead of hand-crafting them?

120 Data Irregularities = Confused Classifiers!! Data irregularities are distribution-based (class imbalance, small disjuncts, class skew) or feature-based (missing features, absent features).

121 Class Imbalance: a large fraction of the data comes from at least one class and/or a small fraction comes from at least one class. Standard learners are biased towards the majority class (figure: actual vs. ideal decision boundary).

122 Phishing. 1. For 1 hour, Google collects 1M emails randomly. 2. They pay people to label them as phishing or not-phishing. 3. They give the data to you to learn to classify emails as phishing or not. 4. You try out a few of your favorite classifiers. 5. You achieve an accuracy of ...%. Hooray!!!

123 Class Imbalance: usual handling techniques. Different costs (Veropoulos et al., 1999): costly, as the cost set requires tuning. Over-sampling, e.g. SMOTE (Synthetic Minority Over-sampling Technique) (Chawla et al., 2002): training complexity increases, the extent of SMOTE has to be tuned, it can lead to over-generalization of the minority class, and it can be problematic for highly skewed class distributions. Under-sampling, e.g. RUS (Random Under-sampling) (Japkowicz, 2000): critical points may be lost, and the extent of RUS also has to be tuned.

124 Small Disjuncts. Classes are constituted of sub-concepts, some of which are much smaller than others and are consequently underrepresented (Holte et al., 1989; Weiss, 2005). Often occurs in conjunction with class imbalance (figure: actual vs. ideal decision boundary).

125 Disjunctive Definitions in Concept Learning. Given positive and negative examples of the concept "nice day", such a system might learn the following definition for nice day, which contains 2 disjuncts: (Temperature = "Warm" & Rain = FALSE) v (Temperature = "Hot" & Breeze = TRUE). A small disjunct is a disjunct that covers only a few training examples. A genuine problem, but usually ignored outside the decision tree literature. Leaf nodes with a small number of training points are thought to represent small disjuncts; special techniques, such as 1-NN classification, are employed for such nodes.

126 Class Skew: the classes are characterized by highly dissimilar distributions (figure: ideal vs. actual decision boundary).

127 A few examples. Credit card fraud detection: class imbalance, class skew. Breast cancer diagnosis: class imbalance, class skew, small disjuncts. Market segmentation: class imbalance, class skew. Facial and emotion recognition: small disjuncts.

128 Feature-based Irregularities. Missing features / unstructured missingness: equipment malfunction, data loss, transmission errors, etc. Absent features / structured missingness: not all features are defined for all data instances.

129 Feature based Irregularities Types of Missingness (Little and Rubin, 1987): Missing Completely At Random MCAR (missingness does not depend on feature values) Missing At Random - MAR (missingness depends only on observed feature values) Missing Not At Random: Missingness depends only on missing features MNAR-I Missingness depends on both observed as well as unobserved feature values MNAR-II

130 A few more examples. Survey data: unstructured missingness. Character recognition: unstructured missingness. Gene expression data: structured missingness.

131 Feature-based Irregularities: usual handling techniques. Marginalization: a large fraction of the data may be lost. Imputation (Donders et al., 2006): assumes the data to be missing at random; not suitable if the missingness has a pattern to it, e.g. if only small values are missing, the imputed estimates will be higher than the actual missing values.

132 Maximum Margin Classifiers. The perceptron can lead to many equally valid choices for the decision boundary. Are these really equally valid?

133 Max Margin. How can we pick which boundary is best? Maximize the size of the margin: a small margin vs. a large margin. Are these really equally valid?

134 134

135 Handling Class Imbalance in SVMs: NBSVM, a modification of the SVM that reduces the Bayes error by combining a decision-boundary shift with cost sensitivity (figure: the standard SVM boundary, the cost-compensated SVM boundary, the cost-compensated SVM with boundary shift, and the ideal decision boundary, over the mixed probability distributions of the majority and minority classes).

136 Handling Class Imbalance in SVMs: the Near-Bayesian SVM (NBSVM). Assumptions: the margin of the classifier coincides with the region of overlap between the two classes, and the two classes have similar rates of decay within the margin. The formulation keeps the usual SVM objective, min (1/2)||w||^2 + D Σ_{i∈C+} ξ_i + D Σ_{i∈C−} ξ_i, but the constraints on y_i(w^T x_i + b) use unequal margin offsets for the two classes C+ and C−, shifting the boundary towards the Bayes-optimal one (see the cited paper for the exact offsets). S. Datta and S. Das, "Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs," Neural Networks, vol. 70, 2015.

137 Handling Class Imbalance in SVMs: the unbalanced NBSVM (uNBSVM). If the minority class has a misclassification cost of P+ and the majority class a misclassification cost of P−, both the slack penalties and the margin offsets in the NBSVM formulation are weighted by the relative costs P+ and P− (see the cited paper for the exact formulation). S. Datta and S. Das, "Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs," Neural Networks, vol. 70, 2015.

138 Handling Class Imbalance in SVMs: NBSVM/uNBSVM (using SMO to solve the SVM dual). Some sample results: comparing SVM accuracy against NBSVM on German (72.11, w), Diabetes (76.47, e), Ionosphere (92.63, w), Car Eval (w), and Vehicle (72.39, w), and comparing SDC-SVM gmean against uNBSVM on Abalone (w), Breast Cancer (e), Haberman (61.51, w), Liver (64.30, w), and Car (e); w = significantly worse than NBSVM/uNBSVM, e = statistically equivalent to NBSVM/uNBSVM. Here gmean = sqrt(TPR x TNR) and acc = (TP + TN) / (TP + FP + TN + FN).

139 Handling Small Disjuncts in SVMs using Boosting: kernel scaling. K_new(x, y) = D(x) K(x, y) D(y), where D(x) = exp(-k f(x)^2), f is the SVM decision function, and k is a tunable parameter. Starting from a very small original margin, the scaling induces contour lines of higher resolution close to the boundary to increase separability.

140 Handling Small Disjuncts in SVMs using Boosting: asymmetric kernel scaling for handling imbalanced classification. K_new(x, y) = D(x) K(x, y) D(y), where D(x) = exp(-k+ f(x)^2) on the minority side of the boundary and D(x) = exp(-k- f(x)^2) on the majority side; k+ and k- are chosen so that the contour lines are sparser (higher resolution) on the minority side and denser (lower resolution) on the majority side. A hedged sketch follows.
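A hedged sketch of the asymmetric conformal scaling above: the base RBF kernel K is multiplied by D(x) D(y), with D(x) = exp(-k f(x)^2) and a different k on the two sides of the current decision function f. The decision function, the kernel width, and the k values below are placeholders; the exact recipe in the cited work may differ.

    import numpy as np

    def rbf(x, y, gamma=1.0):
        # base RBF kernel K(x, y)
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def D(x, f, k_pos, k_neg):
        # conformal factor with a different scale parameter on each side of the boundary
        k = k_pos if f(x) >= 0 else k_neg
        return np.exp(-k * f(x) ** 2)

    def scaled_kernel(x, y, f, k_pos=0.1, k_neg=1.0):
        # K_new(x, y) = D(x) K(x, y) D(y)
        return D(x, f, k_pos, k_neg) * rbf(x, y) * D(y, f, k_pos, k_neg)

    f = lambda x: x[0] - x[1]            # placeholder decision function
    x, y = np.array([1.0, 0.0]), np.array([0.5, 0.5])
    print(scaled_kernel(x, y, f))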

141 Handling Small Disjuncts in SVMs using Boosting: asymmetric kernel scaling raises questions. How do we choose proper values of k+ and k-? Typically a costly 2-D grid search is required. How many iterations of perturbation must be undertaken? The method is observed to overfit within a few iterations when retrained on the perturbed kernel.

142 Handling Small Disjuncts in SVMs using Boosting: the main idea of boosting. Boosting can help overcome the overfitting, but SVMs, being strong learners, have low sensitivity to resampling; hence SVMs are hard to boost in the traditional manner. Can the diversity offered by kernel perturbation help in boosting?

143 Handling Small Disjuncts in SVMs using Boosting: KP-Boost-SVM. Retain higher resolution only around the misclassified points; for correctly classified points the local scaling k is reduced, and the SVM is retrained. Greater immunity for small disjuncts! S. Datta, S. Nag, S. S. Mullick, S. Das, "Diversifying Support Vector Machines for Boosting using Kernel Perturbation: Applications to Class Imbalance and Small Disjuncts," IEEE Transactions on Cybernetics (under review), 2017.

144 Handling Small Disjuncts in SVMs using Boosting: KP-Boost-SVM. Train the initial SVM and specify a step size; start with k = 1 for all points. Repeat T times: find the correctly classified points and update their k by the step size (so that only the misclassified points retain the higher resolution), then retrain. Use the final k values for the training points and k_test = 1 for testing. Since most of the misclassifications come from the smaller disjuncts, the smaller disjuncts eventually develop higher resolution; and since most of the smaller disjuncts belong to the minority class, class imbalance is also mitigated.

145 Handling Small Disjuncts in SVMs using Boosting: KP-Boost-SVM results. On Abalone 9vs18, Car3, CNAE9-2, and MNIST 2vs17, KP-Boost-SVM is compared with SVM, AKS, and RUSBoost on the gmean and GSDI indices; by the signed-rank test, SVM, AKS, and RUSBoost are significantly worse than KP-Boost-SVM (H_1) on both indices (H_0 denoting statistical equivalence).

146 Handling Missing Features in k-NN. Traditional classifiers like k-NN cannot be applied directly; a Penalized Dissimilarity Measure = Euclidean distance over the observed features + a penalty for the missing features.

147 Handling Missing Features in k-NN: an example. Original dataset: X = {x_1 = (1, 5), x_2 = (2, 3), x_3 = (3, 6)}. Dataset with missingness: X' = {x'_1 = (*, 5), x_2 = (2, 3), x_3 = (3, 6)}. Imputation techniques: zero imputation gives x̂_1 = (0, 5), average imputation gives x̂_1 = (2.5, 5), 1-NN imputation gives x̂_1 = (3, 5). Penalized Dissimilarity Measure (PDM): let us formulate a simple PDM as δ(x'_1, x_i) = (x_{1,2} - x_{i,2})^2 + 1/2, i.e. the squared distance over the observed second feature plus a penalty of 1/2 for the missing first feature. Then δ(x'_1, x_2) = (5 - 3)^2 + 1/2 = 4.5 and δ(x'_1, x_3) = (5 - 6)^2 + 1/2 = 1.5.

148 Handling Missing Features in k-NN: k-NN-FWPD, the Feature Weighted Penalty based Dissimilarity (FWPD). d_fwpd(x, y) combines the Euclidean distance d_E(x, y) computed over the features observed in both x and y (normalized by the maximum observed pairwise distance d_max) with a feature-weighted penalty p(x, y) that sums the weights of the features missing from x and/or y; d_fwpd is then used to find the nearest neighbours for k-NN classification. A hedged sketch follows. S. Datta, D. Misra, and S. Das, "A Feature Weighted Penalty based Dissimilarity Measure for k-Nearest Neighbor Classification with Missing Features," Pattern Recognition Letters, vol. 80, 2016.
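A hedged sketch of a penalized dissimilarity of this kind: Euclidean distance over the features observed in both points, plus a penalty proportional to the total weight of the features missing from either point. The alpha weighting and the normalization are simplifications relative to the FWPD of the cited paper; the example point reuses x'_1 = (*, 5) and x_2 = (2, 3) from slide 147.

    import numpy as np

    def penalized_dissimilarity(x, y, weights, alpha=0.5):
        # x, y: 1-D arrays with np.nan marking missing features
        obs = ~np.isnan(x) & ~np.isnan(y)            # features observed in both points
        d_obs = np.sqrt(np.sum((x[obs] - y[obs]) ** 2)) if obs.any() else 0.0
        penalty = weights[~obs].sum() / weights.sum()   # weight of missing features
        return (1 - alpha) * d_obs + alpha * penalty

    x = np.array([np.nan, 5.0])       # x'_1 = (*, 5)
    y = np.array([2.0, 3.0])          # x_2 = (2, 3)
    w = np.ones(2)                    # equal feature weights for this sketch
    print(penalized_dissimilarity(x, y, w))   # 0.5 * 2 + 0.5 * 0.5 = 1.25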

149 Handling Missing Features in k-NN: k-NN-FWPD results. The average ranks on classification accuracies compare zero imputation, mean imputation, kNN imputation, and kNN-FWPD under each type of missingness (MCAR, MAR, MNAR-I, MNAR-II).

150 Conclusions. Traditional learning methods falter if the datasets are plagued by data irregularities. The discussed methods are designed so as to boil down to the traditional learning schemes in the absence of imbalance, small disjuncts, feature disparity, etc. The study of the effect of such data irregularities on large-dimensional data (k >> n) may be an interesting field for future research. A multi-objective formulation of the SVM for imbalanced classification can be investigated (optimize w and the different regularization costs D+ and D- simultaneously to reach the best trade-off solution).

151 Future Challenges for deep and shallow Machine Learning. Imbalanced classification with missing features (generalizing NBSVM with an FWPD-based distance). Identifying the novelty class / one target class (minority). Imbalanced classification of vision data with highly skewed class distributions: can we learn a deep representation? (Huang et al., CVPR, 2016). Deep representation with an SVM involved for the target minority class / novelty class (e.g., weapon detection in surveillance video?) (for preliminary results, see Tang, ICML, 2013). Enjoy the best of both worlds: strong feature representation + strong classifier.

152 152


More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures. Lecture 04

Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures. Lecture 04 Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures Lecture 04 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Self-Taught Learning 1. Learn

More information

Neural Networks and Deep Learning.

Neural Networks and Deep Learning. Neural Networks and Deep Learning www.cs.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts perceptrons the perceptron training rule linear separability hidden

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Demystifying deep learning. Artificial Intelligence Group Department of Computer Science and Technology, University of Cambridge, UK

Demystifying deep learning. Artificial Intelligence Group Department of Computer Science and Technology, University of Cambridge, UK Demystifying deep learning Petar Veličković Artificial Intelligence Group Department of Computer Science and Technology, University of Cambridge, UK London Data Science Summit 20 October 2017 Introduction

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Neural Networks biological neuron artificial neuron 1

Neural Networks biological neuron artificial neuron 1 Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input

More information

Introduction to Deep Learning CMPT 733. Steven Bergner

Introduction to Deep Learning CMPT 733. Steven Bergner Introduction to Deep Learning CMPT 733 Steven Bergner Overview Renaissance of artificial neural networks Representation learning vs feature engineering Background Linear Algebra, Optimization Regularization

More information

CSE 591: Introduction to Deep Learning in Visual Computing. - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan

CSE 591: Introduction to Deep Learning in Visual Computing. - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan CSE 591: Introduction to Deep Learning in Visual Computing - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan Overview Background Why another network structure? Vanishing and exploding

More information

Deep Learning Lecture 2

Deep Learning Lecture 2 Fall 2016 Machine Learning CMPSCI 689 Deep Learning Lecture 2 Sridhar Mahadevan Autonomous Learning Lab UMass Amherst COLLEGE Outline of lecture New type of units Convolutional units, Rectified linear

More information

Slide credit from Hung-Yi Lee & Richard Socher

Slide credit from Hung-Yi Lee & Richard Socher Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information