Tutorial on Machine Learning for Advanced Electronics

Similar documents
Recap from previous lecture

Deep Learning Lab Course 2017 (Deep Learning Practical)

CS 6375 Machine Learning

Bits of Machine Learning Part 1: Supervised Learning

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Linear Models for Regression. Sargur Srihari

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

A summary of Deep Learning without Poor Local Minima

Lecture 2 Machine Learning Review

Machine Learning Lecture 7

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine Learning Lecture 5

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

STA 414/2104: Lecture 8

Introduction to Machine Learning

Reminders. Thought questions should be submitted on eclass. Please list the section related to the thought question

Reading Group on Deep Learning Session 1

CSC 411 Lecture 10: Neural Networks

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Linear & nonlinear classifiers

We choose parameter values that will minimize the difference between the model outputs & the true function values.

Introduction to Machine Learning

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Machine Learning Basics III

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Machine Learning Linear Models

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Introduction to Machine Learning

Neural Networks in Structured Prediction. November 17, 2015

MODULE -4 BAYEIAN LEARNING

Deep Learning for NLP

Linear & nonlinear classifiers

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Notes on Back Propagation in 4 Lines

11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1)

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Lecture 3: Introduction to Complexity Regularization

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

Overfitting, Bias / Variance Analysis

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Feedforward Neural Networks

ECE521 Lecture 7/8. Logistic Regression

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

CSC321 Lecture 5: Multilayer Perceptrons

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)

Logistic Regression & Neural Networks

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Artificial Neural Network

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Advanced computational methods X Selected Topics: SGD

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Optimization Methods for Machine Learning (OMML)

STA 414/2104: Lecture 8

Generalization theory

Introduction to Neural Networks

Lecture 5 Neural models for NLP

CSC242: Intro to AI. Lecture 21

PATTERN CLASSIFICATION

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

6.036 midterm review. Wednesday, March 18, 15

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Advanced Machine Learning

Machine Learning and Data Mining. Linear classification. Kalev Kask

Introduction to Support Vector Machines

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Neural Networks and Deep Learning

Artificial Neural Networks

Neuroscience Introduction

Introduction to Neural Networks

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

Neural Network Training

Introduction to (Convolutional) Neural Networks

CSC321 Lecture 8: Optimization

CSC321 Lecture 9: Generalization

Week 5: Logistic Regression & Neural Networks

Linear Models for Regression

Lecture 17: Neural Networks and Deep Learning

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science

CSC321 Lecture 2: Linear Regression

ESS2222. Lecture 4 Linear model

Pattern Recognition and Machine Learning

Importance Reweighting Using Adversarial-Collaborative Training

Lecture 5: Logistic Regression. Neural Networks

Deep Feedforward Networks

CMU-Q Lecture 24:

Machine Learning. Boris

Support Vector Machines for Classification: A Statistical Portrait

Regularization via Spectral Filtering

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Computational and Statistical Learning Theory

Transcription:

Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017

Part I (Some) Theory and Principles

Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling a computer to perform well on a given task without explicitly programming the task improving performance on a task based on experience

Statistical Learning Paradigm Y }{{} output = f }{{} ( ) }{{} X input Experience comes in the form of training data (X 1, Y 1 ), (X 2, Y 2 ),..., (X n, Y n ) sampled at random from the phenomenon of interest unknown distribution p(x, y) Performance is measured by expected accuracy of inference (prediction on previously unseen instances): R(f) = E[l(f(X), Y )] = l(f(x), y) }{{} prediction error p(x, y) }{{} data distribution dx dy Improvement comes from seeing more data before settling on a hypothesis

Goals of Learning (X 1, Y 1 ),..., (X n, Y n ) }{{} training data Predictive consistency algorithm R( f n ) n min f F R(f) f n }{{} hypothesis eventually, learn to predict as well as the best predictor you could find with full knowledge of p(x, y) Generalization 1 n n l( f n (X i ), Y i ) R( f n ) i=1 performance of f n on the training data is a good predictor of its performance on previously unseen instances Efficiency minimize resource usage (data, computation time, memory,...)

Learning and Inference Pipeline Learning and inference pipeline Learning Training Samples Features Training Labels Training Learned model Inference Learned model Features Predic'on Test Sample Image credit: Svetlana Lazebnik

Data Requirements Predictive consistency: for given ε, δ > 0, what is the smallest n, such that with probability 1 δ? R( f n ) min f F R(f) ε Generalization: for given ε, δ > 0, what is the smallest n, such that 1 n l( n f n (X i ), Y i ) R( f n ) ε i=1 with probability 1 δ? Efficiency the golden standard is n(ε, δ) 1 ε 2 log 1 δ from P[error diff. ε] c 1 e c 2nε 2

Underfitting and Overfitting Underfitting: not enough data R( f n ) min R(f) is large f F Overfitting: low training error but large testing error R( f n ) 1 n l( n f n (X i ), Y i ) is large i=1 High Bias Low Variance Low Bias High Variance Prediction Error ts Test Sample Training Sample Low Model Complexity High Image credit: Hastie et al., Elements of Statistical Learning

Generative and Discriminative Models (X 1, Y 1 ),..., (X n, Y n ) }{{} training data algorithm f n }{{} hypothesis Recall: distribution p(x, y) is unknown Generative modeling: use the data to estimate p, then find the best predictor for p n : (X 1, Y 1 ),..., (X n, Y n ) p n (x, y) f n terminology: generative model explicitly models how input x and the output y are generated advantages: pn can be used for other tasks besides prediction (e.g., exploring the data space) disadvantages: inflated sample complexity, especially in high dimension

Generative and Discriminative Models (X 1, Y 1 ),..., (X n, Y n ) }{{} training data algorithm f n }{{} hypothesis Recall: distribution p(x, y) is unknown Discriminative modeling: focus on prediction only (X 1, Y 1 ),..., (X n, Y n ) f n terminology: discriminative model (implicitly) models the distribution of y for each fixed x, allowing to discriminate between different values of y advantages: smaller data requirements disadvantages: narrowly focused; will need retraining if the task changes

Part II Canonical ML Tasks and Algorithms

Supervised and Unsupervised Learning Prediction is an example of supervised learning: training data consist of input instances matched (or labeled) with output instances to guide the learning process Many learning tasks are unsupervised: no labels provided, and the goal is to learn some sort of structure in the data Performance criteria are more vague and subjective than in supervised learning Unsupervised learning is often an initial step in a more complicated learning and inference pipeline

Unsupervised Learning Clustering: Unsupervised discoverlearning groups of similar instances Clustering Discover groups of similar data points Manifold learning: discover low-dimensional structures Density estimation: approximate probability density of data Unsupervised learning is often used at early stages of ML pipeline. Image credits: Svetlana Lazebnik

Supervised Learning: Classification (X, Y ) p input/output Classification: assign attributes X to one of M classes Example: email spam detection Criterion: probability of misclassification R(f) = P[f(X) Y ], so that l(y, f(x)) = { 0, if f(x) = Y 1, if f(x) Y

Consider the following setting. Let Supervised Learning: Regression (X, Y ) p Y = f (X) + W, where X is a random variable (r.v.) on X = [0, 1], W is a r.v. on Y = R, independent of X input/output Finally let f : [0, 1] R be a function satisfying E[W ] = 0 and E[W 2 ] = σ 2 <. f (t) f (s) L t s, t, s [0, 1], Regression: predict a continuous output Y given X Example: curve fitting from noisy data where L > 0 is a constant. A function satisfying condition (1) is said to be Lipschitz on [0, such a function must be continuous, but it is not necessarily differentiable. An example of is depicted in Figure 1(a). 1.5 1.4 1.2 1 0.5 1 0.8 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 (a) 0 0 0.2 0.4 0.6 0.8 (b) Figure 1: Example of a Lipschitz function, and our observations setting. (a) random samp points correspond to (X i, Y i ), i = 1,..., n; (b) deterministic sampling of f, the points (i/n, Y i ), i = 1,..., n. Criterion: mean squared error R(f) = E[(f(X) Y ) 2 ], so that l(y, f(x)) = (f(x) Y ) 2 Note that

Example 1: Kernel Methods Training data: (X 1, Y 1 ),..., (X n, Y n ) Look for f n of the form n f c (x) = c i K(x X i ) i=1 Coefficients c 1,..., c n chosen to minimize n (f c (X i ) Y i ) 2 i=1 (subject to additional constraints) K(x x ): potential field due to unit charge at x Advantage: easy to optimize Disadvantage: need to store c and X i s to do inference Image credit: Aizerman, Braverman, Rozonoer

Example 2: Neural Networks and Deep Learning x 1 w 1 x 2 x d. w 2 w d 1 b y = f(w 1 x 1 +... + w d x d + b) where: w 1,..., w d tunable gains b tunable bias f( ) activation function f y Build up complicated functions by composing simple units Examples of activation functions: { 1, z 0 threshold, f(z) = 0, z < 0 sigmoid, f(z) = 1 1+e z tanh, f(z) = ez e z e z +e z rectified linear unit (ReLU), f(z) = max(z, 0) Advantage: neural networks are universal approximators for smooth functions y = f(x)

Example 2: Neural Networks and Deep Learning z 1 = f(a 1 x + b 1 ), z 2 = f(a 2 z 1 + b 2 ),..., y = f(a k z k 1 + b k ) Advantages: deep networks can efficiently represent rich classes of functions; work extremely well in practice Disadvantages: training is computationally intensive; poorly understood theoretically

Part III Interaction between Data and Algorithms

Recap: Learning and Inference Pipeline Learning and inference pipeline Learning Training Samples Features Training Labels Training Learned model Inference Learned model Features Predic'on Test Sample What happens before the training step?

Feature Extraction and Dimensionality Reduction Which features to use? traditionally, this step relied on extensive expert knowledge hand-engineered features tailored to the task at hand deep learning aims to automate this task (think of each layer performing feature extraction) Dimensionality reduction use unsupervised learning to discover low-dimensional structure in the data for example, principal component analysis (PCA)

Principal Component Analysis (PCA) compute sample covariance matrix: Σ n = 1 n n (X i µ n )(X i µ n ) T, µ n = 1 n X i n i=1 }{{} sample mean i=1 eigendecomposition: Σ n = UDU T retain top k eigenvectors Image source: MathWorks

What To Do If You Don t Have Enough Data? Bootstrap Training data: (X 1, Y 1 ),..., (X n, Y n ) If n is small, there is risk of underfitting Bootstrap is a resampling method: for j = 1 to m end for draw I(j) uniformly at random from {1,..., n} let Z j = (X I(j), Y I(j) ) return Z 1, Z 2,..., Z m Bootstrap can be used to get a good estimate of error bars for a chosen class of models

Gradient Descent and Its Relatives Training is an optimization problem: minimize L(w) = 1 n n (Y i f(w, X i )) 2 e.g., x f(w, x) is a deep neural network, so w is a (very) high-dimensional vector i=1 Gradient descent w t+1 = w t η t L(w t ), η t learning rate Backpropagation: use chain rule to compute gradients efficiently Computing L(w) using backprop takes O(nkm) operations, where k is the number of layers and m is the number of connections per layer Still computationally expensive, requires full pass through training data at each iteration

Stochastic Gradient Descent Gradient descent w t+1 = w t η t L(w t ), η t learning rate L(w) = 1 n w (Y i f(w, X i )) 2 n i=1 Stochastic gradient with mini-batches: essentially, bootstrap for t = 1, 2,... draw I(1, t),..., I(b, t) with replacement from {1,..., n} w t+1 = w t η t 1 b w l(y I(j,t) f(w t, X I(j,t) )) 2 b end for j=1 Computational-statistical trade-offs: each iteration takes O(bkm) operations instead of O(nkm) gradient error is now 1/b instead of 1/n

Part IV Machine Learning for EDA

Main Challenges for EDA Training data are waveforms X = (X(t)) 0 t T, Y = (Y (t)) 0 t T e.g., X is voltage, Y is current Many different types of models/analysis: steady-state transient small-signal... How do we represent waveform data? sampling rate affects quality of learned models different features are useful for different analyses (rise/hold time; frequency/phase; duration;...) How can we optimize measurement collection process?

Recurrent Neural Networks y 1 y 2 y 3 y T 0 1 f f 2 3 f... T 1 f T x 1 x 2 x 3 x T Nonlinear state-space model: ξ t = g(aξ t 1 + Bx t ) y t = h(cξ t + Dx t ) Physics-based models of nonlinear electronic devices are state-space models RNNs are universal approximators for such systems States (internal variables) can be discovered automatically, without hand tuning

Learning RNN Models of Nonlinear Devices y 1 y 2 y 3 y T 0 1 f f 2 3 f... T 1 f T x 1 x 2 x 3 x T Data: sampled waveforms (τ - sampling period, T - # samples) X i = (X i (0), X i (τ),..., X i (T τ)) Y i = (Y i (0), Y i (τ),..., Y i (T τ)) i = 1,..., n RNN is a T -layer NN, with parameters shared by all neurons Training by SGD with Backpropagation Through Time (BPTT) Output: weight matrices A, B, C, D; discrete-time approximation ξ k+1 = g(aξ k + Bx k ) y k = h(cξ k + Dx k )

Main Challenges Which stimuli to use? The learned model may not be equally suitable for different types of analysis (transient vs. steady-state) The learned model should be stable, plug-and-play with Verilog-A Enforcing physical constraints: zero in/zero out, symmetry, polarity, etc. Usual issues with BPTT: vanishing/exploding gradients, etc.

Capturing Variability: Generative Modeling Device model: y = g(x, β), where x is input, y is output, β are device parameters. Variability: β is a random quantity which varies device to device, yet cannot be sampled directly Data: sample input-ouptut behavior from many devices of the same type Learn a generative model for (x, y) use deep NN to learn complicated functions mapping explicit design parameters to latent variability factors maximum likelihood via SGD with backprop bootstrap can be used to compensate for lack of data Disadvantage: training takes time Advantage: once learned, model is easy to sample from, can be used to explore variability, 3σ typical behavior, etc.

Compositional ML Training and inference/simulation pipeline: { } training ξ = f(ξ, x) Verilog-A data simulation y = g(ξ, x) Theoretical and practical challenge: optimize end-to-end performance The learned model should be internally stable enforce stability during training via appropriate penalty functions The learned model should be ready for Verilog-A RNN outputs a discrete-time model Verilog-A uses adaptive time discretization to simulate behavior need deep understanding of numerical methods for ODEs to ensure end-to-end stability and robustness

For more questions: maxim@illinois.edu