Machine Learning Instructor: Pranjal Awasthi
Course Info Requested an SPN and emailed me? Wait for Carol Difrancesco to give them out. Not registered and need an SPN? Email me after class (no promises). It's a large class, so I won't allow sitting in without registering. Sorry!
Course Staff Instructor: Pranjal Awasthi (pranjal.awasthi@rutgers.edu). Research interests: semi-supervised learning, clustering, online learning, learning theory. Office hours: Monday 2-3pm, Hill 448. Course website: www.rci.rutgers.edu/~pa336/ml_s16.html
Course Staff TA: Yan Zhu (yanzhu.cbim@cs.rutgers.edu). Research interests: large-scale machine learning, deep learning, computer vision. Office hours: Friday 11am-12pm, CBIM.
Course Info No required textbook Recommended
Course Info ~5 homeworks (40%); in-class midterm (30%) (March 10, no makeup exam); final project (30%). Zero tolerance for cheating: academic integrity policy. Grading: 90-100 A, 85-89 B+, 80-84 B, 75-79 C+, 70-74 C, below 70 F.
Homework Policy ~2 weeks per homework. Submit via Sakai. Should be typeset in LaTeX (see website). Late homeworks are not accepted. No regrading: the TA is the boss. You are encouraged to discuss, but write solutions in your own words, and write the names of the people you discussed with. Start early!
Homework Policy Typically two parts: conceptual/analytical and programming. Conceptual: justify your solution, with rigorous proofs when asked for; aims to test fundamentals. Programming: Matlab for homeworks; justify your findings; submit well-documented code and make sure the code runs. HW0 is up on the webpage (no need to submit).
A word about the course The course is designed to be tough: more theoretical than previous courses. You should be comfortable with basic probability, linear algebra, and algorithms. If you cannot do HW0, consider dropping the course. How can I do well? Come to lectures, ask questions, take notes, and play around with data and methods.
What is Machine Learning? Statistics: the science of making sound inferences and predictions from data. Computer Science: the study of algorithms that improve performance on a given task with time and experience. The part of AI that is actually useful!
History of Machine Learning Pre-1950s: statistics and probability theory. 1950s-80s: the AI phase. Post-90s: modern machine learning.
Pre-1950s Collection and analysis of data has always been around, traditionally for governance and politics. 1500s: collection of data on deaths, marriages, and baptisms in England and France; analyzed by humans; not very scientific. 1700s: probability theory became a big tool, with a lot of work on studying gambling.
Pre-1950s Pearson: analyzed the crab population near Naples. Wanted to understand the nature of the population. Claimed that there are two underlying species. Statistical modeling.
Pre-1950s "I can taste and tell whether the tea was added first or the milk." "Hmm, how do I verify that?" Experimental design.
Pre-1950s Lots of fundamental questions that are still relevant: How to design an experiment? How to collect data and do surveys/polls? How to choose between different hypotheses? How to understand hidden structure in the data?
Post-1950s CS enters; AI is coined. Can intelligent machines be built? The Turing test.
Post-1950s 1952: a program for checkers.
Post-1950s 60s: ELIZA.
Post-1950s 70s: MYCIN for medical diagnosis, with a knowledge base of ~600 rules. Most machine learning systems were rule-based or knowledge-based. Limitations quickly became clear.
Post-90s Statistical machine learning. Data-driven algorithm design.
Modern ML [figure: ML algorithm pipeline]
What you'll learn in this course ML algorithms: support vector machines, naïve Bayes, logistic regression, linear regression, decision trees, boosting, graphical models, reinforcement learning, deep learning, model selection, optimization, kernel methods, learning theory, Bayesian methods, semi-supervised learning.
Probability Overview Random variable: X is a map from a set Ω to ℝ, where Ω is equipped with a probability measure P. P(X ∈ A) = P({ω ∈ Ω : X(ω) ∈ A}). X has distribution P, denoted X ~ P.
Probability Overview Cumulative distribution function (cdf): F_X(x) = P(X ≤ x). If X is discrete: probability mass function (pmf) p(x), with P(X = x) = p(x). If X is continuous: probability density function (pdf) p(x), with P(X ∈ A) = ∫_A p(x) dx.
Probability Overview Expected value of X: E[X] = ∫ x p(x) dx (continuous), E[X] = ∑_x x p(x) (discrete). Variance of X: Var(X) = E[(X − E[X])²] = E[X²] − E[X]².
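For the discrete case, these definitions translate directly into code. A small Python illustration (not from the slides; homework code is in Matlab), using a fair six-sided die as the example distribution:

```python
# Illustration (not from the slides): expectation and variance of a
# discrete random variable, computed directly from its pmf.
# Example distribution: a fair six-sided die, p(x) = 1/6 for x in 1..6.

def expectation(pmf):
    # E[X] = sum over x of x * p(x)
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    # Var(X) = E[X^2] - E[X]^2
    ex = expectation(pmf)
    ex2 = sum(x * x * p for x, p in pmf.items())
    return ex2 - ex ** 2

die = {x: 1 / 6 for x in range(1, 7)}
print(expectation(die))  # ≈ 3.5
print(variance(die))     # ≈ 35/12 ≈ 2.9167
```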
Probability Overview Independence: X and Y are independent iff P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) for all A, B. Covariance between X and Y: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. If X and Y are independent, then Cov(X, Y) = 0 and Var(X + Y) = Var(X) + Var(Y).
Probability Overview Conditional distribution: the distribution of X conditioned on Y = y; pdf: p(x|y) = p(x, y) / p(y). Joint distribution of X and Y; pdf: p(x, y). Marginal density of x: p(x) = ∫ p(x, y) dy.
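These relations are easy to check on a toy example. A Python sketch (the joint pmf below is made up purely for illustration):

```python
# Illustration (numbers made up): marginal and conditional distributions
# computed from a toy joint pmf p(x, y) over x, y in {0, 1}.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def marginal_x(x):
    # p(x) = sum over y of p(x, y)   (discrete analogue of the integral)
    return sum(p for (xx, _), p in joint.items() if xx == x)

def conditional_x_given_y(x, y):
    # p(x | y) = p(x, y) / p(y)
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return joint[(x, y)] / p_y

print(marginal_x(1))                # ≈ 0.7
print(conditional_x_given_y(1, 1))  # 0.4 / 0.6 ≈ 0.667
```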
Probability Inequalities Markov's inequality: if X ≥ 0, then P(X > t·E[X]) ≤ 1/t. Chebyshev's inequality: P(|X − E[X]| ≥ t·√Var(X)) ≤ 1/t².
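Both inequalities can be verified numerically. A Python sketch on simulated data (the distribution X = U₁ + U₂ and the choice t = 2 are arbitrary; here E[X] = 1 and Var(X) = 1/6):

```python
# Illustration: checking Markov's and Chebyshev's inequalities on
# simulated data. X = U1 + U2 with U1, U2 uniform on [0, 1], so
# X >= 0, E[X] = 1, Var(X) = 1/6.
import random

random.seed(0)
samples = [random.random() + random.random() for _ in range(100_000)]
n = len(samples)
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

t = 2.0
# Markov: P(X > t * E[X]) <= 1/t
markov_tail = sum(1 for x in samples if x > t * mean) / n
# Chebyshev: P(|X - E[X]| >= t * sqrt(Var(X))) <= 1/t^2
cheby_tail = sum(1 for x in samples if abs(x - mean) >= t * var ** 0.5) / n

print(markov_tail, "<=", 1 / t)
print(cheby_tail, "<=", 1 / t ** 2)
```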
Probability Inequalities Let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.), taking values in {0, 1}, with E[Xᵢ] = μ, and let X̄ₙ = (1/n) ∑ᵢ Xᵢ. Chernoff bound: for δ ∈ [0, 1], P(X̄ₙ > μ(1 + δ)) ≤ e^(−nμδ²/3) and P(X̄ₙ < μ(1 − δ)) ≤ e^(−nμδ²/2).
Probability Inequalities Let X₁, X₂, …, Xₙ be i.i.d., taking values in {0, 1}, with E[Xᵢ] = μ, and let X̄ₙ = (1/n) ∑ᵢ Xᵢ. Hoeffding bound: for δ ∈ [0, 1], P(X̄ₙ > μ + δ) ≤ e^(−2nδ²) and P(X̄ₙ < μ − δ) ≤ e^(−2nδ²).
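As a sanity check, one can simulate many runs of n coin flips and compare the empirical tail probability of the sample mean with the Hoeffding bound (Python sketch; the values of n, δ, and the trial count are arbitrary choices):

```python
# Illustration: empirical tail probability of the sample mean of n fair
# coin flips vs. the Hoeffding bound e^(-2 n delta^2).
# n, delta, and the number of trials are arbitrary choices.
import math
import random

random.seed(1)
mu, n, delta, trials = 0.5, 200, 0.1, 5000

exceed = 0
for _ in range(trials):
    # Sample mean of n fair-coin flips.
    xbar = sum(random.random() < mu for _ in range(n)) / n
    exceed += xbar > mu + delta
empirical = exceed / trials

bound = math.exp(-2 * n * delta ** 2)  # ≈ 0.018
print(empirical, "<=", bound)
```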
Onto new content
Point Estimation Goal: Estimate the bias of a coin. Why?? I came here to master deep learning
Point Estimation Given a coin: comes up heads (1) with probability p, tails (0) with probability 1 − p. Estimate p? Your idea: toss it a few times and see. What is the estimate? How many flips are needed?
Point Estimation A random variable X distributed according to D(θ). Given i.i.d. samples from D. Goal: estimate θ.
Point Estimation A random variable X distributed according to D(θ). Given i.i.d. samples from D. Goal: estimate θ. Three methods: Method of Moments (MoM), Maximum Likelihood Estimation (MLE), Bayesian Estimation.
Method of Moments Given a coin: comes up heads (1) with probability p, tails (0) with probability 1 − p. Estimate p? Idea: match the observed distribution to the true distribution. Moments: an elegant way to achieve this.
Method of Moments A random variable X distributed according to D(θ). Given i.i.d. samples from D. Goal: estimate θ. Moments of X: the k-th moment is E[Xᵏ].
Method of Moments Given a coin: comes up heads (1) with probability p, tails (0) with probability 1 − p. Estimate p? Idea: match the observed distribution to the true distribution. Moments: an elegant way to achieve this. All moments of our distribution are p (for X ∈ {0, 1}, E[Xᵏ] = E[X] = p for every k ≥ 1). What about moments of the observed data?
Method of Moments Given a coin: comes up heads (1) with probability p, tails (0) with probability 1 − p. Estimate p? Idea: match the observed distribution to the true distribution. Moments: an elegant way to achieve this. All moments of our distribution are p. All moments of the observed data are (1/n) ∑ᵢ xᵢ, the fraction of heads (since each xᵢ ∈ {0, 1}, xᵢᵏ = xᵢ). Matching moments gives the estimate p̂ = (1/n) ∑ᵢ xᵢ.
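A minimal sketch of this estimator (Python for illustration, though homeworks use Matlab; the true bias p = 0.3 and the sample size are arbitrary choices):

```python
# Illustration: the method-of-moments estimate for a coin is just the
# fraction of heads. The true bias p = 0.3 and n = 10,000 are arbitrary.
import random

random.seed(0)
p_true = 0.3
flips = [1 if random.random() < p_true else 0 for _ in range(10_000)]

# Every moment of a {0,1} random variable equals p, so matching the
# first moment gives p_hat = (1/n) * sum_i x_i.
p_hat = sum(flips) / len(flips)
print(p_hat)  # close to 0.3
```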
Method of Moments Given a coin: comes up heads (1) with probability p, tails (0) with probability 1 − p. Estimate p? How good is the estimate? How many samples (n) do we need?
Method of Moments How good is the estimate? Need a notion of error: mean squared error (MSE), E[(p̂ − p)²].
Method of Moments How good is the estimate? Need a notion of error: mean squared error (MSE). How many samples?
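One way to see how the error shrinks with the sample size is to estimate the MSE empirically. A Python sketch (the values of p, n, and the trial count are arbitrary choices):

```python
# Illustration: the empirical MSE of p_hat shrinks as n grows
# (for this estimator, roughly like p(1 - p) / n).
import random

random.seed(0)
p = 0.3

def empirical_mse(n, trials=2000):
    # Average of (p_hat - p)^2 over repeated experiments of n flips each.
    total = 0.0
    for _ in range(trials):
        p_hat = sum(random.random() < p for _ in range(n)) / n
        total += (p_hat - p) ** 2
    return total / trials

results = {n: empirical_mse(n) for n in (10, 100, 1000)}
for n, m in results.items():
    print(n, m)
```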
Point Estimation A random variable X distributed according to D(θ). Given i.i.d. samples from D. Goal: estimate θ. Your estimate: θ̂. Is the MSE always equal to the variance of θ̂?
Point Estimation In general, no: the MSE decomposes into bias and variance, E[(θ̂ − θ)²] = (E[θ̂] − θ)² + Var(θ̂). For an unbiased estimator, the MSE equals the variance.
Method of Moments Given a coin: comes up heads (1) with probability p, tails (0) with probability 1 − p. Estimate p? Idea: match the observed distribution to the true distribution; moments are an elegant way to achieve this. A natural approach: matching moments = solving a system of equations. The equations get messy pretty soon! Limited algorithmic tools, limited theory (e.g., what does the optimal classifier look like?).
Maximum Likelihood Estimation Given a coin: comes up heads (1) with probability p, tails (0) with probability 1 − p. Estimate p? Idea: find the p that is most likely to have generated the given data.
Maximum Likelihood Estimation A random variable X distributed according to D(θ). Given i.i.d. samples from D. Goal: estimate θ. Idea: output the θ̂ that is most likely to have generated the data.
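For the coin, this idea can be made concrete by maximizing the log-likelihood numerically. A Python sketch with made-up data (the grid search stands in for the calculus, which for a Bernoulli yields the closed form p̂ = fraction of heads):

```python
# Illustration: MLE for the coin by grid search over p. The data is made
# up (6 heads in 10 flips); the search recovers the closed-form answer
# p_hat = fraction of heads.
import math

flips = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

def log_likelihood(p, data):
    # log P(data | p) = sum_i [ x_i log p + (1 - x_i) log(1 - p) ]
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)
p_mle = max(grid, key=lambda q: log_likelihood(q, flips))
print(p_mle)  # 0.6, matching sum(flips) / len(flips)
```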