Social Media & Text Analysis


Social Media & Text Analysis, Lecture 5: Paraphrase Identification and Logistic Regression. CSE 5539-0010, Ohio State University. Instructor: Wei Xu. Website: socialmedia-class.org

In-class Presentation: pick your topic and sign up for a 10-minute presentation (20 points) - a Social Media Platform - or an NLP Researcher

Reading #6 & Quiz #3 socialmedia-class.org

Mini Research Proposal: propose/explore NLP problems in the GitHub dataset (https://github.com). GitHub: - a social network for programmers (sharing, collaboration, bug tracking, etc.) - hosts Git repositories (a version control system that tracks changes to files, usually source code) - contains potentially interesting text fields in natural language (comments, issues, etc.)

(Recap) What is Paraphrase? Sentences or phrases that convey approximately the same meaning using different words (Bhagat & Hovy, 2012). Examples at three levels: - word: wealthy / rich - phrase: the king's speech / His Majesty's address - sentence: the forced resignation of the CEO of Boeing, Harry Stonecipher, for ... / ... after Boeing Co. Chief Executive Harry Stonecipher was ousted from ...

The Ideal

(Recap) Paraphrase Research [timeline figure of paraphrase data sources, from WordNet in the '80s through the Web, novels, news, bi-text, video, style, Twitter, and Simple in the 2010s, annotated with work by Xu, Ritter, Dolan, Grishman, Cherry, Callison-Burch, Ji, and Napoles; * = my research] Wei Xu. Data-driven Approaches for Paraphrasing Across Language Variations. PhD Thesis (2014)

DIRT (Discovery of Inference Rules from Text): Lin and Pantel (2001) operationalize the Distributional Hypothesis using dependency relationships to define similar environments. "Duty" and "responsibility" share a similar set of dependency contexts in large volumes of text: modified by adjectives additional, administrative, assigned, assumed, collective, congressional, constitutional...; objects of verbs assert, assign, assume, attend to, avoid, become, breach... Dekang Lin and Patrick Pantel. DIRT - Discovery of Inference Rules from Text. In KDD (2001)
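
A minimal sketch of the idea (illustrative contexts and a Jaccard score, which is only a crude stand-in for DIRT's actual similarity measure over dependency paths):

# Words that occur in many of the same dependency contexts are taken to be similar.
duty_contexts = {"amod:additional", "amod:administrative", "amod:assigned",
                 "dobj:assume", "dobj:assign", "dobj:avoid"}
responsibility_contexts = {"amod:additional", "amod:collective", "amod:constitutional",
                           "dobj:assume", "dobj:assign", "dobj:breach"}

def context_similarity(a: set, b: set) -> float:
    """Jaccard overlap of two words' dependency-context sets."""
    return len(a & b) / len(a | b)

print(context_similarity(duty_contexts, responsibility_contexts))  # ~0.33: shared contexts suggest related meaning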

Bilingual Pivoting (word alignment): ... 5 farmers were thrown into jail in Ireland ... / ... fünf Landwirte festgenommen, weil ... / ... oder wurden festgenommen, gefoltert ... / ... or have been imprisoned, tortured ... Source: Chris Callison-Burch
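
A toy sketch of bilingual pivoting (made-up alignment pairs mirroring the example above, not the actual PPDB pipeline): English phrases aligned to the same foreign phrase are collected as paraphrase candidates.

from collections import defaultdict

# hypothetical (english_phrase, german_phrase) alignments
aligned = [
    ("thrown into jail", "festgenommen"),
    ("imprisoned", "festgenommen"),
    ("tortured", "gefoltert"),
]

by_pivot = defaultdict(set)
for en, de in aligned:
    by_pivot[de].add(en)

for de, en_phrases in by_pivot.items():
    if len(en_phrases) > 1:
        # English phrases sharing a foreign pivot are paraphrase candidates
        print(de, "->", sorted(en_phrases))  # festgenommen -> ['imprisoned', 'thrown into jail']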

Quiz #2 Key Limitations of PPDB?

Quiz #2: word sense. Paraphrases of "bug" span many different senses: - microbe, virus, bacterium, germ, parasite - insect, beetle, pest, mosquito, fly - bother, annoy, pester - microphone, tracker, mic, wire, earpiece, cookie - glitch, error, malfunction, fault, failure - squealer, snitch, rat, mole Source: Chris Callison-Burch

Another Key Limitation [same timeline figure of paraphrase data sources]: these resources contain only paraphrases, no non-paraphrases. * my research. Wei Xu. Data-driven Approaches for Paraphrasing Across Language Variations. PhD Thesis (2014)

Paraphrase Identification: obtain sentential paraphrases automatically. "Mancini has been sacked by Manchester City" / "Mancini gets the boot from Man City" - Yes! "WORLD OF JENKS IS ON AT 11" / "World of Jenks is my favorite show on tv" - No! (Meaningful) non-paraphrases are needed to train classifiers! Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. Extracting Lexically Divergent Paraphrases from Twitter. In TACL (2014)

Also Non-Paraphrases [same timeline figure of paraphrase data sources]: (meaningful) non-paraphrases are needed to train classifiers! * my research. Wei Xu. Data-driven Approaches for Paraphrasing Across Language Variations. PhD Thesis (2014)

News Paraphrase Corpus Microsoft Research Paraphrase Corpus also contains some non-paraphrases (Dolan, Quirk and Brockett, 2004; Dolan and Brockett, 2005; Brockett and Dolan, 2005)

Twitter Paraphrase Corpus also contains a lot of non-paraphrases Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. Extracting Lexically Divergent Paraphrases from Twitter In TACL (2014)

Paraphrase Identification: A Binary Classification Problem. Input: - a sentence pair x - a fixed set of binary classes Y = {0, 1}, where 0 = negative (non-paraphrase) and 1 = positive (paraphrase). Output: - a predicted class y ∈ Y (y = 0 or y = 1)

Classification Method: Supervised Machine Learning. Input: - a sentence pair x (represented by features) - a fixed set of binary classes Y = {0, 1} - a training set of m hand-labeled sentence pairs (x^(1), y^(1)), ..., (x^(m), y^(m)). Output: - a learned classifier γ: x → y ∈ Y (y = 0 or y = 1)

(Recap Week #3) Classification Method: Supervised Machine Learning Naïve Bayes Logistic Regression Support Vector Machines (SVM)

(Recap Week #3) Cons of Naïve Bayes: the features t_i are assumed independent given the class y: P(t_1, t_2, ..., t_n | y) = P(t_1 | y) · P(t_2 | y) · ... · P(t_n | y). This causes problems: - correlated features get double-counted as evidence - because parameters are estimated independently - which hurts the classifier's accuracy
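
A toy numeric illustration (made-up probabilities) of the double-counting problem: if a correlated feature is effectively observed twice, Naïve Bayes multiplies the same evidence in twice and becomes overconfident.

p_t_pos, p_t_neg = 0.9, 0.5   # assumed P(t | y=1) and P(t | y=0)
prior_pos = prior_neg = 0.5

def posterior_pos(n_copies: int) -> float:
    """P(y=1 | feature t observed n_copies times), under the independence assumption."""
    pos = prior_pos * p_t_pos ** n_copies
    neg = prior_neg * p_t_neg ** n_copies
    return pos / (pos + neg)

print(posterior_pos(1))  # ~0.64
print(posterior_pos(2))  # ~0.76 -- the same evidence counted twice inflates the confidence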

Classification Method: Supervised Machine Learning Naïve Bayes Logistic Regression Support Vector Machines (SVM)

Logistic Regression: one of the most useful supervised machine learning algorithms for classification! Generally high performance on a lot of problems. Much more robust than Naïve Bayes (better performance on various datasets).

Before Logistic Regression: let's start with something simpler!

Paraphrase Identification: Simplified Features. We use only one feature: - the number of words that the two sentences share in common
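
A minimal sketch of this single feature (simple whitespace tokenization; not necessarily the exact feature extractor used in the course):

def word_overlap(s1: str, s2: str) -> int:
    """Number of distinct word types the two sentences have in common."""
    return len(set(s1.lower().split()) & set(s2.lower().split()))

print(word_overlap("Mancini has been sacked by Manchester City",
                   "Mancini gets the boot from Man City"))  # -> 2 ("mancini", "city")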

A problem closely related to Paraphrase Identification: Semantic Textual Similarity. How similar (close in meaning) are two sentences? 5: completely equivalent in meaning; 4: mostly equivalent, but some unimportant details differ; 3: roughly equivalent, but some important information differs or is missing; 2: not equivalent, but share some details; 1: not equivalent, but on the same topic; 0: completely dissimilar

A Simpler Model: Linear Regression [scatter plot: Sentence Similarity (rated by human, 0-5) on the y-axis vs. #words in common (feature, 0-20) on the x-axis; a threshold on the predicted similarity turns the regression into a classifier: above the threshold = paraphrase, below = non-paraphrase] Linear regression is also supervised learning (learn from annotated data), but Regression predicts real-valued output (Classification predicts discrete-valued output).

Training Set
#words in common (x) | Sentence Similarity (y)
1  | 0
4  | 1
13 | 4
18 | 5
- m hand-labeled sentence pairs (x^(1), y^(1)), ..., (x^(m), y^(m))
- x's: input variable / features
- y's: output / target variable

(Recap Week #3) Supervised Machine Learning [diagram: training set → learning algorithm → hypothesis h (the learned model); input x (#words in common) → h → estimated output y (Sentence Similarity)] Source: NLTK Book

Linear Regression: Model Representation. How do we represent the hypothesis h? For linear regression with one variable: h_θ(x) = θ_0 + θ_1 x, which maps an input x to an estimated y.

Linear Regression w/ one variable: Model Representation
#words in common (x) | Sentence Similarity (y)
1  | 0
4  | 1
13 | 4
18 | 5
h_θ(x) = θ_0 + θ_1 x
- m hand-labeled sentence pairs (x^(1), y^(1)), ..., (x^(m), y^(m))
- θ's: parameters
How to choose the θ's?
Source: many of the following slides are adapted from Andrew Ng

Linear Regression w/ one variable: Model Representation [three example plots showing the line h_θ(x) for different choices of θ_0 and θ_1] (Andrew Ng)

Linear Regression w/ one variable: Cost Function [scatter plot: Sentence Similarity (rated by human) vs. #words in common (feature)] Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for the training examples (x, y). Squared error function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))², with the goal minimize_{θ_0, θ_1} J(θ_0, θ_1)

Linear Regression
Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost Function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)
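
A minimal sketch of the hypothesis and cost function on the toy training set above (the (x, y) pairs are taken from the #words-in-common table):

data = [(1, 0), (4, 1), (13, 4), (18, 5)]  # (x, y) = (#words in common, sentence similarity)

def h(theta0: float, theta1: float, x: float) -> float:
    """Hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * x

def cost(theta0: float, theta1: float) -> float:
    """Squared-error cost J(theta_0, theta_1) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(data)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in data) / (2 * m)

print(cost(0.0, 0.3))  # cost of one candidate parameter setting -> 0.0375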

Linear Regression Simplified
Hypothesis: h_θ(x) = θ_0 + θ_1 x, simplified to h_θ(x) = θ_1 x
Parameters: θ_0, θ_1, simplified to θ_1 only
Cost Function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))², simplified to J(θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1), simplified to minimize_{θ_1} J(θ_1)

h_θ(x) (for fixed θ_1, a function of x) vs. J(θ_1) (a function of the parameter θ_1): on the toy data (1, 1), (2, 2), (3, 3) with θ_1 = 1, J(1) = (1/(2·3)) [(1-1)² + (2-2)² + (3-3)²] = 0

With θ_1 = 0.5 on the same data, J(0.5) = (1/(2·3)) [(0.5-1)² + (1-2)² + (1.5-3)²] = 3.5/6 ≈ 0.58
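
A quick check of the two worked examples above (toy data as given):

data = [(1, 1), (2, 2), (3, 3)]

def J(theta1: float) -> float:
    """Cost for the simplified model h_theta(x) = theta_1 * x."""
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

print(J(1.0))  # -> 0.0
print(J(0.5))  # -> 0.5833... (about 0.58)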

Plotting J(θ_1) over a range of θ_1 values traces out a bowl-shaped curve; the goal minimize_{θ_1} J(θ_1) is reached at the bottom of the bowl (here θ_1 = 1).

h_θ(x) (for fixed θ_0, θ_1, a function of x) vs. J(θ_0, θ_1) (a function of the parameters θ_0, θ_1) [3-D surface plot of J over θ_0 and θ_1]

Back to the full model: hypothesis h_θ(x) = θ_0 + θ_1 x, parameters θ_0 and θ_1, cost function J(θ_0, θ_1), goal minimize_{θ_0, θ_1} J(θ_0, θ_1).

h_θ(x) (for fixed θ_0, θ_1, a function of x) vs. J(θ_0, θ_1) (a function of the parameters θ_0, θ_1) [contour plots: each candidate line h_θ(x) on the data corresponds to one point on the contour plot of J] (Andrew Ng)

Parameter Learning. Have some function J(θ_0, θ_1); want min_{θ_0, θ_1} J(θ_0, θ_1). Outline: - start with some θ_0, θ_1 - keep changing θ_0, θ_1 to reduce J(θ_0, θ_1) until we hopefully end up at a minimum

Gradient Descent [plot of the curve J(θ_1) over θ_1] minimize_{θ_1} J(θ_1)

Gradient Descent (one parameter): θ_1 := θ_1 - α (d/dθ_1) J(θ_1), where α is the learning rate; repeat to minimize_{θ_1} J(θ_1)

Gradient Descent [3-D surface plots of J(θ_0, θ_1): starting from different initial points, the descent walks downhill and may end up at different minima] minimize_{θ_0, θ_1} J(θ_0, θ_1) (Andrew Ng)

Gradient Descent: repeat until convergence { θ_j := θ_j - α ∂/∂θ_j J(θ_0, θ_1) } (simultaneous update for j = 0 and j = 1; α is the learning rate)

Linear Regression w/ one variable: Gradient Descent. With hypothesis h_θ(x) = θ_0 + θ_1 x and cost function J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))², what are ∂/∂θ_0 J(θ_0, θ_1) = ? and ∂/∂θ_1 J(θ_0, θ_1) = ?
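
One way to fill in this step (a standard chain-rule derivation; the next slide states the result):

\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
  = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^{m} \bigl(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\bigr)^2
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)

\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) \, x^{(i)}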

Linear Regression w/ one variable: Gradient Descent. Repeat until convergence {
θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))
θ_1 := θ_1 - α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) · x^(i)
} (simultaneously update θ_0 and θ_1)
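
A minimal sketch of batch gradient descent for this one-variable model (the toy data, learning rate, and iteration count are assumptions, not the course's settings):

data = [(1, 0), (4, 1), (13, 4), (18, 5)]  # (#words in common, sentence similarity)
alpha = 0.01                               # learning rate
theta0, theta1 = 0.0, 0.0

for step in range(5000):
    m = len(data)
    # gradients of J(theta_0, theta_1), computed over all training examples (batch update)
    grad0 = sum(theta0 + theta1 * x - y for x, y in data) / m
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in data) / m
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # roughly theta0 = -0.21, theta1 = 0.30 on this toy data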

The Linear Regression cost function J(θ_0, θ_1) is convex (bowl-shaped), so every local minimum is a global minimum [3-D surface plot of J over θ_0 and θ_1]

[animation over several slides: as gradient descent runs, the point (θ_0, θ_1) moves step by step toward the minimum of J(θ_0, θ_1) on the contour plot, and the corresponding line h_θ(x) fits the training data better and better] (Andrew Ng)

Batch Update Each step of gradient descent uses all the training examples

(Recap) Linear Regression [scatter plot: Sentence Similarity vs. #words in common, with a fitted line and a classification threshold: above the threshold = paraphrase, below = non-paraphrase] Linear regression is also supervised learning (learn from annotated data), but Regression predicts real-valued output (Classification predicts discrete-valued output).

(Recap) Linear Regression
Hypothesis: h_θ(x) = θ_0 + θ_1 x
Parameters: θ_0, θ_1
Cost Function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²
Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)
[3-D surface plot of J over θ_0 and θ_1]

(Recap) Gradient Descent: repeat until convergence { θ_j := θ_j - α ∂/∂θ_j J(θ_0, θ_1) } (simultaneous update for j = 0 and j = 1; α is the learning rate)

Next Class: - Logistic Regression (cont'd) socialmedia-class.org