Deconstructing Data Science

Similar documents
Applied Natural Language Processing

Deconstructing Data Science

Natural Language Processing

Learning From Crowds. Presented by: Bei Peng 03/24/15

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

Machine Learning for NLP

Multiclass Multilabel Classification with More Classes than Examples

Essence of Machine Learning (and Deep Learning) Hoa M. Le Data Science Lab, HUST hoamle.github.io

CS 188: Artificial Intelligence Fall 2008

General Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Overfitting. Example: OCR. Example: Spam Filtering. Example: Spam Filtering

Lecture 5 Neural models for NLP

Nonlinear Classification

The Perceptron. Volker Tresp Summer 2018

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Intelligent Systems (AI-2)

Classification: Analyzing Sentiment

Machine Learning. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 8 May 2012

Lecture 3: Multiclass Classification

CS 188: Artificial Intelligence. Outline

9 Classification. 9.1 Linear Classifiers

Randomized Decision Trees

The Perceptron. Volker Tresp Summer 2014

Introduction. Chapter 1

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Topics in Natural Language Processing

Linear and Logistic Regression. Dr. Xiaowei Huang

Prediction of Citations for Academic Papers

Machine Learning for natural language processing

CS 188: Artificial Intelligence Spring Today

Classification objectives COMS 4771

The Perceptron. Volker Tresp Summer 2016

Machine Learning (CS 567) Lecture 2

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

Learning: Binary Perceptron. Examples: Perceptron. Separable Case. In the space of feature vectors

More Smoothing, Tuning, and Evaluation

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

CS 188: Artificial Intelligence. Machine Learning

Tufts COMP 135: Introduction to Machine Learning

MLE/MAP + Naïve Bayes

Lecture 4 Discriminant Analysis, k-nearest Neighbors

ML4NLP Multiclass Classification

Be able to define the following terms and answer basic questions about them:

Bayes Rule. CS789: Machine Learning and Neural Network Bayesian learning. A Side Note on Probability. What will we learn in this lecture?

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

18.9 SUPPORT VECTOR MACHINES

Machine Learning Basics

A1101, Lab 11: Galaxies and Rotation Lab Worksheet

ECE521 Lecture7. Logistic Regression

SF2930 Regression Analysis

Regression Adjustment with Artificial Neural Networks

Bayesian Learning (II)

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

CS 188: Artificial Intelligence Fall 2011

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

CS Machine Learning Qualifying Exam

Errors, and What to Do. CS 188: Artificial Intelligence Fall What to Do About Errors. Later On. Some (Simplified) Biology

Classification, Linear Models, Naïve Bayes

CSCI-567: Machine Learning (Spring 2019)

Introduction to Support Vector Machines

The Benefits of a Model of Annotation

Introduction to Machine Learning Spring 2018 Note 18

CS 188: Artificial Intelligence Spring Announcements

Latent Dirichlet Allocation Introduction/Overview

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

CS 6375 Machine Learning

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

What Causality Is (stats for mathematicians)

Introduction to Machine Learning Midterm Exam Solutions

Modern Information Retrieval

Introduction to Machine Learning Midterm Exam

Variance Reduction and Ensemble Methods

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Machine Learning

Lecture 7: DecisionTrees

The Naïve Bayes Classifier. Machine Learning Fall 2017

Machine Learning, Fall 2009: Midterm

Evaluation Strategies

IFT Lecture 7 Elements of statistical learning theory

Classification and Pattern Recognition

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Hidden Markov Models

Stat 587: Key points and formulae Week 15

Decision Tree Learning Lecture 2

Logistic Regression & Neural Networks

CPSC 340: Machine Learning and Data Mining

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Bayesian Networks (Part I)

Transcription:

Deconstructing Data Science. David Bamman, UC Berkeley, Info 290. Lecture 3: Classification overview. Jan 24, 2017.

Auditors: Send me an email to get access to bCourses (announcements, readings, etc.).

Classification: a mapping h from input data x (drawn from instance space X) to a label (or labels) y from some enumerable output space Y.

X = the set of all skyscrapers
Y = {art deco, neo-gothic, modern}
x = the Empire State Building
y = art deco

Recognizing a Classification Problem
Can you formulate your question as a choice among some universe of possible classes?
Can you create (or find) labeled data that marks that choice for a bunch of examples? Can you make that choice?
Can you create features that might help in distinguishing those classes?

The Celestial Emporium of Benevolent Knowledge, from Borges (1942):
1. Those that belong to the emperor
2. Embalmed ones
3. Those that are trained
4. Suckling pigs
5. Mermaids (or Sirens)
6. Fabulous ones
7. Stray dogs
8. Those that are included in this classification
9. Those that tremble as if they were mad
10. Innumerable ones
11. Those drawn with a very fine camel hair brush
12. Et cetera
13. Those that have just broken the flower vase
14. Those that, at a distance, resemble flies

"Conceptually, the most interesting aspect of this classification system is that it does not exist. Certain types of categorizations may appear in the imagination of poets, but they are never found in the practical or linguistic classes of organisms or of man-made objects used by any of the cultures of the world."
Eleanor Rosch (1978), "Principles of Categorization"

Interannotator agreement

                         annotator B
                     puppy   fried chicken
annotator A
  puppy                6          3
  fried chicken        2          5

observed agreement = 11/16 = 68.75%

https://twitter.com/teenybiscuit/status/705232709220769792/photo/1

Cohen's kappa
If classes are imbalanced, we can get high interannotator agreement simply by chance:

                         annotator B
                     puppy   fried chicken
annotator A
  puppy                7          4
  fried chicken        8         81

$\kappa = \dfrac{p_o - p_e}{1 - p_e} = \dfrac{0.88 - p_e}{1 - p_e}$

where the observed agreement is $p_o = (7 + 81)/100 = 0.88$.

Cohen's kappa
The expected probability of agreement is how often we would expect two annotators to agree, assuming independent annotations:

$p_e = P(A = \text{puppy}, B = \text{puppy}) + P(A = \text{chicken}, B = \text{chicken})$
$\phantom{p_e} = P(A = \text{puppy})\,P(B = \text{puppy}) + P(A = \text{chicken})\,P(B = \text{chicken})$

Cohen's kappa
From the marginals of the table above: P(A = puppy) = 11/100 = 0.11, P(B = puppy) = 15/100 = 0.15, P(A = chicken) = 89/100 = 0.89, P(B = chicken) = 85/100 = 0.85, so

$p_e = 0.11 \times 0.15 + 0.89 \times 0.85 = 0.773$

Cohen's kappa

$\kappa = \dfrac{p_o - p_e}{1 - p_e} = \dfrac{0.88 - 0.773}{1 - 0.773} = 0.471$
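To make the arithmetic concrete, here is a minimal Python sketch of the computation above (the function name and the rows = annotator A / columns = annotator B layout are this sketch's conventions):

    def cohens_kappa(table):
        """table[i][j] = number of items annotator A labeled i and annotator B labeled j."""
        n = sum(sum(row) for row in table)
        # observed agreement: fraction of items on the diagonal
        p_o = sum(table[i][i] for i in range(len(table))) / n
        # expected agreement under independence: product of the two annotators' marginals
        a_marginals = [sum(row) / n for row in table]         # P(A = class i)
        b_marginals = [sum(col) / n for col in zip(*table)]   # P(B = class j)
        p_e = sum(pa * pb for pa, pb in zip(a_marginals, b_marginals))
        return (p_o - p_e) / (1 - p_e)

    print(cohens_kappa([[7, 4], [8, 81]]))   # 0.471..., matching the slide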

Cohen's kappa
"Good" values are subject to interpretation, but a common rule of thumb:

0.80-1.00   Very good agreement
0.60-0.80   Good agreement
0.40-0.60   Moderate agreement
0.20-0.40   Fair agreement
< 0.20      Poor agreement

                         annotator B
                     puppy   fried chicken
annotator A
  puppy                0          0
  fried chicken        0        100

                         annotator B
                     puppy   fried chicken
annotator A
  puppy               50          0
  fried chicken        0         50
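Running the sketch above on these two tables makes the edge cases explicit: perfect agreement on perfectly balanced classes yields kappa = 1, while the all-fried-chicken table has p_o = p_e = 1, so kappa is undefined (0/0):

    cohens_kappa([[50, 0], [0, 50]])   # p_o = 1.0, p_e = 0.5 -> kappa = 1.0
    cohens_kappa([[0, 0], [0, 100]])   # p_o = p_e = 1.0 -> ZeroDivisionError (kappa undefined)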

Interannotator agreement
Cohen's kappa can be used for any number of classes, but it still requires two annotators who evaluate the same items. Fleiss' kappa generalizes to multiple annotators, each of whom may evaluate different items (e.g., in crowdsourcing).
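In practice one would reach for library implementations; a hedged sketch (cohen_kappa_score and fleiss_kappa are real functions in scikit-learn and statsmodels, but the toy annotations below are invented for illustration, with 0 = puppy, 1 = fried chicken):

    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    a = [0, 1, 0, 0, 1, 1]
    b = [0, 1, 1, 0, 1, 0]
    print(cohen_kappa_score(a, b))          # two annotators, same items

    ratings = [[0, 0, 1],                   # rows = items, columns = raters
               [1, 1, 1],
               [0, 1, 0],
               [0, 0, 0]]
    table, _ = aggregate_raters(ratings)    # counts of each category per item
    print(fleiss_kappa(table))              # multiple annotators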

Classification problems

Classification
Deep learning
Decision trees
Probabilistic graphical models
Random forests
Logistic regression
Support vector machines
Neural networks
Perceptron

Evaluation
For all supervised problems, it's important to understand how well your model is performing. What we try to estimate is how well you will perform in the future, on new data also drawn from X. Trouble arises when the training data <x, y> you have does not characterize the full instance space:

n is small
sampling bias in the selection of <x, y>
x is dependent on time
y is dependent on time (concept drift)

Drift
http://fivethirtyeight.com/features/the-end-of-a-republican-party/

[Figure: the instance space X with the region of labeled data]

[Figure: the instance space X split into train and test regions]

Train/Test split
To estimate performance on future unseen data, train a model on 80% and test that trained model on the remaining 20%. What can go wrong here?

[Figure: the instance space X split into train, dev, and test regions]

Experiment design

           training          development        testing
size       80%               10%                10%
purpose    training models   model selection    evaluation; never look at it until the very end
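A minimal sketch of this 80/10/10 protocol (the function name and seed are illustrative; a plain random shuffle assumes the examples are independent of time, which is exactly what the drift caveat above warns about):

    import random

    def train_dev_test_split(data, seed=13):
        """Shuffle labeled examples and split 80/10/10."""
        data = list(data)
        random.Random(seed).shuffle(data)
        n_train = int(0.8 * len(data))
        n_dev = int(0.1 * len(data))
        train = data[:n_train]                  # for training models
        dev = data[n_train:n_train + n_dev]     # for model selection
        test = data[n_train + n_dev:]           # never look at it until the very end
        return train, dev, test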

Binary classification
Binary classification: |Y| = 2 [one out of 2 labels applies to a given x]

X: image;  Y: {puppy, fried chicken}

https://twitter.com/teenybiscuit/status/705232709220769792/photo/1

Accuracy

$\text{accuracy} = \dfrac{\text{number correctly predicted}}{N} = \dfrac{1}{N} \sum_{i=1}^{N} I[\hat{y}_i = y_i], \quad I[x] = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{otherwise} \end{cases}$

Accuracy is perhaps the most intuitive single statistic when the numbers of positive and negative instances are comparable.

Confusion matrix

                      Predicted (ŷ)
                  positive    negative
True (y)
  positive        correct
  negative                    correct

(cells on the diagonal are correct predictions)

Confusion matrix
Accuracy = 99.3%

                      Predicted (ŷ)
                  positive    negative
True (y)
  positive            48          70
  negative             0      10,347

Sensitivity
Sensitivity: the proportion of true positives actually predicted to be positive (e.g., the sensitivity of mammograms is the proportion of people with cancer whom they identify as having cancer). Also known as positive recall or the true positive rate.

$\text{sensitivity} = \dfrac{\sum_{i=1}^{N} I(y_i = \hat{y}_i = \text{pos})}{\sum_{i=1}^{N} I(y_i = \text{pos})}$

Specificity
Specificity: the proportion of true negatives actually predicted to be negative (e.g., the specificity of mammograms is the proportion of people without cancer whom they identify as not having cancer). Also known as the true negative rate.

$\text{specificity} = \dfrac{\sum_{i=1}^{N} I(y_i = \hat{y}_i = \text{neg})}{\sum_{i=1}^{N} I(y_i = \text{neg})}$

Precision
Precision: the proportion of the predicted class that is actually that class. That is, if a class prediction is made, should you trust it?

$\text{Precision}(\text{pos}) = \dfrac{\sum_{i=1}^{N} I(y_i = \hat{y}_i = \text{pos})}{\sum_{i=1}^{N} I(\hat{y}_i = \text{pos})}$
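These statistics reduce to simple ratios over the cells of the confusion matrix; a sketch using the counts from the matrix above (the tp/fn/fp/tn names are this sketch's convention):

    tp, fn = 48, 70        # true class positive: predicted positive / negative
    fp, tn = 0, 10_347     # true class negative: predicted positive / negative

    accuracy    = (tp + tn) / (tp + fn + fp + tn)   # 0.993
    sensitivity = tp / (tp + fn)                    # positive recall: ~0.407
    specificity = tn / (tn + fp)                    # true negative rate: 1.0
    precision   = tp / (tp + fp)                    # trust in positive predictions: 1.0

Note that on this imbalanced data accuracy is 99.3% while sensitivity is only about 0.41, which is why the next slide insists on baselines.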

Baselines
No metric (accuracy, precision, sensitivity, etc.) is meaningful unless contextualized against a baseline:

random guessing / predicting the majority class (50% with balanced classes; imbalanced classes can be much higher)
simpler methods (e.g., in election forecasting)

Scores
Binary classification results in a categorical decision (+1/-1), but often through some intermediary score or probability, e.g., the perceptron decision rule:

$\hat{y} = \begin{cases} 1 & \text{if } \sum_{i=1}^{F} x_i \beta_i \ge 0 \\ 0 & \text{otherwise} \end{cases}$

Scores
The most intuitive scores are probabilities:

P(x = pos) = 0.74
P(x = neg) = 0.26
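A sketch of that thresholding rule (the feature values and weights beta below are made up for illustration):

    def predict(x, beta):
        """Perceptron-style decision: threshold the linear score at zero."""
        score = sum(x_i * b_i for x_i, b_i in zip(x, beta))   # sum_{i=1}^F x_i * beta_i
        return 1 if score >= 0 else 0

    print(predict([1.0, 0.0, 2.0], beta=[0.5, -1.2, 0.3]))    # score = 1.1 -> 1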

Multilabel Classification
Multilabel classification: |y| > 1 [multiple labels apply to a given x]

task: image tagging;  X: image;  Y: {fun, B&W, color, ocean, ...}

Multilabel Classification
For label space Y, we can view this as |Y| binary classification problems, where y_j and y_k may be dependent (e.g., what's the relationship between y_2 and y_3?):

y_1  fun    0
y_2  B&W    0
y_3  color  1
y_5  sepia  0
y_6  ocean  1
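A sketch of this reduction, training one independent binary classifier per label; LogisticRegression is an illustrative choice of base classifier (not something the slide specifies), and treating labels independently is exactly the simplification the slide questions:

    from sklearn.linear_model import LogisticRegression

    LABELS = ["fun", "B&W", "color", "sepia", "ocean"]

    def fit_multilabel(X, Y):
        """X: feature matrix; Y: 0/1 matrix with one column per label in LABELS."""
        return {label: LogisticRegression().fit(X, [row[j] for row in Y])
                for j, label in enumerate(LABELS)}

    def predict_multilabel(models, x):
        """Return every label whose binary classifier fires on feature vector x."""
        return [label for label, m in models.items() if m.predict([x])[0] == 1]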

Multiclass Classification
Multiclass classification: |Y| > 2 [one out of N labels applies to a given x]

task                     X      Y
authorship attribution   text   {J.K. Rowling, James Joyce, ...}
genre classification     song   {hip-hop, classical, pop, ...}

Multiclass confusion matrix

                        Predicted (ŷ)
                 Democrat   Republican   Independent
True (y)
  Democrat          100          2            15
  Republican          0        104            30
  Independent        30         40            70

Precision
Precision: the proportion of the predicted class that is actually that class (here, for the confusion matrix above):

$\text{Precision}(\text{dem}) = \dfrac{\sum_{i=1}^{N} I(y_i = \hat{y}_i = \text{dem})}{\sum_{i=1}^{N} I(\hat{y}_i = \text{dem})}$

Recall
Recall is generalized sensitivity: the proportion of the true class actually predicted to be that class.

$\text{Recall}(\text{dem}) = \dfrac{\sum_{i=1}^{N} I(y_i = \hat{y}_i = \text{dem})}{\sum_{i=1}^{N} I(y_i = \text{dem})}$

For the multiclass confusion matrix above:

             Democrat   Republican   Independent
Precision    0.769      0.712        0.609
Recall       0.855      0.776        0.500
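A sketch that recovers the per-class numbers above directly from the confusion matrix (rows = true class, columns = predicted class):

    classes = ["Democrat", "Republican", "Independent"]
    conf = [[100,   2,  15],    # true Democrat
            [  0, 104,  30],    # true Republican
            [ 30,  40,  70]]    # true Independent

    for k, label in enumerate(classes):
        precision = conf[k][k] / sum(conf[i][k] for i in range(len(classes)))  # column sum
        recall    = conf[k][k] / sum(conf[k])                                  # row sum
        print(f"{label:12s} precision={precision:.3f} recall={recall:.3f}")
    # Democrat     precision=0.769 recall=0.855
    # Republican   precision=0.712 recall=0.776
    # Independent  precision=0.609 recall=0.500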

Computational Social Science
Lazer et al. (2009), "Computational Social Science," Science.
Grimmer (2015), "We Are All Social Scientists Now: How Big Data, Machine Learning, and Causal Inference Work Together," APSA.

Computational Social Science
Unprecedented amounts of born-digital (and digitized) information about human behavior:

voting records of politicians
online social network interactions
census data
expressions of opinion (blogs, social media)
search queries

Project ideas: enhancing our understanding of individuals and collectives.

Computational Social Science
How are people-as-data different from other forms of data (e.g., physical, natural, or biological objects)?

Computational Social Science
Computational social science draws on long traditions and rich methodologies in experimental design, sampling bias, and causal inference. Accurate inference requires thoughtful measurement. All methods have assumptions; part of scholarship is arguing where and when those assumptions are OK. Science requires replicability: assume your work will be replicated and document accordingly.