Are you talking Bernoulli to me? Comparing methods of assessing word frequencies


1 Comparing methods of assessing word frequencies
Joint work with Panagiotis Papapetrou (*), Tanja Säily (**), Kai Puolamäki (*), Terttu Nevalainen (**), and Heikki Mannila (*)
(*) Department of Information and Computer Science, Aalto University
(**) Department of Modern Languages, University of Helsinki

2 Problem setting
- Given two corpora S and T
- Find all words that are significantly more frequent in S than in T, or vice versa
- Table: Word, Freq in S, Freq in T, with rows "mr" and "Total" (values missing)
- Is this statistically significant?

3 Motivation
Find differences between groups:
- Speaker groups of different ages: S = 2 3, T = 4 5
- Genres: S = newspaper, T = magazines
- Author gender: S = male, T = female
- Time periods: S = , T =

4 Data
- Text = sequence of words
- Corpus of Early English Correspondence (CEEC)
- Version with normalized spelling
- Compare periods and M words 3+ letters
- Preprocessing: remove punctuation from all words

5 Problem setting
Input:
- Two corpora: S and T
- A significance threshold: α (0 < α < 1)
Word q is significant at level α if and only if p ≤ α
- p is the probability, under the null hypothesis that q has equal mean frequency in S and T, of observing a difference at least as large as the one observed
- Two-tailed tests: test direction separately

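6 Log-likelihood ratio test (Dunning 1993)
- Assume all words are independent: bag-of-words model
- Significance test using a 2x2 table
$$\mathrm{Bin}(k, n, fr) = \binom{n}{k}\, fr^{k}\, (1 - fr)^{n-k}$$
$$\lambda = \frac{\mathrm{Bin}(k_S, n_S, fr_{S+T})\, \mathrm{Bin}(k_T, n_T, fr_{S+T})}{\mathrm{Bin}(k_S, n_S, fr_S)\, \mathrm{Bin}(k_T, n_T, fr_T)}$$
$$-2 \log \lambda \sim \chi^2$$
- Table: Word, Freq in S, Freq in T, with rows "mr" and "Total" (values missing)
- p =.4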
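The test on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenters' code: the function names are hypothetical, fr_S, fr_T and fr_{S+T} are taken to be the maximum-likelihood estimates k/n, and the p-value uses the asymptotic χ² distribution with one degree of freedom, whose tail probability equals erfc(sqrt(x/2)).

```python
import math

def log_binom_likelihood(k, n, fr):
    """Log of fr^k * (1 - fr)^(n - k).

    The binomial coefficient is left out because it appears in both the
    numerator and the denominator of lambda and cancels.
    """
    ll = 0.0
    # Treat 0 * log(0) as 0 so that k = 0 or k = n does not raise.
    if k > 0:
        ll += k * math.log(fr)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - fr)
    return ll

def log_likelihood_ratio_test(k_s, n_s, k_t, n_t):
    """Return (-2 log lambda, p) for a word seen k_s times in n_s tokens of S
    and k_t times in n_t tokens of T (hypothetical helper)."""
    fr_s = k_s / n_s
    fr_t = k_t / n_t
    fr_pooled = (k_s + k_t) / (n_s + n_t)  # assumed estimate of fr_{S+T}
    log_lambda = (log_binom_likelihood(k_s, n_s, fr_pooled)
                  + log_binom_likelihood(k_t, n_t, fr_pooled)
                  - log_binom_likelihood(k_s, n_s, fr_s)
                  - log_binom_likelihood(k_t, n_t, fr_t))
    g2 = max(0.0, -2.0 * log_lambda)  # clamp tiny negative rounding errors
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(g2 / 2.0))
    return g2, p
```

Calling `log_likelihood_ratio_test(k_s, n_s, k_t, n_t)` with the four counts from the 2x2 table returns the test statistic and its asymptotic p-value; the independence assumption of the bag-of-words model is built into the binomial likelihoods.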

7 Bag-of-words model (log-likelihood ratio test, χ²-test, Fisher's exact test, binomial test)
- Assume all words are independent
- However: texts have structure!
Why the bag-of-words model then?
- Mathematically simple
- Computationally efficient
Core questions:
- Can we provide more realistic models?
- Does it matter?

8 Many words are bursty
[Figure: histograms of number of texts against normalized frequency for the words "for" (n = 8792) and "I" (n = 86897). Data: British National Corpus, 449 texts]
- Frequency distribution differs per word
- Depends on frequency and word type

9 Proposed method 1: Inter-arrival times (Lijffijt et al. 2011)
Count space between consecutive occurrences (of "and"):
"Finnair believes that it will be able to resume its scheduled service to and from New York on Monday, after two days of cancellations caused by hurricane Irene. All three airports serving New York City have been closed because of the hurricane and Finnair was forced to cancel flights on Saturday and Sunday. The airline is not certain when its scheduled service can be resumed, but the assumption is that Monday afternoon's flight from Helsinki will depart. Some Finnair passengers whose final destination is not New York have been rerouted and some have delayed travel plans. The company has also offered ticket holders a refund." (YLE)
IAT(and) = {29, 9, 39, 29}
- Hypothesis: this captures the behavior pattern of words
- Altmann et al. 2009: predict word class based on IAT
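Counting inter-arrival times can be sketched as follows. The function name and the `wrap_around` flag are hypothetical, and two conventions are assumptions rather than something the slide states: an inter-arrival time is measured as the difference between token indices of consecutive occurrences, and the optional wrap-around treats the corpus as circular so that every occurrence contributes one value.

```python
def inter_arrival_times(tokens, word, wrap_around=False):
    """Return the gaps (in token positions) between consecutive
    occurrences of `word` in `tokens` (hypothetical helper)."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    if not positions:
        return []
    # Difference between each occurrence and the next one.
    iats = [b - a for a, b in zip(positions, positions[1:])]
    if wrap_around:
        # Assumed convention: close the circle from the last occurrence
        # back to the first, so the gaps sum to the corpus length.
        iats.append(positions[0] + len(tokens) - positions[-1])
    return iats
```

For a tokenized text, `inter_arrival_times(tokens, "and")` gives the empirical IAT sample for "and". The exact numbers depend on tokenization and on the counting convention, so they need not reproduce the values shown on the slide.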

10 Proposed method 1: Inter-arrival times (Lijffijt et al. 2011)
- Count space between consecutive occurrences: IAT distribution
Resampling of S and T:
1. Pick random first index
2. Sample random inter-arrival time
3. Repeat 2. until size of S exceeded
Produce random corpora: S_1, …, S_N and T_1, …, T_N
$$p_1 = \frac{1}{N} \sum_{i=1}^{N} I\big(\mathrm{freq}(q, S_i) \le \mathrm{freq}(q, T_i)\big)$$
$$p_2 = \frac{1 + N \cdot 2 \cdot \min(p_1, 1 - p_1)}{1 + N}$$
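The three resampling steps can be sketched as below. This is an illustration under assumptions, not the published procedure: the function names are hypothetical, inter-arrival times are drawn uniformly with replacement from the observed sample, and the random first index is drawn uniformly from the range of one sampled inter-arrival time, which the slide does not specify.

```python
import random

def resample_count(iats, corpus_size, rng):
    """Simulate one random corpus of `corpus_size` tokens and return how
    often the word occurs in it (hypothetical helper)."""
    # Step 1: pick a random first index (assumed: uniform within one IAT).
    position = rng.randrange(rng.choice(iats))
    count = 0
    # Steps 2 and 3: keep adding sampled inter-arrival times until the
    # position runs past the end of the corpus.
    while position < corpus_size:
        count += 1
        position += rng.choice(iats)
    return count

def resample_frequencies(iats, corpus_size, n_samples, seed=0):
    """Relative frequency of the word in each of `n_samples` random corpora."""
    rng = random.Random(seed)
    return [resample_count(iats, corpus_size, rng) / corpus_size
            for _ in range(n_samples)]
```

Running `resample_frequencies` once with the IATs and size of S and once with those of T yields the frequency pairs freq(q, S_i) and freq(q, T_i) that enter the p-value formulas above.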

11 Proposed method 2: Bootstrapping (Lijffijt et al. 2011)
- Resampling based on word frequency distribution
- Number of texts equal to number of texts in S
Produce random corpora: S_1, …, S_N and T_1, …, T_N
$$p_1 = \frac{1}{N} \sum_{i=1}^{N} I\big(\mathrm{freq}(q, S_i) \le \mathrm{freq}(q, T_i)\big)$$
$$p_2 = \frac{1 + N \cdot 2 \cdot \min(p_1, 1 - p_1)}{1 + N}$$
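A bootstrap over texts, combined with the two p-value formulas, might look like the sketch below. Several details are assumptions: the function names are hypothetical, whole texts are resampled with replacement, the frequency of a resampled corpus is the pooled count divided by the pooled size, and the indicator uses the comparison freq(q, S_i) ≤ freq(q, T_i).

```python
import random

def bootstrap_frequency(text_counts, text_sizes, rng):
    """Resample as many texts as the corpus has, with replacement, and
    return the pooled relative frequency of the word."""
    n_texts = len(text_counts)
    idx = [rng.randrange(n_texts) for _ in range(n_texts)]
    total_count = sum(text_counts[i] for i in idx)
    total_size = sum(text_sizes[i] for i in idx)
    return total_count / total_size

def bootstrap_p_values(counts_s, sizes_s, counts_t, sizes_t,
                       n_samples=1000, seed=0):
    """Return (p1, p2): the one-tailed estimate and the two-tailed,
    smoothed p-value (hypothetical helper)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        f_s = bootstrap_frequency(counts_s, sizes_s, rng)
        f_t = bootstrap_frequency(counts_t, sizes_t, rng)
        # Indicator I(freq(q, S_i) <= freq(q, T_i)); "<=" is an assumption.
        hits += f_s <= f_t
    p1 = hits / n_samples
    p2 = (1 + n_samples * 2 * min(p1, 1 - p1)) / (1 + n_samples)
    return p1, p2
```

Here `counts_s[i]` is the number of occurrences of the word in text i of S and `sizes_s[i]` is that text's length in tokens. Because of the added 1 in numerator and denominator, p2 can never be smaller than 1 / (1 + N).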

12 Comparison for "mr"
Table: Word, Freq in S, Freq in T, with rows "mr" and "Total" (values missing)
- p χ2 =.43
- p log-likelihood =.4
- p IAT =.747
- p bootstrap =.143
Maybe the difference is not so significant!

13 [Figure: percentage of texts against percentage of "mr" for the two corpora. Total freq : 1637 (.29 %); Total freq : 284 (.32 %)]

14 Experiments
Compare four methods:
- χ²-test
- Log-likelihood ratio test
- Inter-arrival time test
- Bootstrap test
Compute p-values for all words
Hypothesis: the mean frequencies in the two periods are equal
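Of the four methods, the χ²-test is the only one not spelled out on the slides. A minimal sketch of Pearson's χ²-test on the 2x2 table (word vs. all other tokens, S vs. T) is given here for comparison. The function name is hypothetical, no continuity correction is applied, and the p-value again relies on the one-degree-of-freedom tail erfc(sqrt(x/2)).

```python
import math

def chi_squared_test(k_s, n_s, k_t, n_t):
    """Pearson chi-squared statistic and p-value for a word seen k_s times
    in n_s tokens of S and k_t times in n_t tokens of T.

    Assumes the word occurs at least once overall and is not the only
    word type, so that no expected cell count is zero.
    """
    observed = [[k_s, n_s - k_s], [k_t, n_t - k_t]]
    row_totals = [n_s, n_t]
    col_totals = [k_s + k_t, (n_s - k_s) + (n_t - k_t)]
    total = n_s + n_t
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of word and corpus.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p
```

Like the log-likelihood ratio test, this treats every token as an independent draw, so it shares the bag-of-words assumption that the inter-arrival time and bootstrap tests are meant to relax.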

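15 [Figure: scatter plot of p-values, log-likelihood ratio test vs. inter-arrival time test]

16 [Figure: scatter plot of p-values, log-likelihood ratio test vs. bootstrap test]

17 [Figure: scatter plot of p-values, log-likelihood ratio test vs. χ²-test]

18 [Figure: scatter plot of p-values, inter-arrival time test vs. χ²-test]

19 [Figure: scatter plot of p-values, bootstrap test vs. χ²-test]

20 [Figure: scatter plot of p-values, bootstrap test vs. inter-arrival time test]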

21 Examples of words with > 1-fold difference
Table columns: Word, %, %, p LL, p Boot (values missing)
Words: him, we, mr, horse, prince, goods, patent, pound, merchant, li

22 Conclusion
- Bag-of-words model poorly represents frequency distributions
- New methods: inter-arrival times and bootstrap method (Lijffijt et al. 2011)
- Take into account burstiness (or dispersion) of words
- More conservative p-values
Not covered in presentation:
- Correction for multiple hypotheses
Future work:
- More in-depth analysis of differences between methods
- Dependency between words

23 References
- Altmann, E.G., Pierrehumbert, J.B., Motter, A.E. (2009). Beyond word frequency: Bursts, lulls, and scaling in the temporal distribution of words. PLoS ONE, 4(11):e7678.
- Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19.
- Lijffijt, J., Papapetrou, P., Puolamäki, K., Mannila, H. (2011). Analyzing Word Frequencies in Large Text Corpora using Inter-arrival Times and Bootstrapping. In ECML PKDD 2011, Part II.

24 Most frequent significant words
Table columns: Word, %, %, p LL, p Boot (the % values are missing)
my <.1 .1
that <.1 .1
your <.1 .1
it <.1 .1
is <.1 .1
and <.1 .1
with <.1 .1
but <.1 .1
in <
words significant at α =.5 in all four methods


Probability 5-4 The Multiplication Rules and Conditional Probability Outline Lecture 8 5-1 Introduction 5-2 Sample Spaces and 5-3 The Addition Rules for 5-4 The Multiplication Rules and Conditional 5-11 Introduction 5-11 Introduction as a general concept can be defined

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information

More information

10.2: The Chi Square Test for Goodness of Fit

10.2: The Chi Square Test for Goodness of Fit 10.2: The Chi Square Test for Goodness of Fit We can perform a hypothesis test to determine whether the distribution of a single categorical variable is following a proposed distribution. We call this

More information

Text mining and natural language analysis. Jefrey Lijffijt

Text mining and natural language analysis. Jefrey Lijffijt Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably

More information

Travel and Transportation

Travel and Transportation A) Locations: B) Time 1) in the front 4) by the window 7) earliest 2) in the middle 5) on the isle 8) first 3) in the back 6) by the door 9) next 10) last 11) latest B) Actions: 1) Get on 2) Get off 3)

More information

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Lecture No. # 36 Sampling Distribution and Parameter Estimation

More information

STATISTICS SYLLABUS UNIT I

STATISTICS SYLLABUS UNIT I STATISTICS SYLLABUS UNIT I (Probability Theory) Definition Classical and axiomatic approaches.laws of total and compound probability, conditional probability, Bayes Theorem. Random variable and its distribution

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

Example. χ 2 = Continued on the next page. All cells

Example. χ 2 = Continued on the next page. All cells Section 11.1 Chi Square Statistic k Categories 1 st 2 nd 3 rd k th Total Observed Frequencies O 1 O 2 O 3 O k n Expected Frequencies E 1 E 2 E 3 E k n O 1 + O 2 + O 3 + + O k = n E 1 + E 2 + E 3 + + E

More information

Lecture 28 Chi-Square Analysis

Lecture 28 Chi-Square Analysis Lecture 28 STAT 225 Introduction to Probability Models April 23, 2014 Whitney Huang Purdue University 28.1 χ 2 test for For a given contingency table, we want to test if two have a relationship or not

More information

Lesson 7.9 = ( _ + _ ) + _ = ( _ ) + _ = _ ALGEBRA

Lesson 7.9 = ( _ + _ ) + _ = ( _ ) + _ = _ ALGEBRA Name Fractions and Properties of Addition Essential Question How can you add fractions with like denominators using the properties of addition? connect The Associative and Commutative Properties of Addition

More information

Mark Scheme (Results) June 2008

Mark Scheme (Results) June 2008 Mark Scheme (Results) June 008 GCE GCE Mathematics (669101) Edexcel Limited. Registered in England and Wales No. 4496750 Registered Office: One90 High Holborn, London WC1V 7BH 1 June 008 6691 Statistics

More information

Path Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis

Path Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis Path Analysis PRE 906: Structural Equation Modeling Lecture #5 February 18, 2015 PRE 906, SEM: Lecture 5 - Path Analysis Key Questions for Today s Lecture What distinguishes path models from multivariate

More information

HW Solution 5 Due: Sep 13

HW Solution 5 Due: Sep 13 ECS 315: Probability and Random Processes 01/1 HW Solution 5 Due: Sep 13 Lecturer: Prapun Suksompong, Ph.D. Instructions (a) ONE part of a question will be graded (5 pt). Of course, you do not know which

More information

Sequential Analysis & Testing Multiple Hypotheses,

Sequential Analysis & Testing Multiple Hypotheses, Sequential Analysis & Testing Multiple Hypotheses, CS57300 - Data Mining Spring 2016 Instructor: Bruno Ribeiro 2016 Bruno Ribeiro This Class: Sequential Analysis Testing Multiple Hypotheses Nonparametric

More information

Basic Verification Concepts

Basic Verification Concepts Basic Verification Concepts Barbara Brown National Center for Atmospheric Research Boulder Colorado USA bgb@ucar.edu Basic concepts - outline What is verification? Why verify? Identifying verification

More information

A Bayesian mixture model for term re-occurrence and burstiness

A Bayesian mixture model for term re-occurrence and burstiness A Bayesian mixture model for term re-occurrence and burstiness Avik Sarkar 1, Paul H Garthwaite 2, Anne De Roeck 1 1 Department of Computing, 2 Department of Statistics The Open University Milton Keynes,

More information

Time: 1 hour 30 minutes

Time: 1 hour 30 minutes Paper Reference(s) 6684/0 Edexcel GCE Statistics S Silver Level S Time: hour 30 minutes Materials required for examination papers Mathematical Formulae (Green) Items included with question Nil Candidates

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information