Are you talking Bernoulli to me? Comparing methods of assessing word frequencies
|
|
- Owen Ford
- 5 years ago
- Views:
Transcription
1 Comparing methods of assessing word frequencies * Joint work with Panagiotis Papapetrou *, Tanja Säily **, Kai Puolamäki *, Terttu Nevalainen **, and Heikki Mannila * * Department of Information and Computer Science, Aalto University ** Department of Modern Languages, University of Helsinki
2 Problem setting Given two corpora S and T Find all words that are significantly more frequent in S than in T, or vice versa Word Freq in S Freq in T mr Total Is this statistically significant? 2
3 Motivation Find differences between groups Speaker groups of different ages S = 2 3, T = 4 5 Genres S = newspaper, T = magazines Author gender S = male, T = female Time periods S = , T =
4 Data Text = sequence of words Corpus of Early English Correspondence (CEEC) Version with normalized spelling Compare periods and M words 3+ letters Preprocessing: remove punctuation from all words 4
5 Problem setting Input: Two corpora: S and T A significance threshold: α ( < α < 1) Word q is significant at level α if and only if p α p gives the probability for S and T having equal means Two-tailed tests: test direction separately 5
6 Log-likelihood ratio test (Dunning 1993) Assume all words are independent Bag-of-words model Significance test using 2x2 table Bin(k,n, fr) = n fr k 1 fr k ( ) n k λ = Bin(k S,n S, fr S+T ) Bin(k T,n T, fr S+T ) Bin(k S,n S, fr S ) Bin(k T,n T, fr T ) Word Freq in S Freq in T mr Total log λ ~ χ 2 p =.4 6
7 Bag-of-words model (log-likelihood ratio test, χ 2 -test, Fisher s exact test, binomial test) Assume all words are independent However: texts have structure! Why the bag-of-words model then? Mathematically simple Computationally efficient Core questions: Can we provide more realistic models? Does it matter? 7
8 Many words are bursty for (n = 8792) Data: British National Corpus, 449 texts I (n = 86897) Number of texts Normalized frequency Normalized frequency Frequency distribution differs per word Depends on frequency and word type 8
9 Proposed method 1: Inter-arrival times (Lijffijt et al. 211) Count space between consecutive occurrences (of and) Finnair believes that it will be able to resume its scheduled service to and from New York on Monday, after two days of cancellations caused by hurricane Irene. All three airports serving New York City have been closed because of the hurricane and Finnair was forced to cancel flights on Saturday and Sunday. The airline is not certain when its scheduled service can be resumed, but the assumption is that Monday afternoon's flight from Helsinki will depart. Some Finnair passengers whose final destination is not New York have been rerouted and some have delayed travel plans. The company has also offered ticket holders a refund. YLE IAT and = {29, 9, 39, 29} Hypothesis: this captures the behavior pattern of words Altmann et al. 29: Predict word class based on IAT 9
10 Proposed method 1: Inter-arrival times (Lijffijt et al. 211) Count space between consecutive occurrences IAT distribution Resampling of S and T 1. Pick random first index 2. Sample random inter-arrival time 3. Repeat 2. until size of S exceeded X X X S Produce random corpora: S 1,, S N and T 1,, T N N I( freq(q,s i ) freq(q,t i )) p p 2 = 1+ N 2 min(p 1,1 p 1 ) 1 = i=1 N 1+ N 1
11 Proposed method 2: Bootstrapping (Lijffijt et al. 211) Resampling based on word frequency distribution Number of texts equal to number of texts in S Produce random corpora: S 1,, S N and T 1,, T N p 1 = N i=1 I( freq(q,s i ) freq(q,t i )) N p 2 = 1+ N 2 min(p 1,1 p 1 ) 1+ N 11
12 Comparison for mr Word Freq in S Freq in T mr Total p χ2 =.43 p log-likelihood =.4 p IAT =.747 p bootstrap =.143 Maybe the difference is not so significant! 12
13 % texts 6 4 Total freq : 1637 (.29 %) Total freq : 284 (.32 %) % mr
14 Experiments Compare four methods χ 2 -test Log-likelihood ratio test Inter-arrival time test Bootstrap test Compute p-values for all words Hypothesis: mean frequency in and are equal 14
15 1.8 p value Log likelihood p value Inter arrival time
16 1.8 p value Log likelihood p value Bootstrap
17 1.8 p value Log likelihood p value Chi squared
18 1.8 p value Inter arrival time p value Chi squared
19 1.8 p value Bootstrap p value Chi squared
20 1.8 p value Bootstrap p value Inter arrival time
21 Examples of words with > 1-fold difference Word % % p LL p Boot him we mr horse prince goods patent pound merchant li
22 Conclusion Bag-of-words model poorly represents frequency distributions New methods: inter-arrival times and bootstrap method (Lijffijt et al. 211) Take into account burstiness (or dispersion) of words More conservative p-values Not covered in presentation: - Correction for multiple hypotheses Future work: - More in-depth analysis of differences between methods - Dependency between words 22
23 References Altmann, E.G., Pierrehumbert, J.B., Motter, A.E. (29). Beyond word frequency: Bursts, lulls, and scaling in the temporal distribution of words, PLoS ONE, 4(11):e7678. Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics, 19: Lijffijt, J., Papapetrou, P., Puolamäki, K., Mannila, H. (211). Analyzing Word Frequencies in Large Text Corpora using Inter-arrival Times and Bootstrapping. In ECML PKDD 211, Part II, pp
24 Most frequent significant words Word % % p LL p Boot my <.1.1 that <.1.1 your <.1.1 it <.1.1 is <.1.1 and <.1.1 with <.1.1 but <.1.1 in < words significant at α =.5 in all four methods 24
Are you talking Bernoulli to me? Significance testing and burstiness of words in text corpora
Significance testing and burstiness of words in text corpora Doctoral Student and Researcher Department of Information and Computer Science, Aalto University Helsinki Institute for Information Technology
More informationApplied Natural Language Processing
Applied Natural Language Processing Info 256 Lecture 3: Finding Distinctive Terms (Jan 29, 2019) David Bamman, UC Berkeley https://www.nytimes.com/interactive/ 2017/11/07/upshot/modern-love-what-wewrite-when-we-write-about-love.html
More informationLing 289 Contingency Table Statistics
Ling 289 Contingency Table Statistics Roger Levy and Christopher Manning This is a summary of the material that we ve covered on contingency tables. Contingency tables: introduction Odds ratios Counting,
More informationAnalysis of Variance
Analysis of Variance Chapter 12 McGraw-Hill/Irwin Copyright 2013 by The McGraw-Hill Companies, Inc. All rights reserved. Learning Objectives LO 12-1 List the characteristics of the F distribution and locate
More informationContingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878
Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each
More informationWe know from STAT.1030 that the relevant test statistic for equality of proportions is:
2. Chi 2 -tests for equality of proportions Introduction: Two Samples Consider comparing the sample proportions p 1 and p 2 in independent random samples of size n 1 and n 2 out of two populations which
More informationTesting Independence
Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1
More informationPredicting flight on-time performance
1 Predicting flight on-time performance Arjun Mathur, Aaron Nagao, Kenny Ng I. INTRODUCTION Time is money, and delayed flights are a frequent cause of frustration for both travellers and airline companies.
More informationN-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24
L545 Dept. of Linguistics, Indiana University Spring 2013 1 / 24 Morphosyntax We just finished talking about morphology (cf. words) And pretty soon we re going to discuss syntax (cf. sentences) In between,
More informationNatural Language Processing SoSe Words and Language Model
Natural Language Processing SoSe 2016 Words and Language Model Dr. Mariana Neves May 2nd, 2016 Outline 2 Words Language Model Outline 3 Words Language Model Tokenization Separation of words in a sentence
More informationChi-Squared Tests. Semester 1. Chi-Squared Tests
Semester 1 Goodness of Fit Up to now, we have tested hypotheses concerning the values of population parameters such as the population mean or proportion. We have not considered testing hypotheses about
More informationNatural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)
Natural Language Processing SoSe 2015 Language Modelling Dr. Mariana Neves April 20th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Motivation Estimation Evaluation Smoothing Outline 3 Motivation
More informationLanguage as a Stochastic Process
CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any
More information13.1 Categorical Data and the Multinomial Experiment
Chapter 13 Categorical Data Analysis 13.1 Categorical Data and the Multinomial Experiment Recall Variable: (numerical) variable (i.e. # of students, temperature, height,). (non-numerical, categorical)
More informationHypothesis Testing: Chi-Square Test 1
Hypothesis Testing: Chi-Square Test 1 November 9, 2017 1 HMS, 2017, v1.0 Chapter References Diez: Chapter 6.3 Navidi, Chapter 6.10 Chapter References 2 Chi-square Distributions Let X 1, X 2,... X n be
More informationMAT2377. Ali Karimnezhad. Version December 13, Ali Karimnezhad
MAT2377 Ali Karimnezhad Version December 13, 2016 Ali Karimnezhad Comments These slides cover material from Chapter 4. In class, I may use a blackboard. I recommend reading these slides before you come
More informationExercise 1: Basics of probability calculus
: Basics of probability calculus Stig-Arne Grönroos Department of Signal Processing and Acoustics Aalto University, School of Electrical Engineering stig-arne.gronroos@aalto.fi [21.01.2016] Ex 1.1: Conditional
More informationOptimum parameter selection for K.L.D. based Authorship Attribution for Gujarati
Optimum parameter selection for K.L.D. based Authorship Attribution for Gujarati Parth Mehta DA-IICT, Gandhinagar parth.mehta126@gmail.com Prasenjit Majumder DA-IICT, Gandhinagar prasenjit.majumder@gmail.com
More informationRon Heck, Fall Week 3: Notes Building a Two-Level Model
Ron Heck, Fall 2011 1 EDEP 768E: Seminar on Multilevel Modeling rev. 9/6/2011@11:27pm Week 3: Notes Building a Two-Level Model We will build a model to explain student math achievement using student-level
More informationContingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.
Contingency Tables Definition & Examples. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. (Using more than two factors gets complicated,
More informationLecture 41 Sections Wed, Nov 12, 2008
Lecture 41 Sections 14.1-14.3 Hampden-Sydney College Wed, Nov 12, 2008 Outline 1 2 3 4 5 6 7 one-proportion test that we just studied allows us to test a hypothesis concerning one proportion, or two categories,
More informationPROBABILITY.
PROBABILITY PROBABILITY(Basic Terminology) Random Experiment: If in each trial of an experiment conducted under identical conditions, the outcome is not unique, but may be any one of the possible outcomes,
More informationThis does not cover everything on the final. Look at the posted practice problems for other topics.
Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry
More informationIntroduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution
Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis
More informationAoCMM #692. October 11, 2017
AoCMM #692 October 11, 2017 #692 1 Contents 1 Summary 3 1.1 Interpretation and Methods of roblem 1................ 3 1.2 Interpretation and Methods of roblem 2a............... 4 1.3 Interpretation and
More informationStatistics 3858 : Contingency Tables
Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson
More informationReport on Wiarton Keppel International Airport for Georgian Bluffs Prepared by: Alec Dare, BA, OCGC Research Analyst, Centre for Applied Research and
Report on Wiarton Keppel International Airport for Georgian Bluffs Prepared by: Alec Dare, BA, OCGC Research Analyst, Centre for Applied Research and Innovation Georgian College March 2018 Contents 1.
More informationGROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION
FOR SAMPLE OF RAW DATA (E.G. 4, 1, 7, 5, 11, 6, 9, 7, 11, 5, 4, 7) BE ABLE TO COMPUTE MEAN G / STANDARD DEVIATION MEDIAN AND QUARTILES Σ ( Σ) / 1 GROUPED DATA E.G. AGE FREQ. 0-9 53 10-19 4...... 80-89
More informationNominal Data. Parametric Statistics. Nonparametric Statistics. Parametric vs Nonparametric Tests. Greg C Elvers
Nominal Data Greg C Elvers 1 Parametric Statistics The inferential statistics that we have discussed, such as t and ANOVA, are parametric statistics A parametric statistic is a statistic that makes certain
More informationtomamu hokkaido, japan RIDE A DIFFERENT WAVE
tomamu hokkaido, japan RIDE A DIFFERENT WAVE IT S ALL TAKEN CARE OF SKI PASSES & LESSON INCLUDED NON-SKIING ACTIVITIES CHILDREN S CLUB GOURMET FOOD & OPEN BAR REDEFINED COMFORT NIGHT ENTERTAINMENT Flights
More informationStatistical Analysis for QBIC Genetics Adapted by Ellen G. Dow 2017
Statistical Analysis for QBIC Genetics Adapted by Ellen G. Dow 2017 I. χ 2 or chi-square test Objectives: Compare how close an experimentally derived value agrees with an expected value. One method to
More informationISCA Archive
ISCA Archive http://www.isca-speech.org/archive ODYSSEY04 - The Speaker and Language Recognition Workshop Toledo, Spain May 3 - June 3, 2004 Analysis of Multitarget Detection for Speaker and Language Recognition*
More informationUnit #4: Collocation analysis in R Stefan Evert 23 June 2016
Unit #4: Collocation analysis in R Stefan Evert 23 June 2016 Contents Co-occurrence data & contingency tables 1 Notation................................................... 1 Contingency tables in R..........................................
More informationHypothesis testing. Data to decisions
Hypothesis testing Data to decisions The idea Null hypothesis: H 0 : the DGP/population has property P Under the null, a sample statistic has a known distribution If, under that that distribution, the
More informationHypothesis Tests and Estimation for Population Variances. Copyright 2014 Pearson Education, Inc.
Hypothesis Tests and Estimation for Population Variances 11-1 Learning Outcomes Outcome 1. Formulate and carry out hypothesis tests for a single population variance. Outcome 2. Develop and interpret confidence
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,
More information# of 6s # of times Test the null hypthesis that the dice are fair at α =.01 significance
Practice Final Exam Statistical Methods and Models - Math 410, Fall 2011 December 4, 2011 You may use a calculator, and you may bring in one sheet (8.5 by 11 or A4) of notes. Otherwise closed book. The
More informationCHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)
FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter
More informationExploratory Data Analysis
CS448B :: 30 Sep 2010 Exploratory Data Analysis Last Time: Visualization Re-Design Jeffrey Heer Stanford University In-Class Design Exercise Mackinlay s Ranking Task: Analyze and Re-design visualization
More informationLecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests
Lecture 9 Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests Univariate categorical data Univariate categorical data are best summarized in a one way frequency table.
More information* * MATHEMATICS (MEI) 4767 Statistics 2 ADVANCED GCE. Monday 25 January 2010 Morning. Duration: 1 hour 30 minutes. Turn over
ADVANCED GCE MATHEMATICS (MEI) 4767 Statistics 2 Candidates answer on the Answer Booklet OCR Supplied Materials: 8 page Answer Booklet Graph paper MEI Examination Formulae and Tables (MF2) Other Materials
More informationSTA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).
STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent
More informationSample Questions. Chapter 2
Chapter 2 P R E P A R I N G T O T A K E T H E T O E I C T E S T Sample Questions With 200 questions, the TOEIC test measures a wide range of English proficiency. The following sample questions do not indicate
More informationClassroom Activity 7 Math 113 Name : 10 pts Intro to Applied Stats
Classroom Activity 7 Math 113 Name : 10 pts Intro to Applied Stats Materials Needed: Bags of popcorn, watch with second hand or microwave with digital timer. Instructions: Follow the instructions on the
More informationPROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability
More informationModel Fitting using Excel and Gnuplot
Model Fitting using Excel and Gnuplot Biochemistry Boot Camp 18 Session #4 Nick Fitzkee nfitzkee@chemistry.msstate.edu Think and Discuss What is a scientific model? 1 Properties of Models Explain an observable
More informationMath 475. Jimin Ding. August 29, Department of Mathematics Washington University in St. Louis jmding/math475/index.
istical A istic istics : istical Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html August 29, 2013 istical August 29, 2013 1 / 18 istical A istic
More informationINTERVAL ESTIMATION AND HYPOTHESES TESTING
INTERVAL ESTIMATION AND HYPOTHESES TESTING 1. IDEA An interval rather than a point estimate is often of interest. Confidence intervals are thus important in empirical work. To construct interval estimates,
More informationChapters 10. Hypothesis Testing
Chapters 10. Hypothesis Testing Some examples of hypothesis testing 1. Toss a coin 100 times and get 62 heads. Is this coin a fair coin? 2. Is the new treatment on blood pressure more effective than the
More informationExperiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition
Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition ABSTRACT It is well known that the expectation-maximization (EM) algorithm, commonly used to estimate hidden
More informationPrice Discrimination through Refund Contracts in Airlines
Introduction Price Discrimination through Refund Contracts in Airlines Paan Jindapon Department of Economics and Finance The University of Texas - Pan American Department of Economics, Finance and Legal
More informationAP Statistics Final Examination Free-Response Questions
AP Statistics Final Examination Free-Response Questions Name Date Period Section II Part A Questions 1 4 Spend about 50 minutes on this part of the exam (70 points) Directions: You must show all work and
More informationModeling Environment
Topic Model Modeling Environment What does it mean to understand/ your environment? Ability to predict Two approaches to ing environment of words and text Latent Semantic Analysis (LSA) Topic Model LSA
More informationConfidence Intervals for Normal Data Spring 2014
Confidence Intervals for Normal Data 18.05 Spring 2014 Agenda Today Review of critical values and quantiles. Computing z, t, χ 2 confidence intervals for normal data. Conceptual view of confidence intervals.
More informationPearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world
Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world Visit us on the World Wide Web at: www.pearsoned.co.uk Pearson Education Limited 2014
More informationDiscrete Probability Distributions
Discrete Probability Distributions Chapter 6 McGraw-Hill/Irwin Copyright 2012 by The McGraw-Hill Companies, Inc. All rights reserved. LO5 Describe and compute probabilities for a binomial distribution.
More informationANOVA - analysis of variance - used to compare the means of several populations.
12.1 One-Way Analysis of Variance ANOVA - analysis of variance - used to compare the means of several populations. Assumptions for One-Way ANOVA: 1. Independent samples are taken using a randomized design.
More information7.1 What is it and why should we care?
Chapter 7 Probability In this section, we go over some simple concepts from probability theory. We integrate these with ideas from formal language theory in the next chapter. 7.1 What is it and why should
More informationPrenominal Modifier Ordering via MSA. Alignment
Introduction Prenominal Modifier Ordering via Multiple Sequence Alignment Aaron Dunlop Margaret Mitchell 2 Brian Roark Oregon Health & Science University Portland, OR 2 University of Aberdeen Aberdeen,
More informationPresentation Outline: Haunted Places in North America
Name: Date: Presentation Outline: Haunted Places in North America Grade: / 25 Context & Task Halloween is on it s way! What better way to celebrate Halloween than to do a project on haunted places! There
More informationChapter 26: Comparing Counts (Chi Square)
Chapter 6: Comparing Counts (Chi Square) We ve seen that you can turn a qualitative variable into a quantitative one (by counting the number of successes and failures), but that s a compromise it forces
More information11 CHI-SQUARED Introduction. Objectives. How random are your numbers? After studying this chapter you should
11 CHI-SQUARED Chapter 11 Chi-squared Objectives After studying this chapter you should be able to use the χ 2 distribution to test if a set of observations fits an appropriate model; know how to calculate
More informationInformation-Based Similarity Index
Information-Based Similarity Index Albert C.-C. Yang, Ary L Goldberger, C.-K. Peng This document can also be read on-line at http://physionet.org/physiotools/ibs/doc/. The most recent version of the software
More informationCreated by T. Madas POISSON DISTRIBUTION. Created by T. Madas
POISSON DISTRIBUTION STANDARD CALCULATIONS Question 1 Accidents occur on a certain stretch of motorway at the rate of three per month. Find the probability that on a given month there will be a) no accidents.
More informationPractice Final Exam ANSWER KEY Chapters 7-13
Practice Final Exam ANSWER KEY Chapters 7-13 1. 0.0410 (normal) 2. 0.7580 (normal) 3. 66.09 (reverse normal) 4. The critical value for a sample size of n = 6 is 0.888. Since the value of r (0.974) is greater
More informationCBA4 is live in practice mode this week exam mode from Saturday!
Announcements CBA4 is live in practice mode this week exam mode from Saturday! Material covered: Confidence intervals (both cases) 1 sample hypothesis tests (both cases) Hypothesis tests for 2 means as
More informationOn max-algebraic models for transportation networks
K.U.Leuven Department of Electrical Engineering (ESAT) SISTA Technical report 98-00 On max-algebraic models for transportation networks R. de Vries, B. De Schutter, and B. De Moor If you want to cite this
More information1 What are probabilities? 2 Sample Spaces. 3 Events and probability spaces
1 What are probabilities? There are two basic schools of thought as to the philosophical status of probabilities. One school of thought, the frequentist school, considers the probability of an event to
More informationLogistic Regression. Some slides from Craig Burkett. STA303/STA1002: Methods of Data Analysis II, Summer 2016 Michael Guerzhoy
Logistic Regression Some slides from Craig Burkett STA303/STA1002: Methods of Data Analysis II, Summer 2016 Michael Guerzhoy Titanic Survival Case Study The RMS Titanic A British passenger liner Collided
More informationN-gram N N-gram. N-gram. Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character N-gram.
Vol. 40 No. 6 June 1999 N N 3 N N 5 36-gram 5 4-gram Detection and Correction for Errors in Hiragana Sequences by a Hiragana Character Hiroyuki Shinnou In this paper, we propose the hiragana character
More informationEast Greenland Coming to Greenland it is all about the experience of the Arctic!
East Greenland 2015 The heliski area is on the East Coast in a region named Angmagssalik. The mountain ranges are between large fiords coming off the Greenland ice cap. The highest peaks in the region
More informationcmp-lg/ Aug 1996
Automatic Alignment of nglish-hinese Bilingual Texts of NS News Donghua Xu hew Lim Tan xudonghu@iscs.nus.sg tancl@iscs.nus.sg Department of Information System and omputer Science, National University of
More informationN-gram Language Modeling
N-gram Language Modeling Outline: Statistical Language Model (LM) Intro General N-gram models Basic (non-parametric) n-grams Class LMs Mixtures Part I: Statistical Language Model (LM) Intro What is a statistical
More informationEstimation of An Event Occurrence for LOPA Studies. Randy Freeman S&PP Consulting Houston, TX
Estimation of An Event Occurrence for LOPA Studies Randy Freeman S&PP Consulting Houston, TX 77041 713 408 0357 rafree@yahoo.com 1 Problem Your LOPA team members tell you that the initiating event of concern
More informationMathematics (Project Maths Phase 2)
2013. M230 Coimisiún na Scrúduithe Stáit State Examinations Commission Leaving Certificate Examination 2013 Mathematics (Project Maths Phase 2) Paper 2 Higher Level Monday 10 June Morning 9:30 12:00 300
More informationCSCI3381-Cryptography
CSCI3381-Cryptography Lecture 2: Classical Cryptosystems September 3, 2014 This describes some cryptographic systems in use before the advent of computers. All of these methods are quite insecure, from
More informationTrip Generation Calculations
Trip Generation Calculations A HORIZON AIR ALLEGIANT AI R EMPLOYEES Daily Trip Generation Daily Trip Generation Daily Trip Generation Directional Flights (per day): 20 Directional Flights (per day): 2.8
More informationGenre-specific properties of adverbial clauses in English academic writing and newspaper texts. Elma Kerz & Daniel Wiechmann
Genre-specific properties of adverbial clauses in English academic writing and newspaper texts Elma Kerz & Daniel Wiechmann A constructionist perspective on patterns Schematicity Complex sentence constructions
More informationBinomial and Poisson Probability Distributions
Binomial and Poisson Probability Distributions Esra Akdeniz March 3, 2016 Bernoulli Random Variable Any random variable whose only possible values are 0 or 1 is called a Bernoulli random variable. What
More informationVariable Latent Semantic Indexing
Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background
More informationProbability 5-4 The Multiplication Rules and Conditional Probability
Outline Lecture 8 5-1 Introduction 5-2 Sample Spaces and 5-3 The Addition Rules for 5-4 The Multiplication Rules and Conditional 5-11 Introduction 5-11 Introduction as a general concept can be defined
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information
More information10.2: The Chi Square Test for Goodness of Fit
10.2: The Chi Square Test for Goodness of Fit We can perform a hypothesis test to determine whether the distribution of a single categorical variable is following a proposed distribution. We call this
More informationText mining and natural language analysis. Jefrey Lijffijt
Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably
More informationTravel and Transportation
A) Locations: B) Time 1) in the front 4) by the window 7) earliest 2) in the middle 5) on the isle 8) first 3) in the back 6) by the door 9) next 10) last 11) latest B) Actions: 1) Get on 2) Get off 3)
More informationProbability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur
Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Lecture No. # 36 Sampling Distribution and Parameter Estimation
More informationSTATISTICS SYLLABUS UNIT I
STATISTICS SYLLABUS UNIT I (Probability Theory) Definition Classical and axiomatic approaches.laws of total and compound probability, conditional probability, Bayes Theorem. Random variable and its distribution
More informationPerformance Evaluation and Comparison
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation
More informationSTA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).
STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population
More informationExample. χ 2 = Continued on the next page. All cells
Section 11.1 Chi Square Statistic k Categories 1 st 2 nd 3 rd k th Total Observed Frequencies O 1 O 2 O 3 O k n Expected Frequencies E 1 E 2 E 3 E k n O 1 + O 2 + O 3 + + O k = n E 1 + E 2 + E 3 + + E
More informationLecture 28 Chi-Square Analysis
Lecture 28 STAT 225 Introduction to Probability Models April 23, 2014 Whitney Huang Purdue University 28.1 χ 2 test for For a given contingency table, we want to test if two have a relationship or not
More informationLesson 7.9 = ( _ + _ ) + _ = ( _ ) + _ = _ ALGEBRA
Name Fractions and Properties of Addition Essential Question How can you add fractions with like denominators using the properties of addition? connect The Associative and Commutative Properties of Addition
More informationMark Scheme (Results) June 2008
Mark Scheme (Results) June 008 GCE GCE Mathematics (669101) Edexcel Limited. Registered in England and Wales No. 4496750 Registered Office: One90 High Holborn, London WC1V 7BH 1 June 008 6691 Statistics
More informationPath Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis
Path Analysis PRE 906: Structural Equation Modeling Lecture #5 February 18, 2015 PRE 906, SEM: Lecture 5 - Path Analysis Key Questions for Today s Lecture What distinguishes path models from multivariate
More informationHW Solution 5 Due: Sep 13
ECS 315: Probability and Random Processes 01/1 HW Solution 5 Due: Sep 13 Lecturer: Prapun Suksompong, Ph.D. Instructions (a) ONE part of a question will be graded (5 pt). Of course, you do not know which
More informationSequential Analysis & Testing Multiple Hypotheses,
Sequential Analysis & Testing Multiple Hypotheses, CS57300 - Data Mining Spring 2016 Instructor: Bruno Ribeiro 2016 Bruno Ribeiro This Class: Sequential Analysis Testing Multiple Hypotheses Nonparametric
More informationBasic Verification Concepts
Basic Verification Concepts Barbara Brown National Center for Atmospheric Research Boulder Colorado USA bgb@ucar.edu Basic concepts - outline What is verification? Why verify? Identifying verification
More informationA Bayesian mixture model for term re-occurrence and burstiness
A Bayesian mixture model for term re-occurrence and burstiness Avik Sarkar 1, Paul H Garthwaite 2, Anne De Roeck 1 1 Department of Computing, 2 Department of Statistics The Open University Milton Keynes,
More informationTime: 1 hour 30 minutes
Paper Reference(s) 6684/0 Edexcel GCE Statistics S Silver Level S Time: hour 30 minutes Materials required for examination papers Mathematical Formulae (Green) Items included with question Nil Candidates
More informationModel Estimation Example
Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions
More information