Transactions on Business and Engineering Intelligent Applications 71

Size: px
Start display at page:

Download "Transactions on Business and Engineering Intelligent Applications 71"

Transcription

1 Transactions on Business and Engineering Intelligent Applications 71 EVALUATING EFFECTIVENESS OF LINGUISTIC TECHNOLOGIES OF KNOWLEDGE IDENTIFICATION IN TEXT COLLECTIONS Nina Khairova, Gennady Shepelyov, Svetlana Petrasova Abstract: The possibility of using integral coefficients of recall and precision to evaluate effectiveness of linguistic technologies of knowledge identification in texts is analyzed in the paper. An approach is based on the method of test collections, which is used for experimental validation of received effectiveness coefficients, and on methods of mathematical statistics. The problem of maximizing the reliability of sample results in their propagation on the general population of the tested text collection is studied. The method for determining the confidence interval for the attribute proportion, which is based on Wilson s formula, and the method for determining the required size of the relevant sample under specified relative error and confidence probability, are considered. Keywords: recall, precision, relevance, confidence interval, sample size. ACM Classification Keywords: I.2.7. Natural Language Processing, G.3. Probability and Statistics Statistical Computing. Introduction Nowadays linguistic technologies have become not only tools for modelling language but also the production factor. Computer linguistics is getting now the most strongly developing direction of information technologies. In fact, every intelligent information system with a user interface, both text and web-content processing systems, uses linguistic technologies. The effectiveness of such technologies depends on morphological, contextual, and syntactic analysis and synthesis as well as on solving the semantic analysis problems. The number of linguistic and information approaches to solve this problem is constantly growing. Therefore common metrics should be introduced for evaluation of the effectiveness of such technologies and their comparison. But currently, there are no standard benchmarks to measure effectiveness using mentioned technologies in text collections. We propose here some indicators and test their reliability. Usually the method of test collections is used for estimating the effectiveness of linguistic technologies in different systems of text classification, information retrieval, text mining, opinion mining, web mining etc. [Cormack, 1998]. The essence of this method consists in comparing the results of the tested technology at predetermined texts with expert evaluation for the same texts. However comparing results of the method with experts opinions generates the two main problems: expert subjectivity; the need for the determination of the text collection size to make experimental results reliable. Notion the reliability means here that experimental results, which were received, will be true under certain conditions also in the framework of a certain wider class of objects. Integral Effectiveness Coefficients of Knowledge Identification Let s use the quantitative effectiveness coefficients of retrieval and classification approved by interstate standards for information, library science, and publishing [ISO 12620:2009]. These coefficients are precision, recall. All these coefficients are based on the subjectively determined concept of relevance. The concept of relevance is

2 72 ITHEA difficult to define and has a rather psychological nature. We use the definition of relevance [Mizzaro, 1997], in which relevance depends on four concepts of Relevance (IR, IN, C, T) Here IR is an information resource, which is presented by a set of collection texts for processing, IN are information needs, C is context and T is time. Relevance is defined by experts on the scale of relevant/irrelevant/undefined and shows the correspondence or discrepancy of a text to a certain knowledge domain. To calculate the coefficients of system recall and system precision for each domain of expert s knowledge, it is necessary to determine the following parameters: n - a number of elements identified by the system as relevant, which are relevant to the local domain knowledge from an expert s viewpoint too, n yn - a number of elements identified by the system as relevant, which are irrelevant to the local domain knowledge from an expert s viewpoint, n ny - a number of elements that the system has not identified as relevant, which are relevant to the local domain knowledge from an expert s viewpoint. Using these parameters coefficients of system precision and system recall are determined by the following formulas: n precision, (1) n n yn recall n. (2) n n ny Sampling Method for Text Collections As the determination of effectiveness indicators is based on the notion of relevance that expert defines subjectively, the reliability of recall, precision, and other effectiveness indicators of the linguistic technologies requires experimental verification on text collection. Since used text collections are huge it makes sense to study only a part of the objects from the experimental collection, that is to execute the so-called sampling research of the population and make valid conclusions about the properties of the whole population. As the general population of any model of natural language texts processing is tending to infinite size, the ratio of the sample size to the general population size is much less than 5-10%, therefore the mathematical apparatus of the sampling with replacement theory can be used how it was shown by Chetyrkin [Chetyrkin, 1982]. Furthermore, in some cases, using the necessary correction coefficient, the results for a sample with replacement can be transferred to the corresponding results for a sample without replacement. Within our issue let s evaluate an attribute proportion in the general population on a basis of the corresponding attribute proportion in the sample. Let s consider the share of relevant texts in the collection R as the attribute proportion that shows the ratio of the number of relevant texts to the total number of collection texts. The sample estimate R S of the proportion R is R S = M / N, where N is the size of an experimental return sample, and M is the number of identified relevant texts in a sample using the identification method. It can be shown that the evaluation satisfies all of the requirements to statistical estimates (consistency, unbiasedness, sufficiency and effectiveness) [Chetyrkin, 1982].

3 Transactions on Business and Engineering Intelligent Applications 73 Since the sample estimate R S is a point estimate of the attribute proportion, the interval estimation R S should be used in order to find the sampling error. Since sampling errors are random variables with the same probability distribution, we can define interval estimate within which the attribute proportion of the population will be found with a certain confidence probability P. Usually, this approach leads to three issue types: determination of the confidence probability for a given confidence interval and sample size; determination of the confidence interval for a given confidence probability and sample size; determination of the necessary sample size for a given confidence probability and error limit. The determination of the confidence interval and the necessary size of the sample are the most important in our problem. Determination of the Confidence Interval for a Given Confidence Probability and Sample Size The determination of the confidence interval of the attribute proportion is based on the binomial distribution law [Clopper, 1934]. However, starting from samples that are more than 20 in size, the binomial distribution is symmetrized and is well approximated by normal distribution with parameters: average <R S> = R, variance D(R S) = R(1 - R)/N, standard deviation σ(r S) = [D(R S)] 1/2. In this case the confidence interval can be calculated using the formula: P( R R S < E α) = 2Ф(Z α) = 1 α, ( 7) where Ф(Z α) is the Laplace function. The margin error of the sample is found from the equation: E α = Z ασ(r S). ( 8) Let s choose the value of 0.95 usually used as the value of confidence probability, then the significance level α is At that Z 0.05 = Then we can get expressions for right and left limits of the confidence interval R from the relation: R R S < Z α[r(1 - R)/N] 1/2. ( 9) To do so we should solve the corresponding quadratic equation for R [Wilson, 1927]. The adequacy of using this approach to estimate the confidence intervals of the attribute proportion for small samples was proved by L.D. Brown and M. A. García-Pérez [Brown, 2001; Garcia-Perez, 2005]. Using the results of the paper [Agresti, 1998], we can get values of confidence limits in simpler way: Z α = R R S /[R(1 - R)/N] 1/2, (10) Determination of the Necessary Sample Size To determine the size of the necessary sample for given confidence probability and the margin of error, we replace R R S in (10) with E and determine N. We can see that: N = [Z 2 R S(1 R S)]/E 2. (11) The ratio (11) for size of the sample includes yet unknown sample proportion R S. Since this proportion is unknown, it is reasonable to determine it so that the size of the sample N is maximal. Then it will be acceptable for all feasible R S. It is easy to see that the maximum N as the function from R S is reached at R S = ½, that is N MAX = Z 2 /4E 2. Certainly, if there are a priori assumptions about the value of the attribute proportion during the research (by analogy, from the experience), this value should be used in the ratio (11). The necessary size of the sample is less than maximum one.

4 74 ITHEA Quite often the value of the margin of error E = 0.05 is used for determining the proportions of attributes. Using MS-Excel, let s consider the following illustrative example given in Figure 1. Figure 1. Example of the Evaluation of the Necessary Size of Sample Let us have a sample with replacement of size N = 10 objects and the attribute proportion of R S = 0.9. Using Goal Seek procedure, we get values of right and left confidence limits for the attribute proportion in the population R for 0.95 confidence probability: 0.59 < R < The obtained confidence interval is too wide. Let s find the maximum size of the sample N MAX for the same confidence probability 0.95 (then Z = 1.96) and the margin error E = 0.05 (recall that N MAX = Z 2 /4E 2 ). Rounding the computed necessary size of the sample up to an integer, we have: N MAX = 385. Using Goal Seek once again, we get a new narrower confidence interval: 0.87 < R < It has been achieved at the cost of considerable increase of the necessary size of the sample. Conclusion Thus this paper substantiates the usage of recall and precision to evaluate effectiveness of knowledge identification by means of linguistic technologies in text collections. The methods of mathematical statistics are used for determining the evaluation error of chosen coefficients. We considered the problem of the evaluation of the results obtained from the sample and evaluated an attribute proportion in the general population on a basis of the corresponding attribute proportion in the sample. The attribute proportion is considered as a share of relevant texts in the collection. It shows the ratio of the number of relevant texts to the total number of collection texts. The confidence interval for the attribute proportion was computed and the necessary size of the relevant sample was determined in given confidence probability. The experimentally determined values of recall and precision coefficients for the sample size correspond to the values of the same coefficients for the complete text collection for given confidence probability equal to 0.95 and the error limit equal to Bibliography [Agresti, 1998] Agresti A., B. Coull A. Approximate is better than exact for interval estimation of binomial proportions // American statistician N 52. С [Brown, 2001]. L. D. Brown, T. T. Cai, A. Dasgupta. D. Interval estimation for a binomial proportion // Statistical science N 2. P [Chetyrkin, 1982] Chetyrkin Ye. M., Kalichman I. L. Probability and Statistics. M.: Finance and Statistics, 1982 (in Russian). [Clopper, 1934] Clopper C. J., E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial // Biometrika N 26. P

5 Transactions on Business and Engineering Intelligent Applications 75 [Cormack, 1998] Cormack G.V. A Efficient construction of large test collections / G. V. Cormack, C. R. Palmer, C. L. Clarke // Proc. of the SIGIR'98 Р [Garcia-Perez, 2005] Garcia-Perez M. A. On the confidence interval for the binomial parameter // Quality and quantity N 39. P [ISO 12620:2009] "ISO 12620: Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources". Retrieved 9 November [Mizzaro, 1997] Mizzaro S. Relevance: The whole history. Journal of American Society for Information Science V Р [Wilson, 1927] Wilson E. B. Probable inference, the law of succession, and statistical inference //Journal of American Statistical Association N 22. P Author's Information Nina Khairova Doctor of Computer Linguistics, Professor of Intelligent Computer Systems Department of National Technical University Kharkiv Polytechnic Institute, 21, Frunze str., Kharkiv, Ukraine, nina_khajrova@yahoo.com Major Fields of Scientific Research: artificial intelligence, knowledge identification in texts, text mining, opinion mining, web mining, natural language processing. Gennady Shepelyov Head of Laboratory "Computer systems based on knowledge" of Institute for systems studies of RAS, Prospect 60-letiya Oktyabrya, 9 Moscow Russia gis@isa.ru Major Fields of Scientific Research: mathematical modelling, probabilistic methods, interval analysis, generalized interval estimations, comparing interval alternatives. Svetlana Petrasova Postgraduate of Intelligent Computer Systems Department of National Technical University Kharkiv Polytechnic Institute, 21, Frunze str., Kharkiv, Ukraine, svetapetrasova@gmail.com Major Fields of Scientific Research: artificial intelligence, knowledge engineering, intelligent systems of knowledge representation, computer linguistics, natural language processing.

Lawrence D. Brown, T. Tony Cai and Anirban DasGupta

Lawrence D. Brown, T. Tony Cai and Anirban DasGupta Statistical Science 2005, Vol. 20, No. 4, 375 379 DOI 10.1214/088342305000000395 Institute of Mathematical Statistics, 2005 Comment: Fuzzy and Randomized Confidence Intervals and P -Values Lawrence D.

More information

Using Conservative Estimation for Conditional Probability instead of Ignoring Infrequent Case

Using Conservative Estimation for Conditional Probability instead of Ignoring Infrequent Case Using Conservative Estimation for Conditional Probability instead of Ignoring Infrequent Case Masato Kikuchi, Eiko Yamamoto, Mitsuo Yoshida, Masayuki Okabe, Kyoji Umemura Department of Computer Science

More information

Estimating binomial proportions

Estimating binomial proportions Estimating binomial proportions Roberto Rossi University of Edinburgh Business School Edinburgh, U Problem description Consider the problem of estimating the parameter p of a random variable which follows

More information

A Note on the Effect of Term Weighting on Selecting Intrinsic Dimensionality of Data

A Note on the Effect of Term Weighting on Selecting Intrinsic Dimensionality of Data BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 9, No 1 Sofia 2009 A Note on the Effect of Term Weighting on Selecting Intrinsic Dimensionality of Data Ch. Aswani Kumar 1,

More information

Statistics for Applications Spring 2009

Statistics for Applications Spring 2009 MIT OpenCourseWare http://ocw.mit.edu 18.443 Statistics for Applications Spring 2009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. CONFIDENCE INTERVALS

More information

Weizhen Wang & Zhongzhan Zhang

Weizhen Wang & Zhongzhan Zhang Asymptotic infimum coverage probability for interval estimation of proportions Weizhen Wang & Zhongzhan Zhang Metrika International Journal for Theoretical and Applied Statistics ISSN 006-1335 Volume 77

More information

International Journal "Information Theories & Applications" Vol.14 /

International Journal Information Theories & Applications Vol.14 / International Journal "Information Theories & Applications" Vol.4 / 2007 87 or 2) Nˆ t N. That criterion and parameters F, M, N assign method of constructing sample decision function. In order to estimate

More information

The Axiometrics Project

The Axiometrics Project The Axiometrics Project Eddy Maddalena and Stefano Mizzaro Department of Mathematics and Computer Science University of Udine Udine, Italy eddy.maddalena@uniud.it, mizzaro@uniud.it Abstract. The evaluation

More information

EVALUATING MISCLASSIFICATION PROBABILITY USING EMPIRICAL RISK 1. Victor Nedel ko

EVALUATING MISCLASSIFICATION PROBABILITY USING EMPIRICAL RISK 1. Victor Nedel ko 94 International Journal "Information Theories & Applications" Vol13 [Raudys, 001] Raudys S, Statistical and neural classifiers, Springer, 001 [Mirenkova, 00] S V Mirenkova (edel ko) A method for prediction

More information

arxiv: v1 [math.st] 5 Jul 2007

arxiv: v1 [math.st] 5 Jul 2007 EXPLICIT FORMULA FOR COSTRUCTIG BIOMIAL COFIDECE ITERVAL WITH GUARATEED COVERAGE PROBABILITY arxiv:77.837v [math.st] 5 Jul 27 XIJIA CHE, KEMI ZHOU AD JORGE L. ARAVEA Abstract. In this paper, we derive

More information

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval Epidemiology 9509 Wonders of Biostatistics Chapter 11 (continued) - probability in a single population John Koval Department of Epidemiology and Biostatistics University of Western Ontario What is being

More information

Charles Geyer University of Minnesota. joint work with. Glen Meeden University of Minnesota.

Charles Geyer University of Minnesota. joint work with. Glen Meeden University of Minnesota. Fuzzy Confidence Intervals and P -values Charles Geyer University of Minnesota joint work with Glen Meeden University of Minnesota http://www.stat.umn.edu/geyer/fuzz 1 Ordinary Confidence Intervals OK

More information

Good Confidence Intervals for Categorical Data Analyses. Alan Agresti

Good Confidence Intervals for Categorical Data Analyses. Alan Agresti Good Confidence Intervals for Categorical Data Analyses Alan Agresti Department of Statistics, University of Florida visiting Statistics Department, Harvard University LSHTM, July 22, 2011 p. 1/36 Outline

More information

Modeling User s Cognitive Dynamics in Information Access and Retrieval using Quantum Probability (ESR-6)

Modeling User s Cognitive Dynamics in Information Access and Retrieval using Quantum Probability (ESR-6) Modeling User s Cognitive Dynamics in Information Access and Retrieval using Quantum Probability (ESR-6) SAGAR UPRETY THE OPEN UNIVERSITY, UK QUARTZ Mid-term Review Meeting, Padova, Italy, 07/12/2018 Self-Introduction

More information

SMOOTHING AND PROGNOSIS OF MULTI-FACTOR TIME SERIES OF ECONOMICAL DATA BY MEANS OF LOCAL PROCEDURES (REGRESSION AND CURVATURE EVALUATION)

SMOOTHING AND PROGNOSIS OF MULTI-FACTOR TIME SERIES OF ECONOMICAL DATA BY MEANS OF LOCAL PROCEDURES (REGRESSION AND CURVATURE EVALUATION) 27 SMOOTHING AND PROGNOSIS OF MULTI-FACTOR TIME SERIES OF ECONOMICAL DATA BY MEANS OF LOCAL PROCEDURES (REGRESSION AND CURVATURE EVALUATION) Alexander Kovaldji, Vladimir Averkiev, Marina Sarkissyan Abstract:

More information

A New Confidence Interval for the Difference Between Two Binomial Proportions of Paired Data

A New Confidence Interval for the Difference Between Two Binomial Proportions of Paired Data UW Biostatistics Working Paper Series 6-2-2003 A New Confidence Interval for the Difference Between Two Binomial Proportions of Paired Data Xiao-Hua Zhou University of Washington, azhou@u.washington.edu

More information

A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE SIMILARITY

A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE SIMILARITY IJAMML 3:1 (015) 69-78 September 015 ISSN: 394-58 Available at http://scientificadvances.co.in DOI: http://dx.doi.org/10.1864/ijamml_710011547 A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE

More information

DECISION MAKING SUPPORT AND EXPERT SYSTEMS

DECISION MAKING SUPPORT AND EXPERT SYSTEMS 325 ITHEA DECISION MAKING SUPPORT AND EXPERT SYSTEMS UTILITY FUNCTION DESIGN ON THE BASE OF THE PAIRED COMPARISON MATRIX Stanislav Mikoni Abstract: In the multi-attribute utility theory the utility functions

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Fuzzy Systems. Introduction

Fuzzy Systems. Introduction Fuzzy Systems Introduction Prof. Dr. Rudolf Kruse Christian Moewes {kruse,cmoewes}@iws.cs.uni-magdeburg.de Otto-von-Guericke University of Magdeburg Faculty of Computer Science Department of Knowledge

More information

An Analysis on Consensus Measures in Group Decision Making

An Analysis on Consensus Measures in Group Decision Making An Analysis on Consensus Measures in Group Decision Making M. J. del Moral Statistics and Operational Research Email: delmoral@ugr.es F. Chiclana CCI Faculty of Technology De Montfort University Leicester

More information

The effect on accuracy of tweet sample size for hashtag segmentation dictionary construction

The effect on accuracy of tweet sample size for hashtag segmentation dictionary construction The effect on accuracy of tweet sample size for hashtag segmentation dictionary construction Laurence A. F. Park and Glenn Stone School of Computing, Engineering and Mathematics Western Sydney University,

More information

Online Passive-Aggressive Algorithms. Tirgul 11

Online Passive-Aggressive Algorithms. Tirgul 11 Online Passive-Aggressive Algorithms Tirgul 11 Multi-Label Classification 2 Multilabel Problem: Example Mapping Apps to smart folders: Assign an installed app to one or more folders Candy Crush Saga 3

More information

USING SINGULAR VALUE DECOMPOSITION (SVD) AS A SOLUTION FOR SEARCH RESULT CLUSTERING

USING SINGULAR VALUE DECOMPOSITION (SVD) AS A SOLUTION FOR SEARCH RESULT CLUSTERING POZNAN UNIVE RSIY OF E CHNOLOGY ACADE MIC JOURNALS No. 80 Electrical Engineering 2014 Hussam D. ABDULLA* Abdella S. ABDELRAHMAN* Vaclav SNASEL* USING SINGULAR VALUE DECOMPOSIION (SVD) AS A SOLUION FOR

More information

Text mining and natural language analysis. Jefrey Lijffijt

Text mining and natural language analysis. Jefrey Lijffijt Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably

More information

arxiv: v1 [stat.ap] 17 Mar 2012

arxiv: v1 [stat.ap] 17 Mar 2012 A Note On the Use of Fiducial Limits for Control Charts arxiv:1203.3882v1 [stat.ap] 17 Mar 2012 Mokin Lee School of Mechanical Automotive Engineering University of Ulsan Ulsan, South Korea Chanseok Park

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

STATISTICAL ANALYSIS AND COMPARISON OF SIMULATION MODELS OF HIGHLY DEPENDABLE SYSTEMS - AN EXPERIMENTAL STUDY. Peter Buchholz Dennis Müller

STATISTICAL ANALYSIS AND COMPARISON OF SIMULATION MODELS OF HIGHLY DEPENDABLE SYSTEMS - AN EXPERIMENTAL STUDY. Peter Buchholz Dennis Müller Proceedings of the 2009 Winter Simulation Conference M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, eds. STATISTICAL ANALYSIS AND COMPARISON OF SIMULATION MODELS OF HIGHLY DEPENDABLE

More information

Bayesian Computation

Bayesian Computation Bayesian Computation CAS Centennial Celebration and Annual Meeting New York, NY November 10, 2014 Brian M. Hartman, PhD ASA Assistant Professor of Actuarial Science University of Connecticut CAS Antitrust

More information

Fuzzy and Randomized Confidence Intervals and P-values

Fuzzy and Randomized Confidence Intervals and P-values Fuzzy and Randomized Confidence Intervals and P-values Charles J. Geyer and Glen D. Meeden June 15, 2004 Abstract. The optimal hypothesis tests for the binomial distribution and some other discrete distributions

More information

Coverage Properties of Confidence Intervals for Proportions in Complex Sample Surveys

Coverage Properties of Confidence Intervals for Proportions in Complex Sample Surveys Coverage Properties of Confidence Intervals for Proportions in Complex Sample Surveys Carolina Franco Roderick J. A. Little Thomas A. Louis Eric V. Slud Abstract The challenges of constructing confidence

More information

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition Yu-Seop Kim 1, Jeong-Ho Chang 2, and Byoung-Tak Zhang 2 1 Division of Information and Telecommunication

More information

Reports of the Institute of Biostatistics

Reports of the Institute of Biostatistics Reports of the Institute of Biostatistics No 02 / 2008 Leibniz University of Hannover Natural Sciences Faculty Title: Properties of confidence intervals for the comparison of small binomial proportions

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

JCICS Major Research Areas

JCICS Major Research Areas JCICS Major Research Areas Chemical Information Text Searching Structure and Substructure Searching Databases Patents George W.A. Milne C571 Lecture Fall 2002 1 JCICS Major Research Areas Chemical Computation

More information

A Simple Approximate Procedure for Constructing Binomial and Poisson Tolerance Intervals

A Simple Approximate Procedure for Constructing Binomial and Poisson Tolerance Intervals This article was downloaded by: [Kalimuthu Krishnamoorthy] On: 11 February 01, At: 08:40 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 107954 Registered office:

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents

More information

Confidence Intervals. CAS Antitrust Notice. Bayesian Computation. General differences between Bayesian and Frequntist statistics 10/16/2014

Confidence Intervals. CAS Antitrust Notice. Bayesian Computation. General differences between Bayesian and Frequntist statistics 10/16/2014 CAS Antitrust Notice Bayesian Computation CAS Centennial Celebration and Annual Meeting New York, NY November 10, 2014 Brian M. Hartman, PhD ASA Assistant Professor of Actuarial Science University of Connecticut

More information

Uncertain Logic with Multiple Predicates

Uncertain Logic with Multiple Predicates Uncertain Logic with Multiple Predicates Kai Yao, Zixiong Peng Uncertainty Theory Laboratory, Department of Mathematical Sciences Tsinghua University, Beijing 100084, China yaok09@mails.tsinghua.edu.cn,

More information

A Regularized Recommendation Algorithm with Probabilistic Sentiment-Ratings

A Regularized Recommendation Algorithm with Probabilistic Sentiment-Ratings A Regularized Recommendation Algorithm with Probabilistic Sentiment-Ratings Filipa Peleja, Pedro Dias and João Magalhães Department of Computer Science Faculdade de Ciências e Tecnologia Universidade Nova

More information

DISTINGUING NON-DETERMINISTIC TIMED FINITE STATE MACHINES

DISTINGUING NON-DETERMINISTIC TIMED FINITE STATE MACHINES DISTINGUING NON-DETERMINISTIC TIMED FINITE STATE MACHINES Maxim Gromov 1, Khaled El-Fakih 2, Natalia Shabaldina 1, Nina Yevtushenko 1 1 Tomsk State University, 36 Lenin Str.. Tomsk, 634050, Russia gromov@sibmail.com,

More information

Interval estimation via tail functions

Interval estimation via tail functions Page 1 of 42 Interval estimation via tail functions Borek Puza 1,2 and Terence O'Neill 1, 6 May 25 1 School of Finance and Applied Statistics, Australian National University, ACT 2, Australia. 2 Corresponding

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

CALCULATING OF RELIABILITY PARAMETERS OF MICROELECTRONIC COMPONENTS AND DEVICES BY MEANS OF VIRTUAL LABORATORY

CALCULATING OF RELIABILITY PARAMETERS OF MICROELECTRONIC COMPONENTS AND DEVICES BY MEANS OF VIRTUAL LABORATORY 134 I T H E A CALCULATING OF RELIABILITY PARAMETERS OF MICROELECTRONIC COMPONENTS AND DEVICES BY MEANS OF VIRTUAL LABORATORY Oleksandr Palagin, Peter Stanchev, Volodymyr Romanov, Krassimir Markov, Igor

More information

Fuzzy Systems. Introduction

Fuzzy Systems. Introduction Fuzzy Systems Introduction Prof. Dr. Rudolf Kruse Christoph Doell {kruse,doell}@iws.cs.uni-magdeburg.de Otto-von-Guericke University of Magdeburg Faculty of Computer Science Department of Knowledge Processing

More information

Language Models, Smoothing, and IDF Weighting

Language Models, Smoothing, and IDF Weighting Language Models, Smoothing, and IDF Weighting Najeeb Abdulmutalib, Norbert Fuhr University of Duisburg-Essen, Germany {najeeb fuhr}@is.inf.uni-due.de Abstract In this paper, we investigate the relationship

More information

Agreement Coefficients and Statistical Inference

Agreement Coefficients and Statistical Inference CHAPTER Agreement Coefficients and Statistical Inference OBJECTIVE This chapter describes several approaches for evaluating the precision associated with the inter-rater reliability coefficients of the

More information

The NEO problem: activities in Russia

The NEO problem: activities in Russia 47th Session of UN COPUOS Vienna, Austria, February 2010 Scientific and Technical Subcommittee The NEO problem: activities in Russia Boris Shustov The Expert Working Group on the АСН Problem by the Space

More information

A Fixed Point Representation of References

A Fixed Point Representation of References A Fixed Point Representation of References Susumu Yamasaki Department of Computer Science, Okayama University, Okayama, Japan yamasaki@momo.cs.okayama-u.ac.jp Abstract. This position paper is concerned

More information

Reducing Computation Time for the Analysis of Large Social Science Datasets

Reducing Computation Time for the Analysis of Large Social Science Datasets Reducing Computation Time for the Analysis of Large Social Science Datasets Douglas G. Bonett Center for Statistical Analysis in the Social Sciences University of California, Santa Cruz Jan 28, 2014 Overview

More information

PROD. TYPE: COM. Simple improved condence intervals for comparing matched proportions. Alan Agresti ; and Yongyi Min UNCORRECTED PROOF

PROD. TYPE: COM. Simple improved condence intervals for comparing matched proportions. Alan Agresti ; and Yongyi Min UNCORRECTED PROOF pp: --2 (col.fig.: Nil) STATISTICS IN MEDICINE Statist. Med. 2004; 2:000 000 (DOI: 0.002/sim.8) PROD. TYPE: COM ED: Chandra PAGN: Vidya -- SCAN: Nil Simple improved condence intervals for comparing matched

More information

Alignment of Resource and Reserve Classification Systems. Russian Federation CRIRSCO

Alignment of Resource and Reserve Classification Systems. Russian Federation CRIRSCO Alignment of Resource and Reserve Classification Systems Russian Federation and CRIRSCO Dr Stephen Henley Niall Young Committee for Mineral Reserves International Reporting Standards www.crirsco.com 1

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models.

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models. , I. Toy Markov, I. February 17, 2017 1 / 39 Outline, I. Toy Markov 1 Toy 2 3 Markov 2 / 39 , I. Toy Markov A good stack of examples, as large as possible, is indispensable for a thorough understanding

More information

Ranking-II. Temporal Representation and Retrieval Models. Temporal Information Retrieval

Ranking-II. Temporal Representation and Retrieval Models. Temporal Information Retrieval Ranking-II Temporal Representation and Retrieval Models Temporal Information Retrieval Ranking in Information Retrieval Ranking documents important for information overload, quickly finding documents which

More information

THE SHORTEST CONFIDENCE INTERVAL FOR PROPORTION IN FINITE POPULATIONS

THE SHORTEST CONFIDENCE INTERVAL FOR PROPORTION IN FINITE POPULATIONS APPLICATIOES MATHEMATICAE Online First version Wojciech Zieliński (Warszawa THE SHORTEST COFIDECE ITERVAL FOR PROPORTIO I FIITE POPULATIOS Abstract. Consider a finite population. Let θ (0, 1 denote the

More information

Plugin Confidence Intervals in Discrete Distributions

Plugin Confidence Intervals in Discrete Distributions Plugin Confidence Intervals in Discrete Distributions T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania Philadelphia, PA 19104 Abstract The standard Wald interval is widely

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

An Efficient Decision Procedure for Functional Decomposable Theories Based on Dual Constraints

An Efficient Decision Procedure for Functional Decomposable Theories Based on Dual Constraints An Efficient Decision Procedure for Functional Decomposable Theories Based on Dual Constraints Khalil Djelloul Laboratoire d Informatique Fondamentale d Orléans. Bat. 3IA, rue Léonard de Vinci. 45067 Orléans,

More information

Formalization of GIS functionality

Formalization of GIS functionality Formalization of GIS functionality Over the past four decades humans have invested significantly in the construction of tools for handling digital representations of spaces and their contents. These include

More information

Nested Epistemic Logic Programs

Nested Epistemic Logic Programs Nested Epistemic Logic Programs Kewen Wang 1 and Yan Zhang 2 1 Griffith University, Australia k.wang@griffith.edu.au 2 University of Western Sydney yan@cit.uws.edu.au Abstract. Nested logic programs and

More information

Temporal Knowledge Acquisition From Multiple Experts

Temporal Knowledge Acquisition From Multiple Experts emporal Knowledge Acuisition From Multiple Experts Helen Kaikova, Vagan erziyan Metaintelligence Lab., Kharkov State echnical University of Radioelectronics, 4 Lenina Avenue, 3076 Kharkov, Ukraine, e-mail:

More information

The Concept of Geographic Relevance

The Concept of Geographic Relevance The Concept of Geographic Relevance Tumasch Reichenbacher, Stefano De Sabbata, Paul Crease University of Zurich Winterthurerstr. 190 8057 Zurich, Switzerland Keywords Geographic relevance, context INTRODUCTION

More information

Max-Sum Diversification, Monotone Submodular Functions and Semi-metric Spaces

Max-Sum Diversification, Monotone Submodular Functions and Semi-metric Spaces Max-Sum Diversification, Monotone Submodular Functions and Semi-metric Spaces Sepehr Abbasi Zadeh and Mehrdad Ghadiri {sabbasizadeh, ghadiri}@ce.sharif.edu School of Computer Engineering, Sharif University

More information

An Axiomatic Framework for Result Diversification

An Axiomatic Framework for Result Diversification An Axiomatic Framework for Result Diversification Sreenivas Gollapudi Microsoft Search Labs, Microsoft Research sreenig@microsoft.com Aneesh Sharma Institute for Computational and Mathematical Engineering,

More information

M E M O R A N D U M. Faculty Senate approved November 1, 2018

M E M O R A N D U M. Faculty Senate approved November 1, 2018 M E M O R A N D U M Faculty Senate approved November 1, 2018 TO: FROM: Deans and Chairs Becky Bitter, Sr. Assistant Registrar DATE: October 23, 2018 SUBJECT: Minor Change Bulletin No. 5 The courses listed

More information

An Entropy-based Method for Assessing the Number of Spatial Outliers

An Entropy-based Method for Assessing the Number of Spatial Outliers An Entropy-based Method for Assessing the Number of Spatial Outliers Xutong Liu, Chang-Tien Lu, Feng Chen Department of Computer Science Virginia Polytechnic Institute and State University {xutongl, ctlu,

More information

ArchaeoKM: Managing Archaeological data through Archaeological Knowledge

ArchaeoKM: Managing Archaeological data through Archaeological Knowledge Computer Applications and Quantitative Methods in Archeology - CAA 2010 Fco. Javier Melero & Pedro Cano (Editors) ArchaeoKM: Managing Archaeological data through Archaeological Knowledge A. Karmacharya

More information

A Large Deviation Bound for the Area Under the ROC Curve

A Large Deviation Bound for the Area Under the ROC Curve A Large Deviation Bound for the Area Under the ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich and Dan Roth Dept. of Computer Science University of Illinois Urbana, IL 680, USA {sagarwal,danr}@cs.uiuc.edu

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Dan Oneaţă 1 Introduction Probabilistic Latent Semantic Analysis (plsa) is a technique from the category of topic models. Its main goal is to model cooccurrence information

More information

Bayes theorem and its application to nuclear power plant safety

Bayes theorem and its application to nuclear power plant safety Bayes theorem and its application to nuclear power plant safety MATSUOKA Takeshi 1, 1 Collage of Nuclear Science and Technology, Harbin Engineering University No145-1, Nantong Street, Nangang District,

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

The Quantum Challenge in Concept Theory and Natural Language Processing

The Quantum Challenge in Concept Theory and Natural Language Processing The Quantum Challenge in Concept Theory and Natural Language Processing Diederik Aerts 1, Jan Broekaert 1, Sandro Sozzo 1 and Tomas Veloz 1,2 1 Center Leo Apostel for Interdisciplinary Studies and, Department

More information

Predicting Neighbor Goodness in Collaborative Filtering

Predicting Neighbor Goodness in Collaborative Filtering Predicting Neighbor Goodness in Collaborative Filtering Alejandro Bellogín and Pablo Castells {alejandro.bellogin, pablo.castells}@uam.es Universidad Autónoma de Madrid Escuela Politécnica Superior Introduction:

More information

NUMERICAL ACCURACY ISSUES IN USING EXCEL FOR SIMULATION STUDIES

NUMERICAL ACCURACY ISSUES IN USING EXCEL FOR SIMULATION STUDIES Proceedings of the 2004 Winter Simulation Conference R.G. Ingalls, M. D. Rossetti, J. S. Smith, and B. A. Peters, eds. NUMERICAL ACCURACY ISSUES IN USING EXCEL FOR SIMULATION STUDIES Kellie B. Keeling

More information

SRI Briefing Note Series No.8 Communicating uncertainty in seasonal and interannual climate forecasts in Europe: organisational needs and preferences

SRI Briefing Note Series No.8 Communicating uncertainty in seasonal and interannual climate forecasts in Europe: organisational needs and preferences ISSN 2056-8843 Sustainability Research Institute SCHOOL OF EARTH AND ENVIRONMENT SRI Briefing Note Series No.8 Communicating uncertainty in seasonal and interannual climate forecasts in Europe: organisational

More information

arxiv: v1 [math.gm] 12 Feb 2018

arxiv: v1 [math.gm] 12 Feb 2018 1. COMPLETE SETS arxiv:1802.05124v1 [math.gm] 12 Feb 2018 THEOPHILUS AGAMA Abstract. In this paper we introduce the concept of completeness of sets. We study this property on the set of integers. We examine

More information

Classification of Publications Based on Statistical Analysis of Scientific Terms Distributions

Classification of Publications Based on Statistical Analysis of Scientific Terms Distributions AUSTRIAN JOURNAL OF STATISTICS Volume 37 (2008), Number 1, 109 118 Classification of Publications Based on Statistical Analysis of Scientific Terms Distributions Vaidas Balys and Rimantas Rudzkis Institute

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Fitting Aggregation Functions to Data: Part II Idempotization

Fitting Aggregation Functions to Data: Part II Idempotization Fitting Aggregation Functions to Data: Part II Idempotization Maciej Bartoszuk 1, Gleb Beliakov 2, Marek Gagolewski 3,1, and Simon James 2 1 Faculty of Mathematics and Information Science, Warsaw University

More information

Means or "expected" counts: j = 1 j = 2 i = 1 m11 m12 i = 2 m21 m22 True proportions: The odds that a sampled unit is in category 1 for variable 1 giv

Means or expected counts: j = 1 j = 2 i = 1 m11 m12 i = 2 m21 m22 True proportions: The odds that a sampled unit is in category 1 for variable 1 giv Measures of Association References: ffl ffl ffl Summarize strength of associations Quantify relative risk Types of measures odds ratio correlation Pearson statistic ediction concordance/discordance Goodman,

More information

Fuzzy and Randomized Confidence Intervals and P-values

Fuzzy and Randomized Confidence Intervals and P-values Fuzzy and Randomized Confidence Intervals and P-values Charles J. Geyer and Glen D. Meeden December 6, 2004 Abstract. The optimal hypothesis tests for the binomial distribution and some other discrete

More information

What are the Findings?

What are the Findings? What are the Findings? James B. Rawlings Department of Chemical and Biological Engineering University of Wisconsin Madison Madison, Wisconsin April 2010 Rawlings (Wisconsin) Stating the findings 1 / 33

More information

Evaluation Strategies

Evaluation Strategies Evaluation Intrinsic Evaluation Comparison with an ideal output: Challenges: Requires a large testing set Intrinsic subjectivity of some discourse related judgments Hard to find corpora for training/testing

More information

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,

More information

RSA ENCRYPTION USING THREE MERSENNE PRIMES

RSA ENCRYPTION USING THREE MERSENNE PRIMES Int. J. Chem. Sci.: 14(4), 2016, 2273-2278 ISSN 0972-768X www.sadgurupublications.com RSA ENCRYPTION USING THREE MERSENNE PRIMES Ch. J. L. PADMAJA a*, V. S. BHAGAVAN a and B. SRINIVAS b a Department of

More information

Bayesian performance

Bayesian performance Bayesian performance Frequentist properties of estimators refer to the performance of an estimator (say the posterior mean) over repeated experiments under the same conditions. The posterior distribution

More information

CONFIDENCE INTERVAL DETERMINATION OF ON-TIME PERFORMANCE WITHIN MEASUREMENT OF THE TRANSIT TIME FOR SINGLE PIECE PRIORITY MAIL

CONFIDENCE INTERVAL DETERMINATION OF ON-TIME PERFORMANCE WITHIN MEASUREMENT OF THE TRANSIT TIME FOR SINGLE PIECE PRIORITY MAIL CONFIDENCE INTERVAL DETERMINATION OF ON-TIME PERFORMANCE WITHIN MEASUREMENT OF THE TRANSIT TIME FOR SINGLE PIECE PRIORITY MAIL Libor Švadlenka Kateřina Pojkarová Daniel Salava Abstract The paper is focused

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

Machine Learning Methods and Real-Time Economics

Machine Learning Methods and Real-Time Economics Machine Learning Methods and Real-Time Economics Volodymyr Dorozhinsky 1, Arsen Khadikov 1, and Nina Zholtkevych 2 School of Math. and Comp. Sci., V.N. Karazin Kharkiv National University 1 School of Economics,

More information

Fuzzy and Randomized Confidence Intervals and P -values

Fuzzy and Randomized Confidence Intervals and P -values Fuzzy and Randomized Confidence Intervals and P -values Charles J. Geyer and Glen D. Meeden May 23, 2005 Abstract. The optimal hypothesis tests for the binomial distribution and some other discrete distributions

More information

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world

More information

Fuzzy Propositional Logic for the Knowledge Representation

Fuzzy Propositional Logic for the Knowledge Representation Fuzzy Propositional Logic for the Knowledge Representation Alexander Savinov Institute of Mathematics Academy of Sciences Academiei 5 277028 Kishinev Moldova (CIS) Phone: (373+2) 73-81-30 EMAIL: 23LSII@MATH.MOLDOVA.SU

More information

DATA MINING LECTURE 10

DATA MINING LECTURE 10 DATA MINING LECTURE 10 Classification Nearest Neighbor Classification Support Vector Machines Logistic Regression Naïve Bayes Classifier Supervised Learning 10 10 Illustrating Classification Task Tid Attrib1

More information

BAYESIAN inference is rooted from the well-known

BAYESIAN inference is rooted from the well-known , June 29 - July 1, 2016, London, U.K. Crafting a Lightweight Bayesian Inference Engine Feng-Jen Yang, Member, IAENG Abstract Many expert systems are counting on Bayesian inference to perform probabilistic

More information

Estimation and sample size calculations for correlated binary error rates of biometric identification devices

Estimation and sample size calculations for correlated binary error rates of biometric identification devices Estimation and sample size calculations for correlated binary error rates of biometric identification devices Michael E. Schuckers,11 Valentine Hall, Department of Mathematics Saint Lawrence University,

More information

A set theoretic view of the ISA hierarchy

A set theoretic view of the ISA hierarchy Loughborough University Institutional Repository A set theoretic view of the ISA hierarchy This item was submitted to Loughborough University's Institutional Repository by the/an author. Citation: CHEUNG,

More information

A New Approach to Knowledge Base Revision in DL-Lite

A New Approach to Knowledge Base Revision in DL-Lite A New Approach to Knowledge Base Revision in DL-Lite Zhe Wang and Kewen Wang and Rodney Topor School of ICT, Griffith University Nathan, QLD 4111, Australia Abstract Revising knowledge bases (KBs) in description

More information

Power Comparison of Exact Unconditional Tests for Comparing Two Binomial Proportions

Power Comparison of Exact Unconditional Tests for Comparing Two Binomial Proportions Power Comparison of Exact Unconditional Tests for Comparing Two Binomial Proportions Roger L. Berger Department of Statistics North Carolina State University Raleigh, NC 27695-8203 June 29, 1994 Institute

More information