Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University




2 Why uncertainty? Why should data mining care about uncertainty? We are living in a world where everything is possible! The world is full of uncertainty. The data that describe the world express the uncertainty. The way we seek model structures contains uncertainty. Modeling uncertainty is a necessary component of almost all data analysis.

3 Formalizing the uncertainty. Probability: the most widely used tool for modeling uncertainty, with a theoretical backbone and widespread application and acceptance. Other tools: fuzzy logic, rough sets.

4 Two views of probability. Frequentist (objective, based only on data): the probability of an event is defined as the limiting proportion of times that the event would occur in repetitions of essentially identical situations. Bayesian (subjective, with a belief): Bayesian statistics explicitly characterizes all forms of uncertainty in data analysis: uncertainty about any parameters to be estimated from data; uncertainty as to which among the set of model structures is the best or closest to the ground truth; and uncertainty in any forecast to be made.
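The frequentist definition can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the event probability 0.3, the trial counts, and the function name `relative_frequency` are assumptions chosen for illustration.

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Fraction of n_trials simulated trials in which an event of probability p occurs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if rng.random() < p)
    return hits / n_trials

# The proportion of occurrences should settle near p as the number of
# repetitions grows, which is the "limiting proportion" in the definition.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(0.3, n))
```

With few trials the proportion can be far from 0.3; with many trials it typically lands close to it.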

5 Models and data. [Diagram: MODEL to DATA via probability; DATA to MODEL via statistical inference.] Probability specifies how the concerned properties of the observed data can be generated from the models. Statistical inference makes statements about the concerned properties of the population.

6 Estimation. Usually, numerous models can generate the observed data points. How do we determine which model to use? Statistical inference: 1. Assume a model family (or form). 2. Estimate the parameters to determine the specific model. That's it!

7 Properties of estimators. Bias: the difference between the expected value of the estimator and the true value of the parameter. Unbiased estimators show no systematic departure from the true value. Variance: measures the random, data-driven component of error in our estimation procedure. Mean squared error: the mean of the squared difference between the estimated value and the true value of a parameter. Consistency: the difference between the estimated value and the true value approaches 0 as the sample size increases.
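The properties above can be checked empirically by repeating an estimation many times. The sketch below is an illustration under assumptions that are not in the slides: the data are normal with true variance 4, each sample has 5 points, and the two estimators compared are the sample variance dividing by n and by n - 1. It also relies on the standard decomposition MSE = bias^2 + variance.

```python
import random
import statistics

def summarize_estimator(estimates, true_value):
    """Return (bias, variance, mse) computed from repeated estimates."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_value
    variance = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse

def simulate_variance_estimators(n=5, sigma=2.0, trials=20_000, seed=0):
    """Estimate the variance of N(0, sigma^2) from many small samples."""
    rng = random.Random(seed)
    by_n, by_n_minus_1 = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
        by_n.append(ss / n)              # biased estimator
        by_n_minus_1.append(ss / (n - 1))  # unbiased estimator
    true_var = sigma ** 2
    return (summarize_estimator(by_n, true_var),
            summarize_estimator(by_n_minus_1, true_var))

biased, unbiased = simulate_variance_estimators()
print("divide by n:   bias=%.3f var=%.3f mse=%.3f" % biased)
print("divide by n-1: bias=%.3f var=%.3f mse=%.3f" % unbiased)
```

In this setting the divide-by-n estimator shows a systematic negative bias yet a smaller mean squared error, which is why bias alone is not the whole story when comparing estimators.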

8 Maximum likelihood estimation (MLE). Likelihood: the likelihood describes the probability of the observed data being generated from the model, conditioned on the given parameter. It is the probability of the observed data rather than the unseen. Examples are assumed to be i.i.d. The MLE: the value of the parameter for which the data have the highest probability of having arisen. MLE finds the most likely model that generates the data.
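As a concrete case, consider i.i.d. 0/1 observations under a Bernoulli model, where the log-likelihood is the sum of log p(x_i | θ). The sketch below is an assumed example, not taken from the slides: the data, the grid resolution, and the function names are all illustrative. It finds the maximizer by a brute-force grid search and compares it with the closed-form answer (the sample proportion).

```python
import math

def bernoulli_log_likelihood(theta, data):
    """log L(theta) = sum_i log p(x_i | theta) for i.i.d. 0/1 observations."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

def mle_by_grid(data, steps=1000):
    """Pick the grid value in the open interval (0, 1) with the highest likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda theta: bernoulli_log_likelihood(theta, data))

def mle_closed_form(data):
    """For the Bernoulli model the MLE is the fraction of ones."""
    return sum(data) / len(data)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical observations: 7 ones out of 10
print(mle_by_grid(data), mle_closed_form(data))
```

The grid search is only there to make "the value with the highest probability of having arisen" visible; in practice one would use the closed form or a numerical optimizer.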

9 Maximum likelihood estimation (MLE)

10 Maximum likelihood estimation (MLE): remarks on MLE. If $g(\cdot)$ is a one-to-one function, then $g(\hat{\theta})$ is an MLE of $g(\theta)$ when $\hat{\theta}$ is an MLE of $\theta$. MLE is a point estimate, which only cares about the single best value. To capture some uncertainty, one can compute a confidence interval, which specifies a region containing the true value at a stated confidence level: use the normal approximation based on the central limit theorem, or use bootstrapping.
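Both ways of building a confidence interval can be sketched for the proportion of ones in 0/1 data. This is an illustration under assumptions not in the slides: a 95% level (z = 1.96), the percentile variant of the bootstrap, 2000 resamples, and a made-up data set with 60 ones out of 200.

```python
import math
import random

def normal_approx_ci(data, z=1.96):
    """Interval p_hat +/- z * standard error, justified by the central limit theorem."""
    n = len(data)
    p_hat = sum(data) / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def bootstrap_percentile_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement and re-estimate each time."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_boot))
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

observations = [1] * 60 + [0] * 140  # hypothetical data, sample proportion 0.3
print("normal approximation:", normal_approx_ci(observations))
print("bootstrap percentile:", bootstrap_percentile_ci(observations))
```

For a sample of this size the two intervals should be close; they can differ noticeably for small samples or proportions near 0 or 1, where the normal approximation is weaker.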

11 Bayesian estimation. Bayesian statistics treats the data as known and the parameters as random variables. A prior distribution is associated with the parameter to express our prior belief about where the true parameter may be. Analysis of the data D modifies this distribution to take the empirical data into account, yielding the posterior distribution: a distribution instead of a single value. Maximum a posteriori (MAP): select the mode of this posterior distribution as the MAP estimate.
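A small worked case may help here. The sketch below assumes a Beta(a, b) prior on the success probability of 0/1 data, a model choice that is not in the slides; under that assumption the posterior is Beta(a + k, b + n - k) for k ones in n observations, and its mode gives the MAP estimate. The data and prior parameters are hypothetical.

```python
def beta_posterior(a, b, data):
    """Update a Beta(a, b) prior with i.i.d. 0/1 data; returns the posterior (a, b)."""
    k = sum(data)
    n = len(data)
    return a + k, b + (n - k)

def beta_mode(a, b):
    """Mode of Beta(a, b); defined by this formula only when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

def beta_mean(a, b):
    return a / (a + b)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]        # hypothetical data: 7 ones out of 10
post_a, post_b = beta_posterior(2, 2, data)  # prior Beta(2, 2), mildly favoring 0.5
print("posterior:", (post_a, post_b))
print("MAP estimate:", beta_mode(post_a, post_b))
print("posterior mean:", beta_mean(post_a, post_b))
```

The MAP estimate is pulled from the sample proportion toward the prior's center. With a uniform Beta(1, 1) prior the posterior mode coincides with the maximum likelihood estimate.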

12 Bayesian estimation: belief vs. evidence. Evidence (the likelihood): which value of θ is likely to have generated the evidence? This term is fixed once the model is assumed. Belief (the prior): where we believe the value should be. This term could vary a lot! Different priors may lead to different posterior distributions, so a good balance should be struck when choosing the prior.

13 Bayesian estimation. How to select the prior distribution? Ideal: code the domain knowledge into the prior, based on one's own experience or belief. Cons: too subjective; it varies from person to person. Reference prior: less subjective (e.g., the Jeffreys prior). Conjugate prior: the resulting posterior distribution is in the same family as the prior distribution (e.g., the normal distribution).
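Conjugacy for the normal case can be sketched as follows. The assumptions here are not in the slides: the data are normal with a known noise variance, and the prior on the unknown mean is normal. Under those assumptions the posterior on the mean is again normal, with precision equal to the prior precision plus n divided by the noise variance. All numbers are hypothetical.

```python
def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior (mean, variance) for the mean of normal data with known noise_var."""
    n = len(data)
    precision = 1.0 / prior_var + n / noise_var  # precisions add
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var

measurements = [4.8, 5.2, 5.1, 4.9, 5.0]  # hypothetical data, sample mean 5.0

# A vague prior lets the data dominate; a confident prior at 0 resists them.
vague = normal_posterior(0.0, 1e6, measurements, noise_var=1.0)
confident = normal_posterior(0.0, 0.01, measurements, noise_var=1.0)
print("vague prior     ->", vague)
print("confident prior ->", confident)
```

The same data lead to very different posteriors under the two priors, which is the sensitivity to prior choice that motivates reference and conjugate priors.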

14 Hypothesis testing. Hypothesis testing is used to test whether the observed data support some idea about the value of a parameter. Key assumption: a random sample has been drawn from some distribution, and the aim of the test is to make a probability statement about a parameter of that distribution. General steps: 1. State the null hypothesis (H0) vs. the alternative hypothesis (H1). 2. Determine the distribution of some chosen statistic related to the nature of the problem. 3. Determine a rejection region (critical region). 4. If the observed value falls into the rejection region, accept H1; otherwise, retain H0.
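The general steps can be traced in a two-sided z-test for a mean. This is a minimal sketch under assumptions not stated in the slides: the population is normal with a known standard deviation, the significance level is 0.05, and the sample values are made up.

```python
import math
from statistics import NormalDist, fmean

def two_sided_z_test(sample, mu0, sigma, alpha=0.05):
    """Test H0: mean == mu0 against H1: mean != mu0, with known sigma."""
    n = len(sample)
    # Step 2: under H0 the statistic z follows a standard normal distribution.
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
    # Step 3: the rejection region is |z| > critical value.
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    # Step 4: decide by checking whether z falls in the rejection region.
    return {"z": z, "critical": critical, "p_value": p_value,
            "reject_h0": abs(z) > critical}

sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.9, 5.7, 5.2]  # hypothetical data
result = two_sided_z_test(sample, mu0=5.0, sigma=0.5)
print(result)
```

When sigma is unknown and estimated from the sample, a t-test would be the usual replacement for this z-test.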

15 Widely used hypothesis tests. Goodness-of-fit: use the test to compare an observed distribution with a hypothesized distribution (pairwise t-test, chi-square test). Distribution-free tests: no assumption is made about the distribution from which the sample is drawn (sign test, rank sum test, Wilcoxon test, among others).
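Of the distribution-free tests listed, the sign test is simple enough to write out in full. The sketch below is an illustration with made-up paired differences: under the null hypothesis of zero median difference, the number of positive signs follows a Binomial(n, 1/2) distribution, and the two-sided p-value is computed exactly from it. Dropping zero differences is a common convention assumed here.

```python
import math

def sign_test_p_value(differences):
    """Exact two-sided sign test p-value for paired differences."""
    nonzero = [d for d in differences if d != 0]  # ties carry no sign information
    n = len(nonzero)
    if n == 0:
        return 1.0
    k = sum(1 for d in nonzero if d > 0)  # number of positive signs

    def binom_cdf(m):
        """P(X <= m) for X ~ Binomial(n, 1/2); equals 0 when m < 0."""
        return sum(math.comb(n, i) for i in range(0, m + 1)) / 2 ** n

    lower_tail = binom_cdf(k)            # P(X <= k)
    upper_tail = 1.0 - binom_cdf(k - 1)  # P(X >= k)
    return min(1.0, 2.0 * min(lower_tail, upper_tail))

# Hypothetical after-minus-before differences: 9 positive, 1 negative.
differences = [0.5, 1.2, 0.3, 0.8, -0.1, 0.9, 1.1, 0.4, 0.7, 0.6]
print(sign_test_p_value(differences))
```

Only the signs are used, so the test makes no assumption about the shape of the distribution of the differences, at the cost of ignoring their magnitudes.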

16 Sampling. Sometimes we need to conduct data mining over a sample of data points instead of the entire database. Why sampling? The major issue is efficiency! How to create a qualified sample? Try to maintain the distribution information of the original data set. Try to avoid systematic sampling (i.e., avoid injecting certain characteristics of the sampling method into the sampled data).

17 Widely used sampling methods. Simple random sampling: sample each data point with equal probability (without replacement). Bootstrap sampling: sample each data point with equal probability (with replacement). Stratified sampling: split the data into nonoverlapping strata and draw a sample separately from within each stratum. Cluster sampling: if the data are naturally grouped into clusters, sample clusters instead of data points, and then use either a simple random sample of, or all of, the data points in the sampled clusters.
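The four methods can be sketched with the standard library. The function names, the `stratum_of` callback, the proportional allocation in the stratified sampler, and the toy data are all assumptions made for illustration; the cluster sampler below takes every point in each chosen cluster.

```python
import random
from collections import defaultdict

def simple_random_sample(data, k, rng):
    """k distinct points, each equally likely (without replacement)."""
    return rng.sample(data, k)

def bootstrap_sample(data, rng):
    """len(data) points drawn with replacement, so repeats are expected."""
    return rng.choices(data, k=len(data))

def stratified_sample(data, stratum_of, fraction, rng):
    """Sample the same fraction separately within each stratum."""
    strata = defaultdict(list)
    for item in data:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random, then keep every point in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]

rng = random.Random(0)
records = [(i, "A" if i < 80 else "B") for i in range(100)]  # 80% A, 20% B
print(simple_random_sample(records, 5, rng))
print(stratified_sample(records, lambda r: r[1], 0.1, rng))
```

Stratified sampling keeps the 80/20 split of the toy data exactly, whereas a simple random sample only preserves it on average.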

18 Let's move to Chapter 5
