CS284A: Representations and Algorithms in Molecular Biology

Size: px
Start display at page:

Download "CS284A: Representations and Algorithms in Molecular Biology"

Transcription

1 CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by Professor Xiaohui Xie o Jauary 14 & 16, 2008 I Motif Discovery via Eumeratio A A Model for Motif Discovery (Review from Lecture 2) We wat to idetify biologically sigificat motifs i a set S of sequeces, s 1, s 2,, s Each potetially sigificat motif m i of legth w is associated with a summatio variable k i, which is the total umber of sequeces from S i which the motif appears To systematically measure this sigificace, we must first fid the uderlyig probability p ay sequece of legth l cotais ay theoretical motif of legth w With the overridig assumptio that the four bases are uiformly distributed, or ( P(A),P(C),P(G),P(T) ) = 1 4, 1 4, 1 4, 1 % # 4 ', we have calculated a value for p of & lw+1 # & % ( We use p as the probability of success for fidig this 4 w ' theoretical motif each time we sample a sequece from set S For k out of trials, the probability of success is biomial, P( k) = % ' p k ( 1( p) (k, # k& %! where ' = # k& k! ( ( k)!, as a motif either is i a sequece or is ot To test the sigificace of our specific motif m i, we evaluate a p-value, or the probability, based o our distributio, that m i would appear i at least k i sequeces:

2 2 # P( k) = % & ( p k ( 1) p) )k k' k= k i k= k i If the p-value is smaller tha a chose sigificace level, 1 we ca say with some cofidece that our motif m i is biologically sigificat For large the biomial distributio is approximated by a ormal distributio, ad we ca map k i to a ew distributio ad compute the z-score to determie the sigificace of our motif m i B Problems with this Model 1 The assumptio that the four bases are uiformly distributed i the sequeces is ot ecessarily correct To be more accurate, we would eed to model the first-order statistics (ie, P(A), P(C), P(G), ad P(T)) of the ucleotide distributio 2 The model igores secod-order statistics Two bases might be more likely paired together tha distributed at radom (eg P(GA) P(G)P(A) ) The same could be also said for higher-order statistics C Cotrol Sequeces I order ot to rely o the assumptio of uiform distributio of bases to measure sigificace, we ca geerate a set of N cotrol sequeces, s o 1, s o 2,,s o N The assumptio is that our motif of iterest m i is ot sigificat i the cotrol sequeces Now we have two sets of sequeces Each m i is associated with two values k i ad k o i, which correspod with the umber of differet sequeces this motif appears i the sets of sequeces S ad S o, respectively Now to fid out if our motif m i is biologically sigificat, we choose the appropriate probability distributio for successfully fidig a motif i k out of trials There are two types to choose from: 1 The biomial distributio If the set S is idepedet of S o, we ca still model the probability of success P(k) o fidig a motif i k out of trials usig the biomial distributio If S S o (ie, the set S is a subset of S o ), choosig the appropriate distributio ow depeds o the size of both sets ad the distributio of our motif m i i them If the umber N of S o sequeces ad the umber k i o of sequeces cotaiig our motif are large compared to the umber of S sequeces, the the probability p of radomly pickig a sequece with our motif remais essetially uchaged for trials, ad we could still model the probability P(k)

3 3 usig the biomial distributio 2 For these scearios the oly chage we eed to make from the model i Part A is to adopt a differet uderlyig probability p of success for fidig a motif every time we sample a sequece For p we will use the relative frequecy k i N our motif m i is foud i the set S o This way, whe we ru k trials, we ca compare the distributios from both S ad S o to see if our motif ideed stads out i S The probability of success o k out of trials may be writte as P(k) = % # k& ' k o k % i ' 1( k o % i ' # N & # N & To test the sigificace of our motif, we calculate the p-value i the same fashio as we did before: P(k) For large we ca agai map k i to a ormal distributio with mea p ad variace p(1-p) ad compute the z-score 2 The hypergeometric distributio k= k i If S S o ad if either N or k i o is ot large compared to for a give m i, the sequece of trials is aalogous to samplig without replacemet The probability p of radomly pickig a sequece with our motif chages sigificatly over trials Hece, we caot use the biomial distributio, which assumes the same p for all trials The appropriate distributio is hypergeometric, where the probability of success o fidig a motif i k out of trials is P( k) = K% ' N ( K % ' # k &# ( k &, N% ' # & where K % ' is the umber of ways of choosig k sequeces with a # k & # N K& motif from the total umber K of sequeces with that motif, % ( is k ' the umber of ways of choosig -k sequeces without the motif from N% the total umber N-K of sequeces without the motif, ad ' is the # & (k o

4 4 umber of ways of choosig sequeces from the total umber N sequeces While usig this distributio to test the sigificace of our particular motif m i, we assig k o i to the value K Like before we calculate the p-value usig the summatio P(k) We caot compute a z-score here, as a ormal distributio does ot approximate a hypergeometric distributio for large k= k i II Represetatio of a Motif Usig a Positio Weight Matrix A What is a Positio Weight Matrix? Motifs are hardly ever represeted accurately by a uique cosecutive sequece of A s, C s, G s ad T s Istead, we create a positio weight matrix (PWM) to represet the frequecies of each base at each positio i the motif: G A T C Sometimes a positio weight matrix is represeted by a sequece logo, where the height of the letters represetig the ucleotides correlates with the frequecy that base is foud i differet sequeces cotaiig the motif: From the example above, positio 1 is said to be degeerate; there is o sigle ucleotide that represets the motif here O the other had positio 3 is said to be striget because the motif is well represeted by adeosie B Mathematical Represetatio of a Positio Weight Matrix The positio weight matrix for a motif of width w ca be expressed as

5 5 # w1 & % ( = % w2 (, % w3 ( % ( w4 ' where each row j represets A, C, G, or T, ad each colum i represets oe positio of the motif, ad is ormalized: 4 # ij =1 j=1 for all i = 1, 2, w For example θ 23 is the relative frequecy that guaie is foud i positio 2 of the motif C Likelihood of a Sequece If all the relative frequecies θ ij are give for the positio weight matrix θ, we ca measure the probability of geeratig a sequece S = (s 1, s 2,, s w ) This is also kow as the likelihood L(θ) of the sequece For example we ca use a positio weight matrix of width w = 3 to calculate likelihood of the sequece GGG It is simply the product of three relative frequecies θ 13, θ 23, ad θ 33 Geeralizig this usig mathematics, we fid the likelihood of a sequece S = (s 1, s 2,, s w ) give θ i is L() = P S ( ) = ij I( s i = j) where I s i = j w 4 #, i=1 j=1 ( ) = 1 if s i = j # 0 if ot Let us briefly go over a few sytax elemets First of all, the expressio P(S θ) represets a coditioal probability: We are askig, What is the likelihood of sequece S give the coditio that the positio weight matrix is θ? Secodly, the (ie, capital pi) otatio meas we take the product of the associated terms Fially, for coveiece we coverted the alphabetical strig (A, C, G, T) ito a umerical oe (1, 2, 3, 4) These umbers are represeted by the variable j i the above expressio Other ways of expressig the likelihood L(θ) are

6 6 L() = P S w # ( ) = P( s i i ) i=1 w = # i,si The coditioal probability P(s i θ i ) is the probability of geeratig a ucleotide elemet s i give its relative frequecy θ i We ca expad this idea further ad measure the likelihood for a set of sequeces S 1, S 2,, S give θ Sice we are assumig each sequece S k is geerated idepedetly from θ, this probability is simply the product of the relative frequecies i,ski represetig each ucleotide elemet s ki : L() = P S 1,,S i=1 ( ) = P( S k ) # w ## = i,ski Note that the sytax P(S 1, S 2,, S θ) represets a joit probability the probability of geeratig sequeces S 1, S 2,, S as well as a coditioal probability the probability give θ i=1 D Usig Maximum Likelihood to Estimate the Positioal Weight Matrix θ Ofte times we wat to costruct a positio weight matrix θ of legth w from observed sequece data For a set of sequeces S 1, S 2,, S represeted by the same θ, our strategy is to maximize the likelihood L(θ) over all possible values of θ ij This could be doe by settig the partial derivative L(#) # ij equal to zero ad solvig for θ ij ; however, it is much easier to take the partial derivative with respect to the log-likelihood fuctio (ie, the logarithm of the likelihood) ad set it to zero logl(#) # ij = 0 because the product associated with the likelihood L(θ) turs ito a sum Note that there are oly 3w ad ot 4w parameters for which we eed to solve, sice if we figure out θ i1, θ i2, ad θ i3, we ca use the relatio # ij =1to give us θ i4 4 j=1

7 7 Usig this method o a set of sequeces S 1, S 2,, S, all with the same θ, we ca derive a expressio for the relative frequecy ij = ij, which is simply the absolute frequecy of each ucleotide j for every colum i, divided by the total umber of sequeces Ofte times it is much harder to solve for the positio weight matrix θ It is quite likely withi a set of give sequeces S 1, S 2,, S that oly some sequeces cotai the motif, ad thus oly this subset ca geerate the weight matrix θ The problem is we do ot kow which sequeces form this subset Let us assume the rest of the o-motif (also called backgroud) sequeces form a subset geerated from a sigle distributio (ie, from a secod positio weight matrix θ o made up of idetical colums of p o = (p o A, p o C, p o G, p o T) = (p o 1, p o 2, p o 3, p o 4) The likelihood L(θ, θ o ) for this set of sequeces S 1, S 2,, S is ow ( ) = [ z k P( S k ) + ( 1# z k )P( S k o )] L(, o ) = P S 1,,S z,, o, # where z k = 1 if S k is geerated by % 0 if S k is geerated by o The problem of ot kowig if a sequece S k belogs to the motif (θ) or the backgroud model (θ o ) ca ow be expressed mathematically as ot kowig which value 0 or 1 to use for the biary fuctio z k associated with each S k Fortuately, we ca remove z from the equatio by itegratig the likelihood L(θ, θ o ) over all possible evets z: 3 ( ) = P( S 1,,S z,, o ) P S 1,,S, o After itegratio, we are left with L(, o ) = P S 1,,S, o # P( z) z ( ) = [ P( z k )P( S k ) + ( 1# P( z k ))P( S k o )] We may be fortuate to kow the probability P(z k =1) for the set of sequeces S 1, S 2,, S Represetig this probability as the costat α, the likelihood of the set may ow be writte as

8 8 ( ) = %[#P( S k ) + ( 1# )P( S k o )] L(, o ) = P S 1,,S, o Havig successfully expressed the likelihood as a fuctio of 3w o idepedet variables i,ski ad 3 idepedet variables i,ski, we ca ow use o our strategy of solvig for i,ski ad i,ski whe the likelihood is at a maximum However, settig the partial derivatives of the log-likelihood fuctio equal to zero is too difficult a task because the likelihood L(θ, θ o ) i this case is simply ot just a product of the idepedet variables We will implemet the EM Algorithm ext lecture to solve this maximum likelihood estimatio problem 1 Wikipedia, P-value, 2 The relative frequecy k o i N the motif is foud i the set So must also ot be close to 0 or 1 3 I geeral we ca calculate a margial probability from a coditioal or joit probability by removig oe of the variables usig itegratio ( ) = P( X,Y) P X = P( X Y) P( Y), Y where we take the sum over all possible evets Y From R Durbi, S Eddy, A Krogh, ad G Mitchiso, Biological Sequece Aalysis, Cambridge Uiversity Press, 2006, p 6 Y

Bayesian Methods: Introduction to Multi-parameter Models

Bayesian Methods: Introduction to Multi-parameter Models Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

Direction: This test is worth 250 points. You are required to complete this test within 50 minutes.

Direction: This test is worth 250 points. You are required to complete this test within 50 minutes. Term Test October 3, 003 Name Math 56 Studet Number Directio: This test is worth 50 poits. You are required to complete this test withi 50 miutes. I order to receive full credit, aswer each problem completely

More information

Lecture 12: November 13, 2018

Lecture 12: November 13, 2018 Mathematical Toolkit Autum 2018 Lecturer: Madhur Tulsiai Lecture 12: November 13, 2018 1 Radomized polyomial idetity testig We will use our kowledge of coditioal probability to prove the followig lemma,

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Class 23. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700

Class 23. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700 Class 23 Daiel B. Rowe, Ph.D. Departmet of Mathematics, Statistics, ad Computer Sciece Copyright 2017 by D.B. Rowe 1 Ageda: Recap Chapter 9.1 Lecture Chapter 9.2 Review Exam 6 Problem Solvig Sessio. 2

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Lecture 14: Graph Entropy

Lecture 14: Graph Entropy 15-859: Iformatio Theory ad Applicatios i TCS Sprig 2013 Lecture 14: Graph Etropy March 19, 2013 Lecturer: Mahdi Cheraghchi Scribe: Euiwoog Lee 1 Recap Bergma s boud o the permaet Shearer s Lemma Number

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Axioms of Measure Theory

Axioms of Measure Theory MATH 532 Axioms of Measure Theory Dr. Neal, WKU I. The Space Throughout the course, we shall let X deote a geeric o-empty set. I geeral, we shall ot assume that ay algebraic structure exists o X so that

More information

Table 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab

Table 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen) Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Direction: This test is worth 150 points. You are required to complete this test within 55 minutes.

Direction: This test is worth 150 points. You are required to complete this test within 55 minutes. Term Test 3 (Part A) November 1, 004 Name Math 6 Studet Number Directio: This test is worth 10 poits. You are required to complete this test withi miutes. I order to receive full credit, aswer each problem

More information

( ) = p and P( i = b) = q.

( ) = p and P( i = b) = q. MATH 540 Radom Walks Part 1 A radom walk X is special stochastic process that measures the height (or value) of a particle that radomly moves upward or dowward certai fixed amouts o each uit icremet of

More information

Ma 530 Introduction to Power Series

Ma 530 Introduction to Power Series Ma 530 Itroductio to Power Series Please ote that there is material o power series at Visual Calculus. Some of this material was used as part of the presetatio of the topics that follow. What is a Power

More information

Recurrence Relations

Recurrence Relations Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial(-)); } Let t be the umber of multiplicatios eeded to calculate factorial(). The

More information

Introduction to Computational Molecular Biology. Gibbs Sampling

Introduction to Computational Molecular Biology. Gibbs Sampling 18.417 Itroductio to Computatioal Molecular Biology Lecture 19: November 16, 2004 Scribe: Tushara C. Karuarata Lecturer: Ross Lippert Editor: Tushara C. Karuarata Gibbs Samplig Itroductio Let s first recall

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

Statisticians use the word population to refer the total number of (potential) observations under consideration

Statisticians use the word population to refer the total number of (potential) observations under consideration 6 Samplig Distributios Statisticias use the word populatio to refer the total umber of (potetial) observatios uder cosideratio The populatio is just the set of all possible outcomes i our sample space

More information

1 Hash tables. 1.1 Implementation

1 Hash tables. 1.1 Implementation Lecture 8 Hash Tables, Uiversal Hash Fuctios, Balls ad Bis Scribes: Luke Johsto, Moses Charikar, G. Valiat Date: Oct 18, 2017 Adapted From Virgiia Williams lecture otes 1 Hash tables A hash table is a

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam. Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the

More information

Lecture 11 and 12: Basic estimation theory

Lecture 11 and 12: Basic estimation theory Lecture ad 2: Basic estimatio theory Sprig 202 - EE 94 Networked estimatio ad cotrol Prof. Kha March 2 202 I. MAXIMUM-LIKELIHOOD ESTIMATORS The maximum likelihood priciple is deceptively simple. Louis

More information

CSE 527, Additional notes on MLE & EM

CSE 527, Additional notes on MLE & EM CSE 57 Lecture Notes: MLE & EM CSE 57, Additioal otes o MLE & EM Based o earlier otes by C. Grat & M. Narasimha Itroductio Last lecture we bega a examiatio of model based clusterig. This lecture will be

More information

Introduction to Computational Biology Homework 2 Solution

Introduction to Computational Biology Homework 2 Solution Itroductio to Computatioal Biology Homework 2 Solutio Problem 1: Cocave gap pealty fuctio Let γ be a gap pealty fuctio defied over o-egative itegers. The fuctio γ is called sub-additive iff it satisfies

More information

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018)

Randomized Algorithms I, Spring 2018, Department of Computer Science, University of Helsinki Homework 1: Solutions (Discussed January 25, 2018) Radomized Algorithms I, Sprig 08, Departmet of Computer Sciece, Uiversity of Helsiki Homework : Solutios Discussed Jauary 5, 08). Exercise.: Cosider the followig balls-ad-bi game. We start with oe black

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

Chapter 4. Fourier Series

Chapter 4. Fourier Series Chapter 4. Fourier Series At this poit we are ready to ow cosider the caoical equatios. Cosider, for eample the heat equatio u t = u, < (4.) subject to u(, ) = si, u(, t) = u(, t) =. (4.) Here,

More information

Sample Size Estimation in the Proportional Hazards Model for K-sample or Regression Settings Scott S. Emerson, M.D., Ph.D.

Sample Size Estimation in the Proportional Hazards Model for K-sample or Regression Settings Scott S. Emerson, M.D., Ph.D. ample ie Estimatio i the Proportioal Haards Model for K-sample or Regressio ettigs cott. Emerso, M.D., Ph.D. ample ie Formula for a Normally Distributed tatistic uppose a statistic is kow to be ormally

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Expectation-Maximization Algorithm.

Expectation-Maximization Algorithm. Expectatio-Maximizatio Algorithm. Petr Pošík Czech Techical Uiversity i Prague Faculty of Electrical Egieerig Dept. of Cyberetics MLE 2 Likelihood.........................................................................................................

More information

Mixtures of Gaussians and the EM Algorithm

Mixtures of Gaussians and the EM Algorithm Mixtures of Gaussias ad the EM Algorithm CSE 6363 Machie Learig Vassilis Athitsos Computer Sciece ad Egieerig Departmet Uiversity of Texas at Arligto 1 Gaussias A popular way to estimate probability desity

More information

AMS570 Lecture Notes #2

AMS570 Lecture Notes #2 AMS570 Lecture Notes # Review of Probability (cotiued) Probability distributios. () Biomial distributio Biomial Experimet: ) It cosists of trials ) Each trial results i of possible outcomes, S or F 3)

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

L = n i, i=1. dp p n 1

L = n i, i=1. dp p n 1 Exchageable sequeces ad probabilities for probabilities 1996; modified 98 5 21 to add material o mutual iformatio; modified 98 7 21 to add Heath-Sudderth proof of de Fietti represetatio; modified 99 11

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Simulation. Two Rule For Inverting A Distribution Function

Simulation. Two Rule For Inverting A Distribution Function Simulatio Two Rule For Ivertig A Distributio Fuctio Rule 1. If F(x) = u is costat o a iterval [x 1, x 2 ), the the uiform value u is mapped oto x 2 through the iversio process. Rule 2. If there is a jump

More information

Chapter 2 The Monte Carlo Method

Chapter 2 The Monte Carlo Method Chapter 2 The Mote Carlo Method The Mote Carlo Method stads for a broad class of computatioal algorithms that rely o radom sampligs. It is ofte used i physical ad mathematical problems ad is most useful

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Shannon s noiseless coding theorem

Shannon s noiseless coding theorem 18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology 6.0/6.3: Probabilistic Systems Aalysis (Fall 00) Problem Set 8: Solutios. (a) We cosider a Markov chai with states 0,,, 3,, 5, where state i idicates that there are i shoes available at the frot door i

More information

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 5 STATISTICS II. Mea ad stadard error of sample data. Biomial distributio. Normal distributio 4. Samplig 5. Cofidece itervals

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology Solutios to Quiz : Sprig 006 Problem : Each of the followig statemets is either True or False. There will be o partial credit give for the True False questios, thus ay explaatios will ot be graded. Please

More information

Math 475, Problem Set #12: Answers

Math 475, Problem Set #12: Answers Math 475, Problem Set #12: Aswers A. Chapter 8, problem 12, parts (b) ad (d). (b) S # (, 2) = 2 2, sice, from amog the 2 ways of puttig elemets ito 2 distiguishable boxes, exactly 2 of them result i oe

More information

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Patter Recogitio Classificatio: No-Parametric Modelig Hamid R. Rabiee Jafar Muhammadi Sprig 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Ageda Parametric Modelig No-Parametric Modelig

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018 CSE 353 Discrete Computatioal Structures Sprig 08 Sequeces, Mathematical Iductio, ad Recursio (Chapter 5, Epp) Note: some course slides adopted from publisher-provided material Overview May mathematical

More information

KLMED8004 Medical statistics. Part I, autumn Estimation. We have previously learned: Population and sample. New questions

KLMED8004 Medical statistics. Part I, autumn Estimation. We have previously learned: Population and sample. New questions We have previously leared: KLMED8004 Medical statistics Part I, autum 00 How kow probability distributios (e.g. biomial distributio, ormal distributio) with kow populatio parameters (mea, variace) ca give

More information

Class 27. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700

Class 27. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700 Class 7 Daiel B. Rowe, Ph.D. Departmet of Mathematics, Statistics, ad Computer Sciece Copyright 013 by D.B. Rowe 1 Ageda: Skip Recap Chapter 10.5 ad 10.6 Lecture Chapter 11.1-11. Review Chapters 9 ad 10

More information

Math 152. Rumbos Fall Solutions to Review Problems for Exam #2. Number of Heads Frequency

Math 152. Rumbos Fall Solutions to Review Problems for Exam #2. Number of Heads Frequency Math 152. Rumbos Fall 2009 1 Solutios to Review Problems for Exam #2 1. I the book Experimetatio ad Measuremet, by W. J. Youde ad published by the by the Natioal Sciece Teachers Associatio i 1962, the

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

Chapter 7 COMBINATIONS AND PERMUTATIONS. where we have the specific formula for the binomial coefficients:

Chapter 7 COMBINATIONS AND PERMUTATIONS. where we have the specific formula for the binomial coefficients: Chapter 7 COMBINATIONS AND PERMUTATIONS We have see i the previous chapter that (a + b) ca be writte as 0 a % a & b%þ% a & b %þ% b where we have the specific formula for the biomial coefficiets: '!!(&)!

More information

Lecture 2: April 3, 2013

Lecture 2: April 3, 2013 TTIC/CMSC 350 Mathematical Toolkit Sprig 203 Madhur Tulsiai Lecture 2: April 3, 203 Scribe: Shubhedu Trivedi Coi tosses cotiued We retur to the coi tossig example from the last lecture agai: Example. Give,

More information

Computing Confidence Intervals for Sample Data

Computing Confidence Intervals for Sample Data Computig Cofidece Itervals for Sample Data Topics Use of Statistics Sources of errors Accuracy, precisio, resolutio A mathematical model of errors Cofidece itervals For meas For variaces For proportios

More information

FIR Filter Design: Part II

FIR Filter Design: Part II EEL335: Discrete-Time Sigals ad Systems. Itroductio I this set of otes, we cosider how we might go about desigig FIR filters with arbitrary frequecy resposes, through compositio of multiple sigle-peak

More information

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA, 016 MODULE : Statistical Iferece Time allowed: Three hours Cadidates should aswer FIVE questios. All questios carry equal marks. The umber

More information

Discrete probability distributions

Discrete probability distributions Discrete probability distributios I the chapter o probability we used the classical method to calculate the probability of various values of a radom variable. I some cases, however, we may be able to develop

More information

Sums, products and sequences

Sums, products and sequences Sums, products ad sequeces How to write log sums, e.g., 1+2+ (-1)+ cocisely? i=1 Sum otatio ( sum from 1 to ): i 3 = 1 + 2 + + If =3, i=1 i = 1+2+3=6. The ame ii does ot matter. Could use aother letter

More information

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n. Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator

More information

STAT 516 Answers Homework 6 April 2, 2008 Solutions by Mark Daniel Ward PROBLEMS

STAT 516 Answers Homework 6 April 2, 2008 Solutions by Mark Daniel Ward PROBLEMS STAT 56 Aswers Homework 6 April 2, 28 Solutios by Mark Daiel Ward PROBLEMS Chapter 6 Problems 2a. The mass p(, correspods to either o the irst two balls beig white, so p(, 8 7 4/39. The mass p(, correspods

More information

Lecture Overview. 2 Permutations and Combinations. n(n 1) (n (k 1)) = n(n 1) (n k + 1) =

Lecture Overview. 2 Permutations and Combinations. n(n 1) (n (k 1)) = n(n 1) (n k + 1) = COMPSCI 230: Discrete Mathematics for Computer Sciece April 8, 2019 Lecturer: Debmalya Paigrahi Lecture 22 Scribe: Kevi Su 1 Overview I this lecture, we begi studyig the fudametals of coutig discrete objects.

More information

The variance of a sum of independent variables is the sum of their variances, since covariances are zero. Therefore. V (xi )= n n 2 σ2 = σ2.

The variance of a sum of independent variables is the sum of their variances, since covariances are zero. Therefore. V (xi )= n n 2 σ2 = σ2. SAMPLE STATISTICS A radom sample x 1,x,,x from a distributio f(x) is a set of idepedetly ad idetically variables with x i f(x) for all i Their joit pdf is f(x 1,x,,x )=f(x 1 )f(x ) f(x )= f(x i ) The sample

More information

Lecture 7: Properties of Random Samples

Lecture 7: Properties of Random Samples Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

Lecture 10: Universal coding and prediction

Lecture 10: Universal coding and prediction 0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved

More information

1 Models for Matched Pairs

1 Models for Matched Pairs 1 Models for Matched Pairs Matched pairs occur whe we aalyse samples such that for each measuremet i oe of the samples there is a measuremet i the other sample that directly relates to the measuremet i

More information

Approximations and more PMFs and PDFs

Approximations and more PMFs and PDFs Approximatios ad more PMFs ad PDFs Saad Meimeh 1 Approximatio of biomial with Poisso Cosider the biomial distributio ( b(k,,p = p k (1 p k, k λ: k Assume that is large, ad p is small, but p λ at the limit.

More information

GG313 GEOLOGICAL DATA ANALYSIS

GG313 GEOLOGICAL DATA ANALYSIS GG313 GEOLOGICAL DATA ANALYSIS 1 Testig Hypothesis GG313 GEOLOGICAL DATA ANALYSIS LECTURE NOTES PAUL WESSEL SECTION TESTING OF HYPOTHESES Much of statistics is cocered with testig hypothesis agaist data

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Stat 421-SP2012 Interval Estimation Section

Stat 421-SP2012 Interval Estimation Section Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible

More information

Problems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman:

Problems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman: Math 224 Fall 2017 Homework 4 Drew Armstrog Problems from 9th editio of Probability ad Statistical Iferece by Hogg, Tais ad Zimmerma: Sectio 2.3, Exercises 16(a,d),18. Sectio 2.4, Exercises 13, 14. Sectio

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 19

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 19 CS 70 Discrete Mathematics ad Probability Theory Sprig 2016 Rao ad Walrad Note 19 Some Importat Distributios Recall our basic probabilistic experimet of tossig a biased coi times. This is a very simple

More information

Chapter 6 Sampling Distributions

Chapter 6 Sampling Distributions Chapter 6 Samplig Distributios 1 I most experimets, we have more tha oe measuremet for ay give variable, each measuremet beig associated with oe radomly selected a member of a populatio. Hece we eed to

More information

Kurskod: TAMS11 Provkod: TENB 21 March 2015, 14:00-18:00. English Version (no Swedish Version)

Kurskod: TAMS11 Provkod: TENB 21 March 2015, 14:00-18:00. English Version (no Swedish Version) Kurskod: TAMS Provkod: TENB 2 March 205, 4:00-8:00 Examier: Xiagfeg Yag (Tel: 070 2234765). Please aswer i ENGLISH if you ca. a. You are allowed to use: a calculator; formel -och tabellsamlig i matematisk

More information

Confidence intervals summary Conservative and approximate confidence intervals for a binomial p Examples. MATH1005 Statistics. Lecture 24. M.

Confidence intervals summary Conservative and approximate confidence intervals for a binomial p Examples. MATH1005 Statistics. Lecture 24. M. MATH1005 Statistics Lecture 24 M. Stewart School of Mathematics ad Statistics Uiversity of Sydey Outlie Cofidece itervals summary Coservative ad approximate cofidece itervals for a biomial p The aïve iterval

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

UC Berkeley CS 170: Efficient Algorithms and Intractable Problems Handout 17 Lecturer: David Wagner April 3, Notes 17 for CS 170

UC Berkeley CS 170: Efficient Algorithms and Intractable Problems Handout 17 Lecturer: David Wagner April 3, Notes 17 for CS 170 UC Berkeley CS 170: Efficiet Algorithms ad Itractable Problems Hadout 17 Lecturer: David Wager April 3, 2003 Notes 17 for CS 170 1 The Lempel-Ziv algorithm There is a sese i which the Huffma codig was

More information

This is an introductory course in Analysis of Variance and Design of Experiments.

This is an introductory course in Analysis of Variance and Design of Experiments. 1 Notes for M 384E, Wedesday, Jauary 21, 2009 (Please ote: I will ot pass out hard-copy class otes i future classes. If there are writte class otes, they will be posted o the web by the ight before class

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

1 Review of Probability & Statistics

1 Review of Probability & Statistics 1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5

More information

Economics Spring 2015

Economics Spring 2015 1 Ecoomics 400 -- Sprig 015 /17/015 pp. 30-38; Ch. 7.1.4-7. New Stata Assigmet ad ew MyStatlab assigmet, both due Feb 4th Midterm Exam Thursday Feb 6th, Chapters 1-7 of Groeber text ad all relevat lectures

More information

Big Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates.

Big Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates. 5. Data, Estimates, ad Models: quatifyig the accuracy of estimates. 5. Estimatig a Normal Mea 5.2 The Distributio of the Normal Sample Mea 5.3 Normal data, cofidece iterval for, kow 5.4 Normal data, cofidece

More information

Probability theory and mathematical statistics:

Probability theory and mathematical statistics: N.I. Lobachevsky State Uiversity of Nizhi Novgorod Probability theory ad mathematical statistics: Law of Total Probability. Associate Professor A.V. Zorie Law of Total Probability. 1 / 14 Theorem Let H

More information

1 Generating functions for balls in boxes

1 Generating functions for balls in boxes Math 566 Fall 05 Some otes o geeratig fuctios Give a sequece a 0, a, a,..., a,..., a geeratig fuctio some way of represetig the sequece as a fuctio. There are may ways to do this, with the most commo ways

More information

SDS 321: Introduction to Probability and Statistics

SDS 321: Introduction to Probability and Statistics SDS 321: Itroductio to Probability ad Statistics Lecture 23: Cotiuous radom variables- Iequalities, CLT Puramrita Sarkar Departmet of Statistics ad Data Sciece The Uiversity of Texas at Austi www.cs.cmu.edu/

More information

Confidence Intervals for the Population Proportion p

Confidence Intervals for the Population Proportion p Cofidece Itervals for the Populatio Proportio p The cocept of cofidece itervals for the populatio proportio p is the same as the oe for, the samplig distributio of the mea, x. The structure is idetical:

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

(7 One- and Two-Sample Estimation Problem )

(7 One- and Two-Sample Estimation Problem ) 34 Stat Lecture Notes (7 Oe- ad Two-Sample Estimatio Problem ) ( Book*: Chapter 8,pg65) Probability& Statistics for Egieers & Scietists By Walpole, Myers, Myers, Ye Estimatio 1 ) ( ˆ S P i i Poit estimate:

More information

April 18, 2017 CONFIDENCE INTERVALS AND HYPOTHESIS TESTING, UNDERGRADUATE MATH 526 STYLE

April 18, 2017 CONFIDENCE INTERVALS AND HYPOTHESIS TESTING, UNDERGRADUATE MATH 526 STYLE April 18, 2017 CONFIDENCE INTERVALS AND HYPOTHESIS TESTING, UNDERGRADUATE MATH 526 STYLE TERRY SOO Abstract These otes are adapted from whe I taught Math 526 ad meat to give a quick itroductio to cofidece

More information