Grundlagen der Künstlichen Intelligenz

1 Grundlagen der Künstlichen Intelligenz (Foundations of Artificial Intelligence)
Uncertainty & Probabilities & Bandits
Daniel Hennes (WS 2017/18)
University Stuttgart - IPVS - Machine Learning & Robotics

2 Today
Uncertainty
Probability
Inference
Random variables
Independence and Bayes rule
Bandits
Russell & Norvig: Chapter 13
Sheldon Ross: A first course in probability
Christopher M. Bishop: Pattern recognition and machine learning

3 Quiz
1% of the population have illness X.
Test Y is an indicator for illness X.
Y delivers a positive test result in 99% of the cases in which the patient actually has illness X (true positive).
Y delivers a positive test result in 10% of the cases in which the patient does not have X (false positive).
Tim got a positive test result. What is the probability that he has illness X?
(a) 100% (b) 84.3% (c) 78.2% (d) 9.1% (e) 0.8%

4 Probability theory
Probability theory is a mathematical framework for representing uncertain statements:
  Quantifying uncertainty
  Axioms for deriving new uncertain statements
Probability theory in AI:
  1. How should systems reason and act under uncertainty? Inference!
  2. How to analyze the behavior of proposed AI systems
Probability theory: make uncertain statements and reason in the presence of uncertainty
Information theory: quantify the amount of uncertainty in a probability distribution

5 Why probability?
"Beyond mathematical statements, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur." (Goodfellow & Bengio)
1. Inherent (objective) stochasticity in the world/system
   quantum mechanics: dynamics of subatomic particles are probabilistic
   theoretical scenarios: card game (shuffled in random order)
2. Incomplete observability
   we cannot observe all the variables (hidden/latent variables)
   expressing information and lack of information
3. Incomplete modeling
   we usually discard some of the information we have observed
4. Laziness
   simple uncertain rule vs. complex certain one

6 Inference
Inference: given some pieces of information (prior, observed variables), what is the implication (the implied information, the posterior) for a non-observed variable?
Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.

7 Probability: Frequentist and Bayesian
Probability theory was originally developed to analyze the frequencies of events.
Frequentist probabilities are defined in the limit of an infinite number of trials.
  How likely is it that a particular coin lands heads up?
  If we repeated the experiment infinitely often, the probability gives the fraction of trials that deliver a particular outcome.
Bayesian (subjective) probabilities quantify a degree of belief.
  A doctor diagnoses a patient: according to the doctor, the patient has a 40% chance of having the flu.
  It is not possible to replicate the patient infinitely often.

8 Random variables
A random variable X can take on different values randomly.
On its own, it is just a description of the states that are possible.
It must be coupled with a probability distribution that specifies how likely each of the states is.
Example: X is a random variable that represents a die throw.
  The domain of X is $\Omega = \{1, 2, 3, 4, 5, 6\}$.
  $P(X = x)$ denotes a specific probability, i.e. the probability that X takes on the value x (the die shows face x).
  $P(X)$ denotes the probability distribution.

9 Probability mass functions
A probability distribution over a discrete variable can be described using a probability mass function (PMF).
A PMF maps from the state of a random variable to the probability of that random variable taking on that state.
Properties:
  The domain of P must be the set of all possible states $\Omega$.
  Non-negativity: $\forall x \in \Omega : P(X = x) \ge 0$
  Normalization: $\sum_{x \in \Omega} P(X = x) = 1$
Examples:
  Fair die: $P(X) = [\tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}]$
  Discrete uniform distribution: $P(X = x_i) = \frac{1}{n}$ with $n = |\Omega|$, so that $\sum_i P(X = x_i) = \sum_i \frac{1}{n} = \frac{n}{n} = 1$

10 Joint probability distribution
Assume we have two random variables X and Y.
Probability that X = x and Y = y simultaneously: $P(X = x, Y = y)$, or $P(X, Y)$ for brevity.

11 Marginal probability
Given the joint probability distribution over a set of variables, we want to know the probability distribution over a subset of them:
$\forall x : P(X = x) = \sum_y P(X = x, Y = y)$
$P(X) = \sum_Y P(X, Y)$

12 Conditional probability
Probability of some event, given that some other event has been observed:
$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$
$P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$
The conditional is normalized: $\forall y : \sum_x P(X = x \mid Y = y) = 1$
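The marginal and conditional formulas can be tried on a small table. The following is a minimal sketch, assuming a made-up joint distribution over two binary variables; the numbers and the helper names (`marginal_x`, `marginal_y`, `conditional_x_given_y`) are illustrative and not from the slides.

```python
from fractions import Fraction

# Hypothetical joint table P(X, Y) for two binary variables (illustrative numbers).
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_x(joint):
    """P(X = x) = sum over y of P(X = x, Y = y)."""
    result = {}
    for (x, _y), p in joint.items():
        result[x] = result.get(x, 0) + p
    return result


def marginal_y(joint):
    """P(Y = y) = sum over x of P(X = x, Y = y)."""
    result = {}
    for (_x, y), p in joint.items():
        result[y] = result.get(y, 0) + p
    return result


def conditional_x_given_y(joint, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    p_y = marginal_y(joint)[y]
    return {x: p / p_y for (x, y_val), p in joint.items() if y_val == y}


p_x = marginal_x(joint)
p_x_given_y1 = conditional_x_given_y(joint, 1)
```

Exact fractions are used so that the normalization of the conditional can be checked without rounding error.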

13 Independence and conditional independence
Two random variables X and Y are independent iff:
$\forall x, y : P(X = x, Y = y) = P(X = x) P(Y = y)$
X and Y are independent iff $P(X, Y) = P(X) P(Y)$
X is independent of Y iff $P(X \mid Y) = P(X)$
$X \perp Y$ means X and Y are independent.
Conditional independence: $P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z)$
$X \perp Y \mid Z$ means X and Y are conditionally independent given Z.
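Independence can be tested directly from a joint table by comparing each entry with the product of its marginals. This is a small sketch under that definition; the function name `is_independent` and both example tables are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """Check P(X = x, Y = y) == P(X = x) * P(Y = y) for all x, y."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return all(joint[(x, y)] == p_x[x] * p_y[y] for x, y in product(xs, ys))


# Two independent fair coins: every joint entry is 1/4.
coins = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

# Y is a copy of X: the variables are fully dependent.
copies = {
    (0, 0): Fraction(1, 2),
    (0, 1): Fraction(0),
    (1, 0): Fraction(0),
    (1, 1): Fraction(1, 2),
}
```

For the `copies` table the marginals are both uniform, yet the joint entry for (0, 0) is 1/2 rather than 1/4, so the check fails.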

14 Implications of conditional probability
Conditional probability: $P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$
Product rule / chain rule:
$P(X, Y) = P(X \mid Y) P(Y) = P(Y \mid X) P(X)$
$P(X, Y, Z) = P(X \mid Y, Z) P(Y \mid Z) P(Z)$
Bayes rule: $P(X \mid Y) = \frac{P(Y \mid X) P(X)}{P(Y)}$

15 Bayes rule
$P(X \mid Y) = \frac{P(Y \mid X) P(X)}{P(Y)}$
posterior = likelihood · prior / normalization
Note: We can usually compute $P(Y) = \sum_x P(Y \mid x) P(x)$

16 Multiple random variables
Analogously for n random variables $X_{1:n}$:
Joint: $P(X_{1:n}) = P(X_1, X_2, X_3, \ldots, X_n)$
Marginal: $P(X_1) = \sum_{X_{2:n}} P(X_{1:n})$
Conditional: $P(X_1 \mid X_{2:n}) = \frac{P(X_{1:n})}{P(X_{2:n})}$
Chain rule: $P(X_{1:n}) = \prod_{i=1}^{n} P(X_i \mid X_{i+1:n})$
Bayes rule: $P(X_1 \mid X_{2:n}) = \frac{P(X_2 \mid X_1, X_{3:n}) P(X_1 \mid X_{3:n})}{P(X_2 \mid X_{3:n})}$
Examples:
$P(X \mid Y, Z) = \frac{P(Y \mid X, Z) P(X \mid Z)}{P(Y \mid Z)}$
$P(X, Y \mid Z) = \frac{P(X, Z \mid Y) P(Y)}{P(Z)}$

17 Bernoulli distribution
The Bernoulli distribution is a distribution over a single binary random variable: $x \in \{0, 1\}$
It is parameterized by a single scalar $\mu$:
$P(x = 1) = \mu$
$P(x = 0) = 1 - \mu$
$P(X = x) = \mathrm{Bernoulli}(x; \mu) = \mu^{x} (1 - \mu)^{1 - x}$
$E_X[X] = \mu$
$\mathrm{Var}_X(X) = \mu (1 - \mu)$
$X \sim \mathrm{Bernoulli}(\mu)$
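The mean and variance follow from summing over the two states. A minimal sketch that evaluates the PMF and recomputes both moments by enumeration; the function names and the value $\mu = 3/10$ are chosen here only for illustration.

```python
from fractions import Fraction


def bernoulli_pmf(x, mu):
    """Bernoulli(x; mu) = mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return mu ** x * (1 - mu) ** (1 - x)


def bernoulli_mean_var(mu):
    """Mean and variance computed by summing over both states."""
    mean = sum(x * bernoulli_pmf(x, mu) for x in (0, 1))
    var = sum((x - mean) ** 2 * bernoulli_pmf(x, mu) for x in (0, 1))
    return mean, var


mu = Fraction(3, 10)  # illustrative parameter, not from the slides
mean, var = bernoulli_mean_var(mu)
```

With exact fractions, the enumerated moments agree with the closed forms $\mu$ and $\mu(1 - \mu)$.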

18 Binomial distribution
The Binomial distribution is a distribution over the count of successes k in a sequence of n independent Bernoulli trials:
$\mathrm{Binomial}(k; n, \mu) = \binom{n}{k} \mu^{k} (1 - \mu)^{n - k}$
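A short sketch of the Binomial PMF using `math.comb` from the Python standard library (Python 3.8+). The function name `binomial_pmf` and the example values n = 4, µ = 1/2 are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, n, mu):
    """Probability of k successes in n independent Bernoulli(mu) trials."""
    return comb(n, k) * mu ** k * (1 - mu) ** (n - k)


n, mu = 4, Fraction(1, 2)  # illustrative values: four fair coin flips
pmf = [binomial_pmf(k, n, mu) for k in range(n + 1)]
```

The list `pmf` covers every possible count k = 0, ..., n, so its entries sum to one.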

19 Continuous random variables
Let $x \in \mathbb{R}$ be a continuous random variable.
The probability density function (PDF) $p(x) \in [0, \infty)$ defines the probability:
$P(a \le x \le b) = \int_a^b p(x)\,dx \in [0, 1]$
The domain of p must be the set of all possible states of x.
$\forall x : p(x) \ge 0$
$\int_{-\infty}^{\infty} p(x)\,dx = 1$
Note: we do not require $p(x) \le 1$!
Cumulative distribution function (CDF):
$F(y) = P(x \le y) = \int_{-\infty}^{y} p(x)\,dx \in [0, 1]$ with $\lim_{y \to \infty} F(y) = 1$
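That a density may exceed 1 while probabilities stay in [0, 1] is easy to see numerically. The sketch below assumes a uniform density on [0, 0.5] (so p(x) = 2 inside the interval) and a simple midpoint rule; both the example density and the helper `integrate` are illustrative choices.

```python
def uniform_pdf(x, lo=0.0, hi=0.5):
    """Uniform density on [lo, hi]; equals 2 on [0, 0.5], which is above 1."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


total = integrate(uniform_pdf, 0.0, 0.5)    # whole support
partial = integrate(uniform_pdf, 0.1, 0.3)  # P(0.1 <= x <= 0.3)
```

The density value is 2 everywhere on the support, yet the integral over the support is 1 and the integral over any sub-interval is a valid probability.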

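20 Gaussian distribution
[Figure: density $\mathcal{N}(x \mid \mu, \sigma^2)$ with the mean $\mu$ and the width $2\sigma$ marked]
Univariate normal distribution:
$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$
Multivariate normal distribution:
$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$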
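The univariate formula can be written out directly and compared against `statistics.NormalDist` from the Python standard library. This is a sketch; the function name `normal_pdf` and the test values are assumptions. Note that `normal_pdf` takes the variance $\sigma^2$, whereas `NormalDist` takes the standard deviation $\sigma$.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma2):
    """Univariate normal density N(x; mu, sigma^2), parameterized by the variance."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Compare with the standard library (sigma = 2, so sigma^2 = 4).
ours = normal_pdf(0.3, 1.0, 4.0)
reference = NormalDist(mu=1.0, sigma=2.0).pdf(0.3)

# A narrow Gaussian has a peak density above 1.
peak = normal_pdf(0.0, 0.0, 0.01)
```

The last line illustrates again that a density value is not a probability: with $\sigma^2 = 0.01$ the peak is $1 / \sqrt{2\pi \cdot 0.01}$, which is larger than 1.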

25 Quiz
1% of the population have illness X.
Test Y is an indicator for illness X.
Y delivers a positive test result in 99% of the cases in which the patient actually has illness X (true positive).
Y delivers a positive test result in 10% of the cases in which the patient does not have X (false positive).
$P(x) = 0.01$, $P(y \mid x) = 0.99$, $P(y \mid \neg x) = 0.1$
$P(X \mid Y) = \frac{P(Y \mid X) P(X)}{P(Y)}$
$P(y) = \sum_x P(y \mid x) P(x) = P(y \mid x) P(x) + P(y \mid \neg x) P(\neg x)$
$P(x \mid y) = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.1 \cdot (1 - 0.01)} = \frac{0.0099}{0.1089} \approx 0.091$
Correct answer is (d) 9.1%
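The quiz computation can be reproduced in a few lines. A minimal sketch of Bayes rule for a binary hypothesis; the function name `posterior` and its parameter names are chosen here for illustration.

```python
def posterior(prior, p_pos_given_ill, p_pos_given_healthy):
    """P(ill | positive) via Bayes rule for a binary illness variable."""
    # Normalization P(y) = P(y | x) P(x) + P(y | not x) P(not x)
    evidence = p_pos_given_ill * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_ill * prior / evidence


# Numbers from the quiz: 1% prevalence, 99% true positives, 10% false positives.
p_ill_given_positive = posterior(0.01, 0.99, 0.1)
```

The result is 1/11, about 9.1%: the false positives from the large healthy population outweigh the true positives from the small ill population.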

26 Multi-armed bandits
[Image; credits: Microsoft Research]
There are n machines.
Each machine i returns a reward $y \sim P(y; \theta_i)$.
The machine's parameter $\theta_i$ is unknown.
Goal is to maximize the reward collected over the first T trials.

27 Applications
Online advertisement
Clinical trials
Efficient optimization
Bandit problems are commercially very relevant.

28 The bandit problem is an archetype for
Sequential decision making
Decisions that influence knowledge as well as rewards/states
Exploration/exploitation
The same aspects are also inherent in global optimization, active learning & reinforcement learning.
The bandit problem formulation is the basis of Upper Confidence Bounds (UCB), which is the core of several planning and decision making methods.

29 Formal problem definition
Let $a_t \in \{1, \ldots, n\}$ be the choice of machine at time t.
Let $y_t \in \mathbb{R}$ be the outcome/reward.
A policy or strategy maps all the history to a new choice:
$\pi : [(a_1, y_1), (a_2, y_2), \ldots, (a_{t-1}, y_{t-1})] \mapsto a_t$
Problem: Find a policy $\pi$ that
  maximizes the sum over all outcomes: $\max \sum_{t=1}^{T} y_t$
  maximizes the last outcome: $\max y_T$
  maximizes the discounted infinite horizon: $\max \sum_{t=1}^{\infty} \gamma^t y_t$

30 Exploration, exploitation
Two effects of choosing a machine:
  Collect more data about the machine (knowledge)
  Collect reward
For example:
  Exploration: Choose the next action $a_t$ to $\min H(b_t)$
  Exploitation: Choose the next action $a_t$ to $\max y_t$

31 Upper Confidence Bound (UCB1)
1: Initialization: Play each machine once
2: repeat
3:   Play the machine i that maximizes $\hat{y}_i + \beta \sqrt{\frac{2 \ln n}{n_i}}$
4: until
$\hat{y}_i$ is the average reward of machine i so far
$n_i$ is how often machine i has been played so far
$n = \sum_i n_i$ is the number of rounds so far
$\beta$ is often chosen as $\beta = 1$
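The UCB1 steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `ucb1`, the `pull` callback, the fixed number of rounds used as the stopping condition, and the Bernoulli machines with parameters 0.2, 0.5 and 0.8 are all made up for the example.

```python
import math
import random


def ucb1(pull, n_machines, n_rounds, beta=1.0):
    """Run UCB1 for n_rounds; pull(i) must return a reward for machine i.

    Returns (counts, means): how often each machine was played and its
    average reward so far.
    """
    counts = [0] * n_machines
    means = [0.0] * n_machines

    def update(i, reward):
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental average

    # Initialization: play each machine once.
    for i in range(n_machines):
        update(i, pull(i))

    # n is the number of rounds played so far (the sum of all counts).
    for n in range(n_machines, n_rounds):
        scores = [
            means[i] + beta * math.sqrt(2.0 * math.log(n) / counts[i])
            for i in range(n_machines)
        ]
        best = max(range(n_machines), key=scores.__getitem__)
        update(best, pull(best))
    return counts, means


# Hypothetical Bernoulli machines; the parameters are unknown to the policy.
rng = random.Random(0)
thetas = [0.2, 0.5, 0.8]
counts, means = ucb1(lambda i: 1.0 if rng.random() < thetas[i] else 0.0,
                     n_machines=len(thetas), n_rounds=2000)
```

The exploration bonus shrinks as a machine is played more often, so rarely played machines are revisited occasionally while most plays go to the machine with the highest average reward.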

32 UCB algorithms
UCB algorithms determine a confidence interval such that $\hat{y}_i - \sigma_i < y_i < \hat{y}_i + \sigma_i$ with high probability.
UCB chooses the upper bound of this confidence interval: optimism in the face of uncertainty.
UCB selects the action with the largest (estimated) upper bound.
There are strong bounds on the regret (sub-optimality) of UCB1.
The bound is derived from the Hoeffding inequality.
See "Finite-time analysis of the multiarmed bandit problem", Auer, Cesa-Bianchi & Fischer, Machine Learning.

33 Bayesian bandits
So far we have made no assumptions about the reward distribution $p(y)$.
Bayesian bandits exploit prior knowledge about the reward distribution.
They compute the posterior distribution of rewards $p(y \mid h_t)$, where $h_t$ is the history $h_t = a_1, y_1, a_2, y_2, \ldots, a_{t-1}, y_{t-1}$.
We use the posterior to guide exploration.
Better performance if prior knowledge is accurate.

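34 UCB for Gaussian
Assume $y_i \sim \mathcal{N}(y; \mu_i, \sigma_i^2)$.
We compute the Gaussian posterior (Bayes rule):
$p(\mu_i, \sigma_i^2 \mid h_t) \propto p(\mu_i, \sigma_i^2) \prod_{t : a_t = i} \mathcal{N}(y_t \mid \mu_i, \sigma_i^2)$
Pick the action that maximizes $\mu_i + \beta \frac{\sigma_i}{\sqrt{n_i}}$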

35 UCB - Discussion
UCB over-estimates the reward-to-go (under-estimates the cost-to-go), just like A*, but does so in the probabilistic setting of bandits.
The fact that regret bounds exist is great!
UCB became a core method for algorithms to decide what to explore.
In tree search, the decision of which branches to explore further is itself a decision problem. An intelligent agent like UCB can be used within the search to make decisions about how to grow the tree.
