Towards a Bayesian model for Cyber Security

1 Towards a Bayesian model for Cyber Security Mark Briers (mbriers@turing.ac.uk) Joint work with Henry Clausen and Prof. Niall Adams (Imperial College London) 27 September 2017 The Alan Turing Institute 1

2 The Alan Turing Institute is the national centre for data science, headquartered at the British Library.

3 Priorities at The Alan Turing Institute: Engineering; Technology; Defence & security; Smart cities; Financial services; Health & wellbeing; Mathematical representations; Inference and learning; Systems and platforms; Understanding human behaviour.

4 Motivation: Uncertainty representation
There exists a large amount of expertise in the Cyber domain. There is often a significant amount of prior information used in statistical modelling. A Bayesian approach should be considered for rigorous quantification of uncertainty in this environment. The long-term goal is to incrementally develop a full Bayesian model for Cyber security. This presentation introduces a probabilistic model for detecting the state of a device.

5 Contents: An overview of sequential Bayesian inference; State-space models; Pseudo-marginal distribution; An overview of a Markov Modulated Poisson Process; Application to Cyber security; Next steps.

6 Introduction: State-space models
We are interested in quantifying the uncertainty associated with an unobserved variable, $X$, given some information pertaining to that variable, $Y$. In a sequential context, this relates to the calculation of the posterior distribution, $P(X_{1:T} \in dx_{1:T} \mid Y_{1:T} = y_{1:T})$, where $x_{1:T} = \{x_1, \ldots, x_T\}$ and $y_{1:T} = \{y_1, \ldots, y_T\}$ are generic points in the path space of the signal and observation processes and $t \in \mathbb{N}^+$.
Standard Markov assumptions:
$$X_t \mid X_{t-1} = x_{t-1} \sim f(\cdot \mid x_{t-1}), \qquad Y_t \mid X_t = x_t \sim g(\cdot \mid x_t).$$

7 Introduction: Prediction, filtering and smoothing
Many state-space applications are interested in $p(x_t \mid y_{1:l})$: if $l < t$ this is prediction; if $l = t$, filtering; if $l > t$, smoothing. We will concentrate on the last of these, smoothing. Two formulations are available: forward-backward smoothing and two-filter smoothing.

8 Probabilistic recursion: Filtering
Filtering is the term used to describe the process of recursively calculating the marginal posterior distribution, and it is required to perform smoothing. By Bayes' rule:
$$p(x_t \mid y_{1:t}) = p(x_t \mid y_{1:t-1}, y_t) \propto g(y_t \mid x_t)\, p(x_t \mid y_{1:t-1}),$$
$$p(x_t \mid y_{1:t-1}) = \int p(x_t, x_{t-1} \mid y_{1:t-1})\, dx_{t-1} = \int f(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}.$$
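
The filtering recursion on this slide can be sketched for a finite state space, where the integrals become sums. This is only an illustration under assumed inputs: the function name `forward_filter`, the two-state transition matrix and the likelihood values are hypothetical and do not come from the talk.

```python
def forward_filter(prior, trans, lik):
    """Filtering recursion for a finite state space (illustrative sketch).

    prior[i]    : p(x_1 = i)
    trans[i][j] : f(x_t = j | x_{t-1} = i)
    lik[t][j]   : g(y_t | x_t = j), already evaluated at the observed y_t
    Returns (predicted, filtered), each a list of distributions over states.
    """
    n = len(prior)
    predicted, filtered = [], []
    for t, g_t in enumerate(lik):
        if t == 0:
            pred = list(prior)
        else:
            # Prediction: sum over x_{t-1} of f(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1})
            prev = filtered[-1]
            pred = [sum(prev[i] * trans[i][j] for i in range(n)) for j in range(n)]
        # Update: p(x_t | y_{1:t}) is proportional to g(y_t | x_t) p(x_t | y_{1:t-1})
        unnorm = [g_t[j] * pred[j] for j in range(n)]
        z = sum(unnorm)
        predicted.append(pred)
        filtered.append([u / z for u in unnorm])
    return predicted, filtered


# Hypothetical two-state example with two observations.
predicted, filtered = forward_filter(
    prior=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.2, 0.8]],
    lik=[[0.8, 0.1], [0.3, 0.6]],
)
```

The predicted distributions are returned alongside the filtered ones because the forward-backward smoother needs both.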

9 Probabilistic recursion: Forward-backward smoothing
Compute the filtered and predicted distributions in a forward filtering recursion and then execute a backward recursion as follows:
$$p(x_t \mid y_{1:T}) = \int p(x_t, x_{t+1} \mid y_{1:T})\, dx_{t+1}$$
$$= \int p(x_{t+1} \mid y_{1:T})\, p(x_t \mid x_{t+1}, y_{1:T})\, dx_{t+1}$$
$$= \int p(x_{t+1} \mid y_{1:T})\, p(x_t \mid x_{t+1}, y_{1:t})\, dx_{t+1}$$
$$= p(x_t \mid y_{1:t}) \int \frac{p(x_{t+1} \mid y_{1:T})\, f(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y_{1:t})}\, dx_{t+1}.$$
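
As a minimal sketch of the backward recursion, assuming a finite state space so that the integral is a sum, the smoother below reuses the predicted and filtered distributions from a forward pass. The function name and the inputs are hypothetical, and the code assumes every predicted probability is strictly positive.

```python
def forward_backward(prior, trans, lik):
    """Forward-backward smoothing for a finite state space (illustrative sketch).

    prior[i]    : p(x_1 = i)
    trans[i][j] : f(x_{t+1} = j | x_t = i)
    lik[t][j]   : g(y_t | x_t = j) evaluated at the observed y_t
    Returns the smoothed marginals p(x_t | y_{1:T}) for every t.
    """
    n, T = len(prior), len(lik)
    # Forward pass: predicted p(x_t | y_{1:t-1}) and filtered p(x_t | y_{1:t}).
    pred, filt = [], []
    for t in range(T):
        if t == 0:
            p = list(prior)
        else:
            p = [sum(filt[t - 1][i] * trans[i][j] for i in range(n)) for j in range(n)]
        u = [lik[t][j] * p[j] for j in range(n)]
        z = sum(u)
        pred.append(p)
        filt.append([v / z for v in u])
    # Backward pass: p(x_t | y_{1:T}) = p(x_t | y_{1:t}) *
    #   sum_{x_{t+1}} p(x_{t+1} | y_{1:T}) f(x_{t+1} | x_t) / p(x_{t+1} | y_{1:t}).
    smooth = [None] * T
    smooth[T - 1] = filt[T - 1]
    for t in range(T - 2, -1, -1):
        smooth[t] = [
            filt[t][i]
            * sum(smooth[t + 1][j] * trans[i][j] / pred[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    return smooth


# Hypothetical two-state example with three observations.
smoothed = forward_backward(
    prior=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.2, 0.8]],
    lik=[[0.8, 0.1], [0.3, 0.6], [0.5, 0.5]],
)
```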

10 Probabilistic recursion: Two-filter smoothing
Combines a forward-time filter with a backward-time filter:
$$p(x_t \mid y_{1:T}) = p(x_t \mid y_{1:t-1}, y_{t:T}) = \frac{p(x_t \mid y_{1:t-1})\, p(y_{t:T} \mid y_{1:t-1}, x_t)}{p(y_{t:T} \mid y_{1:t-1})} \propto p(x_t \mid y_{1:t-1})\, p(y_{t:T} \mid x_t).$$

11 Probabilistic recursion: Two-filter smoothing
A backward information filter can be used to calculate $p(y_{t:T} \mid x_t)$ sequentially from $p(y_{t+1:T} \mid x_{t+1})$:
$$p(y_{t:T} \mid x_t) = \int p(y_t, y_{t+1:T}, x_{t+1} \mid x_t)\, dx_{t+1} = \int p(y_{t+1:T} \mid x_{t+1})\, f(x_{t+1} \mid x_t)\, g(y_t \mid x_t)\, dx_{t+1}.$$
Problem: $p(y_{t:T} \mid x_t)$ is a function of $x_t$ (and so not a pdf in $x_t$), and $\int p(y_{t:T} \mid x_t)\, dx_t$ is not necessarily finite, so SMC methods cannot be used.

12 Probabilistic recursion: Solution options
Solution 1: Restrict the models used. This is not good in practice.
Solution 2 (our solution): Introduce an artificial prior distribution that is removed when one forms the smoothing distribution.

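13 Probabilistic recursion: solution
With the artificial prior denoted $\gamma_t(x_t)$, the backward recursion is constructed as follows:
$$p(x_t \mid y_{1:T}) \propto p(x_t \mid y_{1:t-1})\, p(y_{t:T} \mid x_t) \propto \frac{p(x_t \mid y_{1:t-1})\, \tilde p(x_t \mid y_{t:T})}{\gamma_t(x_t)},$$
where
$$\tilde p(x_t \mid y_{t:T}) = \frac{g(y_t \mid x_t)\, \tilde p(x_t \mid y_{t+1:T})}{\int g(y_t \mid x'_t)\, \tilde p(x'_t \mid y_{t+1:T})\, dx'_t}$$
and
$$\tilde p(x_t \mid y_{t+1:T}) = \int \tilde p(x_{t+1} \mid y_{t+1:T})\, \frac{f(x_{t+1} \mid x_t)\, \gamma_t(x_t)}{\gamma_{t+1}(x_{t+1})}\, dx_{t+1}.$$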
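
The artificial-prior construction can be checked exactly on a finite state space. The sketch below is an illustration, not the authors' implementation: it assumes the artificial prior is taken to be the prior marginal of the state at each time, so that the backward quantities are probability vectors, and it assumes those marginals are strictly positive. All names and inputs are hypothetical.

```python
def normalise(v):
    s = sum(v)
    return [x / s for x in v]


def two_filter_smoother(prior, trans, lik):
    """Two-filter smoothing with an artificial prior on a finite state space.

    prior[i]    : p(x_1 = i)
    trans[i][j] : f(x_{t+1} = j | x_t = i)
    lik[t][j]   : g(y_t | x_t = j) evaluated at the observed y_t
    """
    n, T = len(prior), len(lik)
    # Assumed artificial prior: the prior marginal of x_t, propagated through f.
    gamma = [list(prior)]
    for t in range(1, T):
        gamma.append([sum(gamma[t - 1][i] * trans[i][j] for i in range(n)) for j in range(n)])
    # Forward filter, keeping only the predicted distributions p(x_t | y_{1:t-1}).
    pred = [list(prior)]
    for t in range(T - 1):
        filt = normalise([lik[t][j] * pred[t][j] for j in range(n)])
        pred.append([sum(filt[i] * trans[i][j] for i in range(n)) for j in range(n)])
    # Backward information filter, normalised: back[t] approximates p~(x_t | y_{t:T}).
    back = [None] * T
    back[T - 1] = normalise([lik[T - 1][j] * gamma[T - 1][j] for j in range(n)])
    for t in range(T - 2, -1, -1):
        # p~(x_t | y_{t+1:T}) = sum_{x_{t+1}} p~(x_{t+1} | y_{t+1:T}) f gamma_t / gamma_{t+1}
        ahead = [
            gamma[t][i] * sum(back[t + 1][j] * trans[i][j] / gamma[t + 1][j] for j in range(n))
            for i in range(n)
        ]
        back[t] = normalise([lik[t][i] * ahead[i] for i in range(n)])
    # Combine and remove the artificial prior.
    return [
        normalise([pred[t][i] * back[t][i] / gamma[t][i] for i in range(n)])
        for t in range(T)
    ]


smoothed = two_filter_smoother(
    prior=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.2, 0.8]],
    lik=[[0.8, 0.1], [0.3, 0.6], [0.5, 0.5]],
)
```

On a finite state space the result should agree with forward-backward smoothing; the two approaches only differ once particle approximations are used.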

14 Importance sampling approximation
Importance sampling: computation of Monte Carlo estimates, e.g. expectations $E_{\pi(x)}[f(x)]$:
$$\int f(x)\, \pi(x)\, dx = \int f(x)\, \frac{\pi(x)}{q(x)}\, q(x)\, dx \approx \sum_{i=1}^{N} w^{(i)} f(x^{(i)}).$$
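
A minimal sketch of the estimator, assuming a standard normal target, a wider normal proposal and self-normalised weights. These choices, the function names and the sample size are illustrative assumptions and are not taken from the slides.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def importance_estimate(f, target_pdf, proposal_pdf, proposal_sample, n, rng):
    """Self-normalised importance sampling estimate of E_target[f(x)]."""
    xs = [proposal_sample(rng) for _ in range(n)]
    raw = [target_pdf(x) / proposal_pdf(x) for x in xs]  # unnormalised weights pi/q
    total = sum(raw)
    weights = [w / total for w in raw]                   # weights w^(i) sum to one
    return sum(w * f(x) for w, x in zip(weights, xs))


# Estimate E[x^2] under N(0, 1) using samples drawn from N(0, 2^2).
rng = random.Random(0)
estimate = importance_estimate(
    f=lambda x: x * x,
    target_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    proposal_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    proposal_sample=lambda r: r.gauss(0.0, 2.0),
    n=20000,
    rng=rng,
)
```

The exact value of the expectation in this example is 1, so `estimate` should be close to that. A proposal with heavier tails than the target keeps the weights bounded.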

15 Importance sampling approximation

16 SMC approximation
Sequential importance sampling: sequential (recursive) computation of Monte Carlo estimates.
Sample $X_t^{(i)} \sim q(\cdot \mid x_{t-1}^{(i)}, y_t)$.
Update weight:
$$w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{f(x_t^{(i)} \mid x_{t-1}^{(i)})\, g(y_t \mid x_t^{(i)})}{q(x_t^{(i)} \mid x_{t-1}^{(i)}, y_t)}.$$
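
One sequential importance sampling step can be sketched as follows. The linear Gaussian model, the proposal and the helper names are assumptions chosen for illustration, and no resampling step is included because the slide does not describe one.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def sis_step(particles, weights, y, f_pdf, g_pdf, q_sample, q_pdf, rng):
    """Propagate particles through q and update weights by f * g / q."""
    new_particles, raw = [], []
    for x_prev, w_prev in zip(particles, weights):
        x = q_sample(x_prev, y, rng)
        raw.append(w_prev * f_pdf(x, x_prev) * g_pdf(y, x) / q_pdf(x, x_prev, y))
        new_particles.append(x)
    total = sum(raw)
    return new_particles, [w / total for w in raw]


# Hypothetical model: x_t = 0.9 x_{t-1} + N(0, 1), y_t = x_t + N(0, 1), x_0 ~ N(0, 1).
rng = random.Random(0)
n = 20000
particles = [rng.gauss(0.0, 1.0) for _ in range(n)]
weights = [1.0 / n] * n
particles, weights = sis_step(
    particles, weights, y=1.0,
    f_pdf=lambda x, xp: normal_pdf(x, 0.9 * xp, 1.0),
    g_pdf=lambda y, x: normal_pdf(y, x, 1.0),
    q_sample=lambda xp, y, r: r.gauss(0.9 * xp, 1.5),     # proposal wider than f
    q_pdf=lambda x, xp, y: normal_pdf(x, 0.9 * xp, 1.5),
    rng=rng,
)
posterior_mean = sum(w * x for w, x in zip(weights, particles))
```

For this model the exact filtering mean after one observation is 1.81 / 2.81, which gives a simple check on the weighted particle estimate.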

17 SMC approximation

18 Forward-backward smoothing: SMC-based approximation
Forward-backward smoothing:
$$p(x_t \mid y_{1:T}) \approx \hat p(x_t \mid y_{1:T}) = \sum_{i=1}^{N} w_{t|T}^{(i)}\, \delta_{X_t^{(i)}}(x_t),$$
where $w_{t|T}^{(i)}$ denotes the (smoothed) weight of the $i$th particle at time $t$ conditioned on the data $y_{1:T}$.
Problem: Reliance on the particle support of the filtering distribution for the approximation to the smoothing distribution.
Solution: Two-filter particle smoother.

19 Two-filter SMC algorithm
Initialise at time $t = T$.
For $i = 1, \ldots, N$, sample $\tilde X_T^{(i)} \sim \tilde q(\cdot \mid y_T)$.
For $i = 1, \ldots, N$, compute and normalise the importance weights:
$$\tilde w_T^{(i)} \propto \frac{\gamma_T(\tilde X_T^{(i)})\, g(y_T \mid \tilde X_T^{(i)})}{\tilde q(\tilde X_T^{(i)} \mid y_T)}.$$

20 Two-filter SMC algorithm
For times $t = T-1, \ldots, 1$:
For $i = 1, \ldots, N$, sample $\tilde X_t^{(i)} \sim \tilde q(\cdot \mid \tilde X_{t+1}^{(i)}, y_t)$.
For $i = 1, \ldots, N$, compute and normalise the importance weights:
$$\tilde w_t^{(i)} \propto \tilde w_{t+1}^{(i)}\, \frac{g(y_t \mid \tilde X_t^{(i)})\, \gamma_t(\tilde X_t^{(i)})\, f(\tilde X_{t+1}^{(i)} \mid \tilde X_t^{(i)})}{\gamma_{t+1}(\tilde X_{t+1}^{(i)})\, \tilde q(\tilde X_t^{(i)} \mid \tilde X_{t+1}^{(i)}, y_t)}.$$

21 Two-filter SMC algorithm: weight calculation
$$\hat p(x_t \mid y_{1:T}) \propto \int \left( \sum_{i=1}^{N} w_{t-1}^{(i)}\, \delta_{X_{t-1}^{(i)}}(x_{t-1}) \right) f(x_t \mid x_{t-1})\, \frac{\sum_{j=1}^{N} \tilde w_t^{(j)}\, \delta_{\tilde X_t^{(j)}}(x_t)}{\gamma_t(x_t)}\, dx_{t-1}$$
$$= \sum_{j=1}^{N} \tilde w_t^{(j)} \sum_{i=1}^{N} w_{t-1}^{(i)}\, \frac{f(\tilde X_t^{(j)} \mid X_{t-1}^{(i)})}{\gamma_t(\tilde X_t^{(j)})}\, \delta_{\tilde X_t^{(j)}}(x_t).$$
The approximation is based on the support of both the forward and backward filters. Intuitively, it should give a better approximation.
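
The double sum in the weight calculation can be sketched as a small function that re-weights the backward-filter particles using the forward-filter particles from the previous time step. The function and argument names are hypothetical; the transition density and the artificial prior density are passed in as callables supplied by the caller.

```python
def two_filter_weights(fwd_particles, fwd_weights, bwd_particles, bwd_weights,
                       f_pdf, gamma_pdf):
    """Smoothed weights on the backward particles at time t (illustrative sketch).

    fwd_particles, fwd_weights : forward-filter particles and weights at time t-1
    bwd_particles, bwd_weights : backward-filter particles and weights at time t
    f_pdf(x, x_prev)           : transition density f(x | x_prev)
    gamma_pdf(x)               : artificial prior density at time t
    The cost is O(N^2) in the number of particles.
    """
    raw = []
    for x_j, w_j in zip(bwd_particles, bwd_weights):
        # Forward prediction density evaluated at the backward particle.
        mix = sum(w_i * f_pdf(x_j, x_i) for x_i, w_i in zip(fwd_particles, fwd_weights))
        raw.append(w_j * mix / gamma_pdf(x_j))
    total = sum(raw)
    return [w / total for w in raw]
```

The returned weights live on the backward particles, which is why the approximation draws on the support of both filters rather than on the forward filter alone.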

22 Parameter estimation using EM
In many cases of interest, the state-space model depends on unknown parameters $\theta \in \Theta$. That is, assume [...] is independent of [...].
$$X_t \mid X_{t-1} = x_{t-1} \sim f_\theta(\cdot \mid x_{t-1}), \qquad Y_t \mid X_t = x_t \sim g_\theta(\cdot \mid x_t).$$
Use the EM algorithm to maximise the (log) likelihood.

23 Parameter estimation using EM
This iterative algorithm proceeds as follows: given a current estimate $\theta^{(i-1)}$ of $\theta$,
$$\theta^{(i)} = \arg\max_{\theta \in \Theta} Q(\theta^{(i-1)}, \theta),$$
where
$$Q(\theta^{(i-1)}, \theta) = E_{\theta^{(i-1)}}\left[\log p_\theta(x_{1:T}, y_{1:T}) \mid y_{1:T}\right] = \int \log\left(p_\theta(x_{1:T}, y_{1:T})\right) p_{\theta^{(i-1)}}(x_{1:T} \mid y_{1:T})\, dx_{1:T}.$$
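
To make the EM iteration concrete, the sketch below runs it on a discrete-time hidden Markov model with Poisson counts. This is a deliberately simplified stand-in and not the continuous-time model used later in the talk: the E-step computes smoothed state and transition probabilities, and the M-step maximises the expected complete-data log-likelihood in closed form. The data, starting values and function names are all hypothetical.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def e_step(counts, init, trans, rates):
    """Scaled forward-backward pass: log-likelihood and smoothed statistics."""
    n, T = len(init), len(counts)
    lik = [[poisson_pmf(c, lam) for lam in rates] for c in counts]
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            pred = init
        else:
            pred = [sum(alpha[t - 1][i] * trans[i][j] for i in range(n)) for j in range(n)]
        u = [lik[t][j] * pred[j] for j in range(n)]
        scale.append(sum(u))
        alpha.append([v / scale[t] for v in u])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[i][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(n))
                   / scale[t + 1] for i in range(n)]
    # post[t][i] = p(x_t = i | y_{1:T}); xi[i][j] = expected number of i -> j moves.
    post = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    xi = [[sum(alpha[t][i] * trans[i][j] * lik[t + 1][j] * beta[t + 1][j] / scale[t + 1]
               for t in range(T - 1)) for j in range(n)] for i in range(n)]
    return sum(math.log(c) for c in scale), post, xi


def em(counts, init, trans, rates, iterations):
    n, logliks = len(init), []
    for _ in range(iterations):
        loglik, post, xi = e_step(counts, init, trans, rates)
        logliks.append(loglik)
        # M-step: closed-form maximisers of Q(theta_old, theta).
        init = post[0]
        trans = [[xi[i][j] / sum(xi[i]) for j in range(n)] for i in range(n)]
        rates = [sum(p[j] * c for p, c in zip(post, counts)) / sum(p[j] for p in post)
                 for j in range(n)]
    return init, trans, rates, logliks


# Hypothetical binned event counts with quiet and busy periods.
counts = [0, 1, 0, 2, 1, 9, 11, 8, 12, 10, 1, 0, 2, 1, 0, 10, 9, 13]
init, trans, rates, logliks = em(
    counts, init=[0.5, 0.5], trans=[[0.8, 0.2], [0.2, 0.8]], rates=[1.0, 5.0],
    iterations=25,
)
```

A useful sanity check on any EM implementation is that the log-likelihood recorded at each iteration never decreases.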

24 Bayesian modelling of event data: Netflow
We will consider netflow data as a point process. We develop a Markov Modulated Poisson Process model that outputs probability estimates of the device state. Can we model user present vs user not present (e.g. for UADs)?

25 Exploratory Data Analysis: LANL netflow
"No modelling plan survives first contact with data" - N Lawrence
We have a model selection problem to solve too (future work!)

26 Markov Modulated Poisson Process: Definition
Informal: An inhomogeneous Poisson process whose rate is governed by a continuous-time Markov chain, i.e. the intensity of the Poisson process is $\lambda_i$ when $X(t) = i$, $i = 1, \ldots, r < \infty$.
It is a special case of a doubly stochastic Poisson (Cox) process.
For notational convenience, let $\lambda = \{\lambda_1, \ldots, \lambda_r\}^T$ and $\Lambda = \mathrm{diag}(\lambda)$.
Intuitively, we use the netflow event (point process) data to estimate the hidden state of the device.
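
The informal definition can be illustrated by simulating an MMPP with competing exponential clocks: in each state, the next thing to happen is either an event (at that state's intensity) or a jump of the hidden chain. The generator, intensities and function name below are hypothetical, and the sketch assumes every state has a positive total rate.

```python
import random


def simulate_mmpp(Q, rates, t_end, rng, state=0):
    """Simulate an MMPP on [0, t_end).

    Q[i][j]  : generator of the hidden chain (rows sum to zero)
    rates[i] : Poisson intensity while the chain is in state i
    Returns (event_times, path) where path lists (jump_time, new_state).
    """
    t, events, path = 0.0, [], [(0.0, state)]
    while True:
        leave = -Q[state][state]            # total rate of leaving the state
        total = leave + rates[state]        # rate of the next event or jump
        t += rng.expovariate(total)
        if t >= t_end:
            return events, path
        if rng.random() < rates[state] / total:
            events.append(t)                # an observable event occurs
        else:
            # The hidden chain jumps to j with probability q_ij / leave.
            others = [j for j in range(len(Q)) if j != state]
            u, acc, nxt = rng.random() * leave, 0.0, others[-1]
            for j in others:
                acc += Q[state][j]
                if u < acc:
                    nxt = j
                    break
            state = nxt
            path.append((t, state))


# Hypothetical two-state device: a quiet state and a short-lived busy state.
rng = random.Random(1)
events, path = simulate_mmpp(
    Q=[[-0.1, 0.1], [0.5, -0.5]], rates=[0.5, 10.0], t_end=200.0, rng=rng,
)
```

Only `events` would be observed in practice; `path` is the hidden state trajectory that the inference procedures try to recover.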

27 Continuous-time Markov chain: Notation
Introduce the infinitesimal generator of a continuous-time Markov chain, $Q = \{q_{i,j}\}_{i,j \in E}$, where $E$ is a finite index set.
Let $X_k$ be the state of the MMPP at the time of the $k$th event and let $Y_k$ be the time difference between events $k$ and $k-1$.
It is possible to view $\{X_k, Y_k\}_{k=1}^{\infty}$ as a Markov renewal process, with transition density matrix
$$f(y) = \exp\{(Q - \Lambda)\, y\}\, \Lambda.$$
For each device, we wish to estimate the state of the hidden process and the model parameters.
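
The transition density matrix can be evaluated numerically with a matrix exponential. The sketch below uses a small pure-Python scaling-and-squaring routine so that it has no dependencies; in practice a library routine would be the usual choice. The generator and intensities are hypothetical values.

```python
import math


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_exp(a, terms=20):
    """exp(a) for a small square matrix: scaling and squaring with a Taylor series."""
    n = len(a)
    norm = max(sum(abs(v) for v in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[v / 2 ** squarings for v in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_density(Q, rates, y):
    """f(y) = exp{(Q - Lambda) y} Lambda, with Lambda = diag(rates)."""
    n = len(Q)
    a = [[(Q[i][j] - (rates[i] if i == j else 0.0)) * y for j in range(n)]
         for i in range(n)]
    e = mat_exp(a)
    # Right-multiplying by the diagonal matrix Lambda scales column j by rates[j].
    return [[e[i][j] * rates[j] for j in range(n)] for i in range(n)]


# Entry (i, j): density of an inter-event time y, jointly with moving from i to j.
F = transition_density(Q=[[-0.1, 0.1], [0.5, -0.5]], rates=[0.5, 10.0], y=0.2)
```

Multiplying these matrices over the observed inter-event times, starting from an initial state distribution, gives the likelihood of the event sequence, which is the quantity the estimation procedures work with.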

28 Joint parameter and state estimation
It is "straightforward" to estimate the joint distribution of the hidden states $\{X(t)\}$ and the parameters $(Q, \Lambda)$.
Estimation can be done via Expectation Maximisation or Gibbs sampling. For the latter, a hierarchical model is utilised, with appropriate prior distributions specified.
Both estimation procedures require construction of the smoothed distribution, $P(X(t^-) = i, X(t) = j \mid y_{1:n})$.

29 Backward filtering algorithm
We need to ensure that the backward recursion computes a probability measure (rather than a finite measure, as in the general case).
The artificial prior $\gamma(\cdot)$ is chosen to allow the construction of the backward Markov kernel. This requires the marginal distribution at each time step:
$$P(X(t_k)) \propto P(X(t_{k-1})) \exp\{(Q - \Lambda)\}.$$
Using this quantity, we are able to compute the backward information filtering quantity (as a probability measure):
$$\tilde P(X(t_k) \mid y_{k+1:n}) = \tilde P(X(t_{k+1}) \mid y_{k+1:n})\, P(X(t_k) \mid X(t_{k+1})).$$

30 Results - state estimation

31 Results - parameter estimation

32 Data - known user activity

33 Results - known user activity

34 Summary: Initial model
Our goal is to devise a large-scale probabilistic model that incorporates as much prior information as possible. We started with a simplistic model: device state estimation from event data. We used an MMPP to model the data and introduced a procedure for parameter estimation of $(Q, \Lambda)$.

35 Summary
Initial model: We used this model and the parameter estimates to produce an estimate of the device state.
Future work: Online parameter estimation. What is an appropriate number of states, and what do they correspond to? A hierarchical model may offer a solution to this problem...
Collaborations are welcome (mbriers@turing.ac.uk or m.briers@imperial.ac.uk).
