Learning Patterns for Detection with Multiscale Scan Statistics

Size: px

Start display at page:

Download "Learning Patterns for Detection with Multiscale Scan Statistics"

Neil Carroll
5 years ago
Views:

1 Learning Patterns for Detection with Multiscale Scan Statistics J. Sharpnack 1 1 Statistics Department UC Davis UC Davis Statistics Seminar 2018 Work supported by NSF DMS

2 Hyperspectral Gas Detection

3 Hyperspectral Gas Detection Each pixel of the image is a light spectrum, and we are interested in detecting a single chemical signature. Images from [Manolakis et al. 2014, Manolakis & Shaw 2002]

4 Hyperspectral Gas Detection Gas signature level in each pixel. Dataset and chemical signature comparison code from Dimitris G. Manolakis and DTRA.

5 Hyperspectral Gas Detection Decreased SNR by 1 5th. Can you see it?

6 Hyperspectral Gas Detection If classification answers the question, what am I seeing? detection answers the question, do I see anything at all?

7 Anomaly detection goals Goal 1: Reliably detect anomalous patterns within images beyond what the human eye can see at the precise information theoretic limit.

8 Detection applications Contaminant detection in water networks, real-time surveillance system, radiation monitoring, fire detection and other remote sensing applications, medical imaging and automated radiology, early detection of pathogen outbreaks.

9 Rectangular multiscale scan statistic Scan every rectangle (vary location and scale) looking for abnormal concentrations [S, Arias-Castro 16]. Figure: A rectangle over the active region.

10 Scan statistic Represent image over [ L, L] [ L, L] as matrix Y and consider scanning rectangle over pixels [ H 0, H 0 ] [ H 1, H 1 ]. Define the pattern to be P k,l = 1 (2H0 + 1) (2H 1 + 1) then the scan is the following convolution, (P Y ) k,l = H 0 k = H 0 H 1 l = H 1 Y k k,l l P k,l, where defined. The single-scale scan statistic is ŝ = max (P Y ) k,l. k,l

11 Other patterns General pattern over the domain Ω = [ L, L] d can be hidden in the noisy tensor. Figure: A simulated time series with an embedded sinusoidal signal with values on the y-axis (d = 1).

12 Multiscale scan Scanning with many pattern dimensions, H 0,..., H d, is called a multiscale scan. How do we compare scan statistics at different scales?

13 Anomaly detection goals Goal 2: General purpose analysis for multiscale scan statistics for a large class of patterns.

14 Multiple Tensors Figure: Multiple tensors, i = 1,..., n (n = 5), with different locations and scales, and two possible patterns (left and right).

15 Anomaly detection goals Goal 3: Leverage database of tensors to simulaneously learn and detect anomalous patterns.

16 Prior work [Naus 65] scan statistics introduced for point cloud data [Siegmund, Worsley 95] limit distribution of 1-dimensional scan [Glaz and Zhang 04, Kabluchko 11] limit in d-dimensions [Arias-Castro et al. 05, 11] scan for blob-like patterns [Dumbgen, Spokoiny, 01] scale adaptive scan statistic (d = 1) [S, Arias-Castro 16] scale adaptive rectangular scan [Proksch at al. 17] scale adaptive smooth patterns This work: learning and detecting general smooth patterns in a database of tensors with scale adaptive methods

17 Continuous model Pattern f F C 1 over [ 1, 1] d, f L2 = 1. Data is random measure dx i with domain [ L, L] d. Scale dilation f h := h 1/2 f (./h), h = j h j, h R d Null hypothesis: data is just noise (dw i is d-dimensional Wiener process) Alternative hypothesis: there is a signal f at location t i, and scale h i. H 0 : dx i (τ) = dw i (τ), i = 1,..., n H 1 : dx i (τ) = µf h i (t i τ)dτ + dw i (τ) for some f F, and (t i, h i ) D, i = 1,..., n.

18 Continuous multiscale scan statistic Convolution at scale h, (f h dx i )(t) = f h (τ)dx i (t τ) = 1 h f (τ)dx i (t hτ), Scale corrected multiscale scan statistic: s(x i ; f ) := max h H v h h H := j [1, L) t T h := j [ (L h j ), L h j ] v h = 2 j log(l/h j) ( max t T h (f h dx i )(t) v h ). (1) Test if the pattern f centered at t and scaled by h is hidden within tensor X i.

19 Learning patterns Given a dataset of images X i, i = 1,..., n then can we also learn the pattern f F? S n (X ; F) := max f F 1 n n s(x i ; f ) i=1 (PAMSS) The pattern adapted multiscale scan statistic (PAMSS) averages the MSS for each tensor. Smoothness conditions on F are required: bounded variation (TVC) or average Hölder condition (AHC).

20 Smoothness assumptions Assumption TVC: Define the isotropic total variation, f TV := f (u) 2 du, function are of bounded variation, Ω γ 1 > 0 s.t. f F, f TV γ 1. (TVC) or Assumption AHC: Define the Hölder functional, A t,s (f ) := f (t z) f (s z) 2 dz. Ω L functions have bounded average Hölder condition, 0 < γ 2 1 s.t. f F, A t,s (f ) c A t s 2γ 2 2. (AHC)

21 Type 1 error control We can simulate from the null distribution to obtain a significance level for ( ) s(x i ; f ) := max v h max(f h dx i )(t) v h. (2) h H t T h [Dumbgen & Spokoiny 01] showed that under H 0 for L large enough s(x i, f ) = O P (log log L) for functions satisfying (TVC) in 1D. We need a more precise control to analyze PAMSS (tail bound) s(x i, f ) is subexponential random variable.

22 SubGaussian process Definition We say that a random field, {Z(ι)} ι I, is a (zero mean) standard subgaussian process if there exists a constant u 0 > 0 such that P { Z(ι 0 ) Z(ι 1 ) u} 2 exp P {Z(ι 0 ) u} exp ( u2 2ν(ι 0, ι 1 ) ), (3) ) ( u2, (4) 2 for any ι 0, ι 1 I, u > u 0, and ν(ι 0, ι 1 ) = E(Z(ι 0 ) Z(ι 1 )) 2, is the canonical distance. Under H 0, {(f h dx i )(t) : (t, h) D} is a subgaussian random field with canonical distance ν f ((h 0, t 0 ), (h 1, t 1 )) := f h0 (t 0.) f h1 (t 1.) L2.

23 Dudley s chaining Define the covering number of metric space (I, ν), N (I, ν, ɛ), to be the number of balls of ν-radius ɛ that is required to cover I. Theorem (Dudley s entropy (tail) bound) { } P sup Z(η) > u c D + C η I Ce u2 2, such that D = 0 2 log N (I, ν, ɛ)dɛ for universal constants c, C. Specifically, if N (I, ν, ɛ) Γɛ ρ. then (as Γ ) D = 2 log Γ + o(1).

24 Dudley s chaining Find a sequence (the chain) {η k } k=0 such that lim k η k = η, sup Z(η) = sup Z(η 0 ) + (Z(η 1 ) Z(η 0 )) + (Z(η 2 ) Z(η 1 )) +... η I η 0,η 1,...

25 Dudley s chaining Find a sequence (the chain) {η k } k=0 such that lim k η k = η, sup Z(η) = sup Z(η 0 ) + (Z(η 1 ) Z(η 0 )) + (Z(η 2 ) Z(η 1 )) +... η I η 0,η 1,... Z(η i+1 ) Z(η i ) is prop. to ν(η i+1, η i ) Choose covering layers such that N grows like 2 2k Variance of the bound is dominated by Z(η 0 )

26 Chaining standardized suprema Theorem Let Z(η) be a standard subgaussian process over an index set I. Suppose that metric (I, d Z ) has covering number, N s.t., N (I, d Z, ɛ) Γɛ ρ. (5) Then there exists an Γ 0 > 0 such that for any Γ Γ 0, the following supremum is bounded in probability, ( ) } { c0 P log Γ sup Z(η) 2 log Γ a 0 log log Γ > u e u, η I (6) for u > u 0 where u 0, c 0, a 0 are constant depending on ρ (but not on Γ). In words, the supremum of such a subgaussian process is subexponential with location and rate parameter, (2 log Γ) 1/2 (omitting the log log term).

27 Proof sketch for chaining bound For iid normals, {z i } N i=1, from union bound { P max z i > } 2 log N + u 2 e u2 2. i Generic chaining: 2 log N + u 2 u + 2 log N/(2u) Our chaining: 2 log N + u 2 2 log N + u 2 /(2 2 log N) { P 2 ( 2 log N max z i ) 2 log N i Modify chain so that } > u e u. (1) start the chain at a deeper level (N large enough) (2) make the covers grow slowly N a ak for a 1.

28 Main Theorem Theorem Let F be finite and assume that either all functions in F satisfy either (TVC) or (AHC). Let K log F n (δ) := ( F δ K n log ( F δ then for some constant K, under H 0, (7) ) ), log F n K + log δ, log F > n K + log δ P {S n (X, F) > F n (δ) log log L} δ. (8)

29 Proof of main theorem Lemma Then, under the above conditions, there is a constant C depending on d alone such that 1. Suppose that (TVC) holds for the class F, then ν f ((t, h), (t, h )) 2 Cγ 1 t t 2 ( ) 2 h h + 1. h 2. [Proksch et al. 16] Suppose that (AHC) holds for the class F, then ν f ((t, h), (t, h )) 2 t j t 2γ 2 j h C + j h 2γ 2 j h j 2γ 2 h j h j. 2 2γ 2

30 Proof of main theorem Lemma Suppose that f F satisfies either (TVC) or (AHC). Let l {0,..., log 2 L } d, and H 2 (l) = j [2 l j, 2 l j +1 ]. Then { } ( P c 1 max v h (fh dx i ) )(t) v h a1 log log L > u e u h H l,t T h for constants a 1, c 1 > 0 depending on γ, d only. With the union bound over l, { P c 2 sn(x i }, f ) log log L a 2 > u e u, then use subexponential Bernstein inequality.

31 Asymptotic Distinguishability Corollary Suppose that log F = o(n), and recall that under the alternative hypothesis, H 1, X i has an embedded pattern f at scale h i and v 2 = h i j log(l/hi j ), and the noise is a standard Wiener process. Suppose also that hj i Lc for some 0 c < 1 for all i, j, then the PAMSS is asymptotically powerful (has diminishing probability of type 1 and type 2 error) if µ 2 n i=1 v 2 h i n i=1 v h i. (9) We take this result to mean that as long as the function class, F, does not grow exponentially in n, we achieve asymptotic power under the same conditions as if F = 1.

32 Summary Proposed the Pattern Adapted Multiscale Scan Statistic for learning anomalous patterns We proved a refined concentration result for the supremum of subgaussian processes We controlled the error probabilities for the PAMSS showing that it can learn the function from a class for free as long as n log F For a link to this paper: Thanks!

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez