Genome 541, Unit 4, lecture 3: Genomics assays


1 Genome 541, Unit 4, lecture 3: Genomics assays

2 Feedback from one-minute responses:
- "Much easier to follow with slides. Good pace."
- "Having the slides was really helpful: clearer to read and easier to follow the trajectory of the lecture."
Linear algebra:
- "A single slide of linear algebra / matrix notation might have helped."
- "For people who have not taken a math course for several years, it was hard to follow the convexity part of the lecture."
- "We all generally understand derivatives/Hessians; not useful to do at this level of detail and generally not relevant."
Convexity stuff:
- "More time preparing examples would make the math more understandable."
- "The convexity proof was a bit confusing."

3 Is exp(x) convex?
Proof from second derivative: $\frac{d}{dx}\exp(x) = \exp(x)$, $\frac{d^2}{dx^2}\exp(x) = \exp(x) \ge 0$.
Proof by picture: [figure]
Proof from definition: $\exp(\lambda x + (1-\lambda)x') = \exp(x)^{\lambda}\,\exp(x')^{1-\lambda} \le \lambda\exp(x) + (1-\lambda)\exp(x')$ (arithmetic mean / geometric mean inequality).

4 Is sin(x) convex?
Counterexample from definition: $\tfrac{1}{2}\sin(\pi) + \tfrac{1}{2}\sin(0) = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 0 = 0 < \sin\big(\tfrac{1}{2}\pi + \tfrac{1}{2}\cdot 0\big) = \sin(\pi/2) = 1$.
Counterexample by second derivative: $\frac{d}{dx}\sin(x) = \cos(x)$, $\frac{d^2}{dx^2}\sin(x) = -\sin(x)$, and $-\sin(\pi/2) = -1 < 0$.
Disproof by picture: [figure]
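Both slides can also be spot-checked numerically. A minimal sketch (not from the lecture) that tests the convexity definition on random points; `convex_on_samples` is an illustrative helper name:

```python
# Numeric spot-check of the convexity definition
# f(l*x + (1-l)*x') <= l*f(x) + (1-l)*f(x') on random points:
# it holds for exp and fails for sin, matching the slides.
import numpy as np

rng = np.random.default_rng(0)

def convex_on_samples(f, n=10000):
    x, xp = rng.uniform(-5, 5, n), rng.uniform(-5, 5, n)
    lam = rng.uniform(0, 1, n)
    lhs = f(lam * x + (1 - lam) * xp)
    rhs = lam * f(x) + (1 - lam) * f(xp)
    return bool(np.all(lhs <= rhs + 1e-9))   # small numerical tolerance

print(convex_on_samples(np.exp))  # True: no violation found
print(convex_on_samples(np.sin))  # False: convexity violated
```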

5 Is $x^2$ convex?

6 Is $x^3$ convex?

7 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Other problems:
  - Imputing the output of genomics data assays that haven't been performed
  - Unbiased discovery of functional elements and functional element types

8 ChIP-exo has better spatial resolution than ChIP-seq

9 DamID measures TF binding through a fusion protein
Dam+TF fusion protein; measure methylation at GATCs.
DamID vs. ChIP-seq:
- DamID can be easier: ChIP requires a (specific) antibody, while DamID requires a fusion protein.
- DamID can't query post-translational modifications (histone mods).
- ChIP has better spatial resolution.
- ChIP is limited by cross-linking bias; DamID is limited by GATC content and Dam reactivity.
- ChIP has better temporal resolution: Dam acts over ~24 hours.

10 DNase-seq and ATAC-seq measure DNA accessibility Raj and McVicker. Nature Methods 2014

11 High-depth DNase-seq (DNase digital genomic footprinting (DGF)) measures bp-level TF binding Neph et al. Nature 2012

12 Paired-end DNase (DNase-FLASH) and ATAC-seq measure nucleosome architecture Vierstra et al. Nature Methods 2014

13 Converting genomics data sets into signal tracks
- Extend according to fragment length
- Sum
- Account for biases
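A minimal sketch of the extend-and-sum step, assuming reads are given as 0-based start positions with strands; the function name and fixed fragment length are illustrative, not any specific pipeline's API:

```python
# Build a per-base signal track: extend each read to the fragment length
# in its strand direction, then sum coverage at every position.
import numpy as np

def signal_track(read_starts, strands, frag_len, chrom_len):
    track = np.zeros(chrom_len)
    for start, strand in zip(read_starts, strands):
        if strand == "+":
            lo, hi = start, min(start + frag_len, chrom_len)
        else:
            lo, hi = max(start - frag_len + 1, 0), start + 1
        track[lo:hi] += 1.0
    return track

# Toy usage: three reads on a 50 bp "chromosome", 10 bp fragments.
print(signal_track([5, 12, 30], ["+", "+", "-"], frag_len=10, chrom_len=50))
```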

14 Accounting for biases
Expected signal: average of a control experiment in (for example) a 1 kb window.
Two representations of signal data:
- Fold enrichment: observed / expected
- Poisson p-value: $-\log_{10}\big[\mathrm{Pois}(\text{observed} \mid \lambda = \text{expected}) \,/\, \mathrm{Pois}(\text{observed} \mid \lambda = \text{observed})\big]$
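A hedged sketch of both representations for a single bin. The ratio line follows the slide's formula; the tail-probability line shows the more common upper-tail convention for comparison. The counts are made-up values:

```python
# Fold enrichment and Poisson-based scores for one bin.
import numpy as np
from scipy.stats import poisson

observed = 25     # read count in the bin (illustrative)
expected = 8.0    # control mean in a 1 kb window (illustrative)

fold_enrichment = observed / expected

# Slide's form: -log10 of the likelihood ratio between
# lambda = expected and lambda = observed.
llr_score = -np.log10(poisson.pmf(observed, expected) /
                      poisson.pmf(observed, observed))

# For comparison, the upper-tail p-value convention: P(X >= observed).
tail_score = -np.log10(poisson.sf(observed - 1, expected))
```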

15 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Other problems:
  - Imputing the output of genomics data assays that haven't been performed
  - Unbiased discovery of functional elements and functional element types

16 Problem: Predict TF binding at motifs from DNase/ATAC-seq data

17 Several methods have been developed to address this problem
- FIMO+prior: use a prior based on DNase for motif scanning
- CENTIPEDE: logistic regression
- Deep neural networks
- PIQ (Protein Interaction Quantitation): generative model

18 Bound factors leave identifiable DNase-seq profiles
Individual binding site prediction is difficult. [Figure: DNase profiles at individual CTCF sites vs. the aggregate CTCF profile; image: MIT]

19 Idea: Learn a characteristic profile for each TF
[Figure: "CTCF bound?" indicator above DNase read counts; binding effect window = 400 bp]

20 Idea: Nearby factors contribute to profile
[Figure: CTCF and Oct4 "Bound?" indicators along genomic position, above DNase read counts; binding effect window = 400 bp]

21–22 Overdispersion model: Poisson over normal
Latent signal strength: $\mu_i \sim N(\mu_0, \sigma^2)$
DNase read counts: $x_i \sim \mathrm{Pois}(\exp(\mu_i))$
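A minimal simulation sketch of this Poisson-over-normal model; mu0 and sigma are illustrative values, not fitted parameters:

```python
# Draw latent log-intensities from a normal, then counts from a Poisson,
# and confirm the mixture is overdispersed (variance > mean).
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma, n = 0.5, 1.0, 100_000
mu = rng.normal(mu0, sigma, size=n)   # latent signal strength mu_i
x = rng.poisson(np.exp(mu))           # DNase read counts x_i

print(x.mean(), x.var())              # variance well above the mean
```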

23 Binding model
[Figure: CTCF binding indicator (I) and binding effect (β_CTCF) applied over a 400 bp window, added to background accessibility to give the signal strength (μ) that generates DNase read counts (x)]

24 Full model
[Figure: CTCF and Oct4 motifs with "Bound?" indicators, latent intensity, DNase read counts]
$\hat\mu_i = \mu_i + \sum_{m=1}^{M} I_m\,\beta_{T_m}(i)$
$P(X_{1\ldots N} \mid \mu_{1\ldots N}, I_{1\ldots M}) = \prod_{i=1}^{N} \mathrm{Pois}\big(X_i \mid \exp(\hat\mu_i)\big)$

25 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Other problems:
  - Imputing the output of genomics data assays that haven't been performed
  - Unbiased discovery of functional elements and functional element types

26 Gradient descent
Gradient descent update step: $x^{(k+1)} \leftarrow x^{(k)} - t\,\nabla f(x^{(k)})$
$t$: learning rate. Must be chosen.
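A minimal sketch of this update loop on a toy quadratic; the objective, step size, and iteration count are illustrative:

```python
# Gradient descent: repeat x <- x - t * grad_f(x).
def gradient_descent(grad_f, x0, t=0.1, n_steps=100):
    x = x0
    for _ in range(n_steps):
        x = x - t * grad_f(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); optimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```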

27 Convergence properties of gradient descent
- Gradient descent is guaranteed to converge if $t$ is small enough.
- The value of $t$ depends on the curvature of the function. If $f(x)$ satisfies $\|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2$, then gradient descent with $t \le 1/L$ is guaranteed to converge.

28 Pros and cons of gradient descent
Pros:
- Very simple and easy to implement.
- Requires computing only the first derivative.
- Updates to each variable can be computed independently.
Cons:
- Requires tuning the learning rate.
- Slower than other methods.

29 Aside: Optimizing quadratic functions
For $f(x) = \frac{1}{2}x^{\top}Px + q^{\top}x$ with $P$ positive semi-definite:
$\frac{df}{dx} = Px + q$, and $\frac{df}{dx} = 0 \implies x = -P^{-1}q$.
The optimum can be computed in closed form.
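A sketch of the closed-form solve for an illustrative P and q, solving the linear system rather than forming the inverse:

```python
# Minimize f(x) = (1/2) x^T P x + q^T x by solving P x = -q.
import numpy as np

P = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # positive definite here
q = np.array([1.0, -2.0])
x_opt = np.linalg.solve(P, -q)      # x = -P^{-1} q without inverting P
print(x_opt, P @ x_opt + q)         # gradient at x_opt is ~0
```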

30 Newton's method
Newton's method update step: $x^{(k+1)} \leftarrow x^{(k)} - \big(\nabla^2 f(x^{(k)})\big)^{-1}\,\nabla f(x^{(k)})$
Newton's method minimizes the second-order Taylor expansion of $f$:
$f(y) \approx f(x) + \nabla f(x)^{\top}(y - x) + \tfrac{1}{2}(y - x)^{\top}\,\nabla^2 f(x)\,(y - x)$
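A minimal Newton loop matching the update above; on the quadratic from the previous aside it converges in a single step, as expected:

```python
# Newton's method: x <- x - H(x)^{-1} grad(x), via a linear solve.
import numpy as np

def newton(grad, hess, x0, n_steps=10):
    x = x0
    for _ in range(n_steps):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([1.0, -2.0])
x_opt = newton(lambda x: P @ x + q, lambda x: P, np.zeros(2))
print(x_opt)
```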

31 Pros and cons of Newton's method
Pros:
- Extremely fast.
- Guaranteed to converge (no hyperparameters).
Cons:
- Requires the second derivative.
- Updates to each variable are not independent.

32 Variant: Coordinate descent
Update each variable (or subset of variables) separately (using any method). Works best when each subset has closed-form updates.
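A sketch of cyclic coordinate descent on the same toy quadratic, where each one-dimensional subproblem has a closed-form update:

```python
# Coordinate descent: minimize f(x) = (1/2) x^T P x + q^T x one
# coordinate at a time; each 1-D subproblem has a closed form.
import numpy as np

P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([1.0, -2.0])
x = np.zeros(2)
for _ in range(50):
    for j in range(len(x)):
        # Set df/dx_j = P[j] @ x + q[j] = 0, holding other coordinates fixed.
        x[j] = -(q[j] + P[j] @ x - P[j, j] * x[j]) / P[j, j]
print(x)
```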

33 Variant: Stochastic gradient descent
- Choose a random example (or subset of examples) for each update step.
- Usually much faster than gradient descent.
- Requires decreasing the learning rate with each iteration in order to guarantee convergence.
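A least-squares SGD sketch with a decaying step size, as the slide requires; the data and step-size schedule are illustrative:

```python
# SGD on (1/2)||A w - b||^2: one random row per step, step size ~ 1/sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true + 0.1 * rng.normal(size=500)

w = np.zeros(3)
for k in range(1, 5001):
    i = rng.integers(len(b))              # one random example
    grad_i = (A[i] @ w - b[i]) * A[i]     # gradient of one squared residual
    w -= (0.1 / np.sqrt(k)) * grad_i      # decreasing learning rate
print(w)                                   # close to w_true
```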

34–36 PIQ optimization
$\hat\mu_i = \mu_i + \sum_{m=1}^{M} I_m\,\beta_{T_m}(i)$, with $P(X_{1\ldots N} \mid \mu_{1\ldots N}, I_{1\ldots M}) = \prod_{i=1}^{N}\mathrm{Pois}\big(X_i \mid \exp(\hat\mu_i)\big)$ and $\mathrm{Pois}(X \mid \exp(\mu)) = \frac{1}{X!}\exp\big(X\mu - \exp(\mu)\big)$.
Log-likelihood: $\ell(\beta) = \log P(X \mid \mu, I) = \sum_{i=1}^{N}\Big[\log\frac{1}{X_i!} + X_i\hat\mu_i - \exp(\hat\mu_i)\Big]$
Gradient: $\frac{\partial\ell}{\partial\beta_T(j)} = \sum_{i=1}^{N}\frac{\partial\ell}{\partial\hat\mu_i}\,\frac{\partial\hat\mu_i}{\partial\beta_T(j)} = \sum_{i=1}^{N}\big(X_i - \exp(\hat\mu_i)\big)\,\frac{\partial\hat\mu_i}{\partial\beta_T(j)}$,
where $\frac{\partial\hat\mu_i}{\partial\beta_T(j)} = 1$ if position $i$ is the $j$th position of a motif of $T$, and $0$ otherwise.
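A hedged sketch of that gradient for a single factor's binding-effect profile. The motif bookkeeping is simplified (one window per bound motif, assumed away from chromosome edges), and all names are illustrative rather than PIQ's actual code:

```python
# Gradient of the Poisson log-likelihood w.r.t. beta_T(j):
# sum over bound motifs of (X_i - exp(mu_hat_i)) at offset j.
import numpy as np

def grad_beta(X, mu, beta, motif_starts, I):
    W = len(beta)                          # binding effect window (e.g. 400)
    mu_hat = mu.copy()
    for s, bound in zip(motif_starts, I):  # add the effect at bound motifs
        if bound:
            mu_hat[s:s + W] += beta
    resid = X - np.exp(mu_hat)             # d l / d mu_hat_i
    grad = np.zeros(W)
    for s, bound in zip(motif_starts, I):
        if bound:
            grad += resid[s:s + W]         # d mu_hat_i / d beta_T(j) = 1 here
    return grad
```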

37 PIQ predicts ChIP-seq peaks accurately

38 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Other problems:
  - Imputing the output of genomics data assays that haven't been performed
  - Unbiased discovery of functional elements and functional element types

39 Semi-automated genome annotation algorithms partition and label the genome on the basis of functional genomics tracks
[Figure: H3K36me3, DNase I, and RNA-seq tracks with the resulting annotation]
- HMMSeg: Day et al. Bioinformatics, 2007
- ChromHMM: Ernst, J. and Kellis, M. Nature Biotechnology, 2010
- Segway: Hoffman, M. et al. Nature Methods, 2012

40 Semi-automated genome annotation algorithms use dynamic Bayesian network models
[Figure: a hidden segment-label random variable emitting observed DNase I, H3K27me3, and RNA-seq signals]

41 Semi-automated genome annotation recovers known types of genome elements
[Figure: annotations at an enhancer, a gene, and a CTCF site] Hoffman et al. Nucleic Acids Research 2012.

42 Semi-automated genome annotation at coarse resolution discovers chromatin domain types
- QUI: quiescent domains
- CON: constitutive heterochromatin (H3K9me3)
- FAC: facultative heterochromatin (H3K27me3; e.g. the inactive X)
- SPC: specific expression domains
- BRD: broad expression domains
- Regulatory elements

43 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Other problems:
  - Imputing the output of genomics data assays that haven't been performed
  - Unbiased discovery of functional elements and functional element types

44 Problem: Can we impute the output of missing experiments?
[Matrix figure: 316 assay types × 346 cell types, each cell marking whether the experiment was performed. Assays include DNase-seq, H3K4me3, H3K27me3, H3K36me3, H3K9me3, H3K4me1, H3K27ac, H3K9ac, CTCF, and DNase-DGF; cell types include K562, GM12878, H1-hESC, HepG2, HeLa-S3, A549, IMR90, HUVEC, NHEK, H9ES, and MCF-7.] ENCODE; Roadmap Epigenomics.

45 Prediction function
Machine learning predictor: regression trees. The imputed signal $\bar{o}_{c_t, m_t, p}$ for target cell type $c_t$, mark $m_t$, and position $p$ averages the predictions of regressors $f_{c_{t'}}$ trained in the other cell types $c_{t'} \in C_{m_t}$ that have mark $m_t$:
$\bar{o}_{c_t, m_t, p} = \frac{1}{|C_{m_t}|}\sum_{c_{t'} \in C_{m_t}} f_{c_{t'}}\big(o_{c_{t'}, m_t, p} \mid F_{c_{t'}, m_t, p}\big)$
Ernst and Kellis. Nature Biotechnology 2015.
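A toy sketch of the averaging in this formula using scikit-learn regression trees; the features, signals, and training cell types are simulated stand-ins, not ChromImpute's actual feature set:

```python
# Train one regression tree per training cell type, then average their
# predictions on the target cell type's features.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_pos, n_feat = 2000, 10
train_cells = ["K562", "GM12878", "HepG2"]   # plays the role of C_{m_t}

F = {c: rng.normal(size=(n_pos, n_feat)) for c in train_cells}  # features
o = {c: F[c][:, 0] + 0.1 * rng.normal(size=n_pos) for c in train_cells}

F_target = rng.normal(size=(n_pos, n_feat))  # features in the target cell type
preds = [DecisionTreeRegressor(max_depth=5).fit(F[c], o[c]).predict(F_target)
         for c in train_cells]
o_bar = np.mean(preds, axis=0)               # averaged imputed track
```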

46 Features
- Other marks' features in the same sample ($c_t$)
- The same mark's ($m_t$) features in other samples ($c_{t'}$)
Ernst and Kellis. Nature Biotechnology 2015.

47 Imputed data has high correlation with observed data. Ernst and Kellis. Nature Biotechnology 2015.

48 Imputed data recovers promoters and TSSs better than observed data. Ernst and Kellis. Nature Biotechnology 2015.

49 Administrivia
- Homework 7 is up online. Due Friday.
- Homework 8 will be up by the end of the week.
- Next class: chromatin 3D architecture.
- Please write 1-minute responses.
