Genome 541, Unit 4, Lecture 2: Transcription factor binding using functional genomics


Leon Tyler
5 years ago

1 Genome 541 Unit 4, lecture 2 Transcription factor binding using functional genomics

2 Slides vs. chalk talk:
- I'm not sure why you chose a chalk talk over ppt. I prefer the latter: no issues with readability or time wasted writing things.
- Having slides can be useful for note-taking and going back afterward.
- Some of the equations were hard to read, so difficult to copy in notes.
- It would be really helpful to have more slides. It's difficult to read the writing on the board, and clearer diagrams would be quite helpful.
- Slides would have been helpful for equations and charts.
- Very clear. But better with slides.
Content, organization, pace:
- The lecture story might be easier if there was an outline left on the board.
- More details on MEME (what the exact likelihood calculation is), perhaps instead of FWER, which most of us have seen. Question: where did you learn about FWER?
- Clear and interesting, good pace.
- The material was too densely presented to digest if I hadn't seen this material before.

3 False discovery rate vs. false positive rate
TP: True positives
TN: True negatives
FP: False positives
False discovery rate: FP / (FP + TP)
False positive rate: FP / (FP + TN)

4 Statistical power Statistical power: The probability you reject the null hypothesis if the alternative hypothesis is true (at a given p-value threshold). We want to maximize power at a given significance level.

5 Accuracy-related terms cheat-sheet
Formula: Names
- TP: True positives
- TN: True negatives
- FP: False positives; Type I errors
- FN: False negatives; Type II errors
- TP / (TP + FN): Recall; Sensitivity; True positive rate; Power; 1 - False negative rate
- TP / (TP + FP): Precision; Positive predictive value; 1 - False discovery rate
- TN / (TN + FP): Specificity; True negative rate; 1 - False positive rate; 1 - Statistical significance level
- (TP + TN) / (TP + TN + FP + FN): Accuracy
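The cheat-sheet formulas can be sketched in a few lines of Python. The function name `confusion_metrics` and the counts below are hypothetical, chosen only to illustrate a genome-scale setting in which unbound bins vastly outnumber bound ones, so a small false positive rate can coexist with a large false discovery rate.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the cheat-sheet rates for one set of confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),                 # sensitivity, TPR, power
        "precision": tp / (tp + fp),              # PPV, 1 - FDR
        "specificity": tn / (tn + fp),            # TNR, 1 - FPR
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "false_discovery_rate": fp / (fp + tp),
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical example: 1,000,000 genomic bins, of which 1,000 are truly bound.
metrics = confusion_metrics(tp=800, tn=998_000, fp=1_000, fn=200)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, about 0.1% of unbound bins are called (false positive rate), yet more than half of all calls are wrong (false discovery rate).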

6 Chromatin immunoprecipitation followed by sequencing (ChIP-seq) Sequence and map to reference genome

7 Problem: Given a ChIP-seq experiment for factor X, where does X bind?

8 Problem: Given a ChIP-seq experiment for factor X, where does X bind?
Short answer: Stack up the reads in the genome; choose the tall stacks.
Issues to consider:
- Sequencing fragment lengths
- Sequencing read lengths
- Mappability
- GC bias
- How to pick a threshold and assign statistical confidence?

9 Today's class
- Accounting for GC and mappability bias in peak calling
- Convex functions
- More ChIP-seq peak calling considerations:
  - Handling sequencing fragment sizes
  - Choosing a p-value threshold
  - Control experiments
- Other functional genomics assays

10 ChIP-seq read counts are biased by GC content and mappability

11 How can we model the background distribution of ChIP-seq reads?

12 The Poisson distribution models sequencing read counts
$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$
Mean: $\lambda$
Variance: $\lambda$
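As a rough illustration of how a Poisson background could assign statistical confidence to a tall stack of reads, the sketch below computes the pmf and an upper-tail p-value. The function names, the background rate of 2 reads per bin, and the observed count of 12 are all hypothetical; this is not the method of any particular peak caller.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson with mean lam > 0, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_upper_tail(k, lam):
    """P(K >= k): chance of seeing at least k reads under the background."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

# Hypothetical bin: background rate of 2 reads per bin, 12 reads observed.
p_value = poisson_upper_tail(12, 2.0)
print(f"P(K >= 12 | lambda = 2) = {p_value:.3g}")
```

Subtracting the lower sum from 1 loses precision for extremely small tails; summing the upper tail directly would be a safer choice at genome-wide thresholds.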

13 The negative binomial distribution models sequencing read counts more flexibly than Poisson
$P(k \mid r, p) = \binom{k + r - 1}{k} (1 - p)^r p^k$
Mean: $\frac{pr}{1 - p}$
Variance: $\frac{pr}{(1 - p)^2}$

14 The negative binomial distribution models the mean and variance separately Principle: Make the weakest assumptions you can afford to in your modeling choices
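A quick numerical check of the mean and variance formulas from the negative binomial slide, as a sketch. The helper names and the values r = 2, p = 0.6 are arbitrary assumptions; the binomial coefficient is written with `lgamma` so that r does not have to be an integer.

```python
import math

def negbin_pmf(k, r, p):
    """P(k | r, p) = C(k + r - 1, k) (1 - p)^r p^k, for r > 0 and 0 < p < 1."""
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(1 - p) + k * math.log(p))

def moments(pmf, kmax):
    """Mean and variance of a count distribution, truncated at kmax."""
    mean = sum(k * pmf(k) for k in range(kmax))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(kmax))
    return mean, var

r, p = 2.0, 0.6  # arbitrary illustrative values
mean, var = moments(lambda k: negbin_pmf(k, r, p), kmax=2000)
print(mean, p * r / (1 - p))       # numerical mean vs. pr / (1 - p)
print(var, p * r / (1 - p) ** 2)   # numerical variance vs. pr / (1 - p)^2
```

Because the variance equals the mean divided by (1 - p), it always exceeds the mean, whereas a Poisson forces the two to be equal.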

15 MOSAiCS background model
Number of counts at 50-bp bin $j$: $N_j \sim \mathrm{NegBin}(a, a/\mu_j)$
$\mu_j = \exp(\beta_0 + \beta_M M_j + \beta_{GC} GC_j)$
$M_j$: Mappability at bin $j$
$GC_j$: GC content at bin $j$

16 How do we know if the MOSAiCS model is even optimizable?

17 Answer: The negative log likelihood is convex The class of convex functions (defined on the next slide) roughly corresponds to the set of efficiently optimizable functions. When presented with an objective function, the first step is usually to check if it is convex.

18 Convex functions
A function f(x) is convex if it satisfies the property:
$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$ for all $x, y$ and $0 \le \theta \le 1$.
For a convex function, every local minimum is a global minimum.

19 Concave functions A function f(x) is concave if -f(x) is convex. Convex functions are usually efficiently minimizable. Concave functions are usually efficiently maximizable. (A function can be neither convex nor concave.)
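The definition can be spot-checked numerically by sampling points and mixing weights. This is only a sketch: the function name `violates_convexity`, the sampling ranges, and the tolerance are assumptions, and passing the check is evidence of convexity on the sampled interval, not a proof.

```python
import math
import random

def violates_convexity(f, lo, hi, trials=10_000, tol=1e-9, seed=0):
    """Search for x, y, t where f(t x + (1 - t) y) > t f(x) + (1 - t) f(y)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        t = rng.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return True   # found a counterexample to the definition
    return False          # no counterexample found in the sample

print(violates_convexity(math.exp, -5, 5))                  # exp is convex
print(violates_convexity(math.sin, 0, 2 * math.pi))         # sin is neither
print(violates_convexity(lambda x: -math.log(x), 0.1, 10))  # log is concave, so -log is convex
```

The last line uses the concave-function slide's definition: log is concave exactly when -log passes the convexity check.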

20 Examples on one variable Stanford EE364a

21 Examples on one variable Convex: Non-convex: Duke stat376

22 Examples on vectors $x \in \mathbb{R}^n$
Affine function (convex and concave): $a^T x + b$, with $a \in \mathbb{R}^n$, $b \in \mathbb{R}$
Euclidean norm (convex): $\sqrt{\sum_i x_i^2}$

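23 First derivative criterion for convexity
Derivative of a function on vectors: $\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$
A differentiable f(x) is convex if and only if: $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$
Stanford EE364a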

24 Second derivative criterion for convexity
If f(x) is on one variable, f(x) is convex if and only if $\frac{d^2 f}{dx^2} \ge 0$ for all $x$
Stanford EE364a

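25 Second derivative criterion for convexity
Derivative of a function on multiple variables: the Hessian $\nabla^2 f(x)$, with entries $[\nabla^2 f(x)]_{ik} = \frac{\partial^2 f}{\partial x_i \partial x_k}$
f(x) is convex if and only if: $\nabla^2 f(x) \succeq 0$ for all $x$,
i.e. if $\nabla^2 f(x)$ is positive semi-definite: $v^T \nabla^2 f(x) v \ge 0$ for all $v \in \mathbb{R}^n$
Stanford EE364a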

26 Examples Convex if P is positive semi-definite. Stanford EE364a

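27 A non-convex function
$f(x_1, x_2) = x_1 x_2 = \frac{1}{2} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$
The matrix $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ is not positive semi-definite.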
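The second derivative criterion can be checked by hand for this example. The sketch below uses the closed-form eigenvalues of a symmetric 2x2 matrix; the helper names are hypothetical. The Hessian of $x_1 x_2$ is [[0, 1], [1, 0]], and for comparison the Hessian of $x_1^2 + x_2^2$ is [[2, 0], [0, 2]].

```python
import math

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues (smallest first) of the symmetric matrix [[a, b], [b, d]]."""
    center = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return center - radius, center + radius

def is_psd_2x2(a, b, d, tol=1e-12):
    """Positive semi-definite iff the smallest eigenvalue is non-negative."""
    return eigenvalues_sym_2x2(a, b, d)[0] >= -tol

print(eigenvalues_sym_2x2(0, 1, 0), is_psd_2x2(0, 1, 0))  # Hessian of x1 * x2
print(eigenvalues_sym_2x2(2, 0, 2), is_psd_2x2(2, 0, 2))  # Hessian of x1^2 + x2^2

# A direct counterexample for x1 * x2: v = (1, -1) gives v^T H v = -2 < 0.
v = (1, -1)
print(2 * v[0] * v[1])
```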

28 Practical ways to establish convexity of a function Verify the definition. Verify that the second derivative is always positive semidefinite. Show that the function can be obtained from simple convex functions by operations that maintain convexity.

29 Some operations that maintain convexity Stanford EE364a

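30 The MOSAiCS objective is concave in β
$\beta = [\beta_0 \; \beta_M \; \beta_{GC}]^T$, $F_j = [1 \; M_j \; GC_j]^T$
$\mu(F_j, \beta) = \exp(\beta^T F_j) = \exp(\beta_0 + \beta_M M_j + \beta_{GC} GC_j)$
$\log P(N \mid \beta) = \log \prod_j P(N_j \mid \beta) = \sum_j \log P(N_j \mid \beta)$
$= \sum_j \log \left[ \binom{N_j + a - 1}{N_j} \left( 1 - \frac{a}{\mu(F_j, \beta)} \right)^{a} \left( \frac{a}{\mu(F_j, \beta)} \right)^{N_j} \right]$
$= \sum_j \left[ \log \binom{N_j + a - 1}{N_j} + a \log \left( 1 - \frac{a}{\mu(F_j, \beta)} \right) + N_j \log \frac{a}{\mu(F_j, \beta)} \right]$
$\log(1 - 1/\exp(x))$ is concave in $x$.
$N_j \log \mu(F_j, \beta) = N_j \beta^T F_j$ (affine in $\beta$)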
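A numerical spot check of concavity along one line segment in β, as a sketch only. It assumes the negative binomial pmf from slide 13 with r = a and p = a/µ_j plugged in directly (which requires µ_j > a); the bin data, the value of a, and the two β vectors are made up, and none of this is the MOSAiCS package's own code.

```python
import math

def mu(beta, F):
    """mu(F_j, beta) = exp(beta^T F_j)."""
    return math.exp(sum(b * f for b, f in zip(beta, F)))

def log_likelihood(beta, bins, a):
    """Sum over bins of log NegBin(N_j | r = a, p = a / mu_j); needs mu_j > a."""
    total = 0.0
    for n, F in bins:
        m = mu(beta, F)
        log_coef = math.lgamma(n + a) - math.lgamma(n + 1) - math.lgamma(a)
        total += log_coef + a * math.log(1 - a / m) + n * math.log(a / m)
    return total

# Hypothetical bins: (count N_j, F_j = [1, mappability M_j, GC content GC_j]).
bins = [(0, (1.0, 0.2, 0.35)), (3, (1.0, 0.9, 0.50)),
        (1, (1.0, 0.6, 0.42)), (7, (1.0, 1.0, 0.61))]
a = 0.5
beta_1, beta_2 = (0.5, 1.0, 0.2), (2.0, 0.1, 1.5)
beta_mid = tuple((x + y) / 2 for x, y in zip(beta_1, beta_2))

# Concavity implies the midpoint value is at least the average of the endpoints.
gap = log_likelihood(beta_mid, bins, a) - 0.5 * (
    log_likelihood(beta_1, bins, a) + log_likelihood(beta_2, bins, a))
print(f"midpoint minus chord: {gap:.6f}")
```

A non-negative gap on one segment does not prove concavity; it only fails to contradict the analytic argument on the slide.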


33 What if your function is not convex?
Is there a monotonic transform that makes it convex?
Example: $\prod_i x_i$ is neither convex nor concave, but $\log \prod_i x_i = \sum_i \log x_i$ is concave.
Next best thing: split the function into convex parts.
Example: $f(x_1, x_2) = x_1 x_2$ is convex in either $x_1$ or $x_2$ but not both at once. Optimize each in turn. Example: EM.


35 Optimizing convex objectives There are general convex optimization software packages. Even when these are too slow, convex functions usually admit fast optimization specific to your problem. More on convex optimization next class.

36 How well does it work?

37 MOSAiCS enrichment model
Non-bound positions: $\mathrm{NegBin}(a, a/\mu_j)$
Bound positions: $\mathrm{NegBin}(a, a/\mu_j) + \mathrm{NegBin}(b, c)$
Model variables: Bound?, Mappability, GC content, Tag counts

38 MOSAiCS peak calls are more enriched for motifs than competing method

39 MOSAiCS calls many more peaks in low mappability and GC regions than other methods

40 Today's class
- Accounting for GC and mappability bias in peak calling
- Convex functions
- More ChIP-seq peak calling considerations:
  - Handling sequencing fragment sizes
  - Choosing a p-value threshold
  - Control experiments
- Other functional genomics assays

41 The ChIP-seq protocol enriches for a particular fragment size Chromatin DNA fragments ChIP, sonication size selection sequence fragment length

42 Translate reads to the inferred center of the sequencing fragment
[Figure: correlation between strands as a function of strand shift]
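One way to infer the fragment length is to correlate plus-strand and minus-strand read-start counts at a range of strand shifts and take the shift with the highest correlation. The sketch below does this on synthetic data. All names, the fragment start positions, and the convention that a minus-strand read starts exactly one fragment length downstream of its plus-strand mate are simplifying assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def strand_cross_correlation(plus, minus, max_shift):
    """Correlation between plus[i] and minus[i + s] for s = 0..max_shift."""
    n = len(plus)
    return [pearson(plus[: n - s], minus[s:]) for s in range(max_shift + 1)]

def estimate_fragment_length(plus, minus, max_shift):
    """Strand shift with the highest cross-correlation."""
    cc = strand_cross_correlation(plus, minus, max_shift)
    return max(range(len(cc)), key=cc.__getitem__)

# Synthetic data: every fragment yields one read start on each strand.
GENOME_LEN, FRAG_LEN = 1200, 150
fragment_starts = [100, 100, 103, 250, 251, 400, 405, 405, 640, 800]
plus, minus = [0] * GENOME_LEN, [0] * GENOME_LEN
for s in fragment_starts:
    plus[s] += 1
    minus[s + FRAG_LEN] += 1

estimated = estimate_fragment_length(plus, minus, max_shift=300)
print("estimated fragment length:", estimated)
# Reads would then be moved toward the fragment center: plus-strand reads by
# +estimated // 2 and minus-strand reads by -estimated // 2.
```

On real data the correlation curve is noisier, so the maximum is usually searched over shifts well above the read length.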

43 The phantom peak results from mappability islands
ChIP-seq measure of quality: relative strand correlation (RSC)
[Figure labels: unmappable, mappable, read length]

44 Today's class
- Accounting for GC and mappability bias in peak calling
- Convex functions
- More ChIP-seq peak calling considerations:
  - Handling sequencing fragment sizes
  - Control experiments
  - Choosing a p-value threshold
- Other functional genomics assays

45 ChIP-seq controls
Input: Skip IP
IgG: Use irrelevant antibody
Reasons for controls:
- sonication bias
- CNVs
- sequence composition bias

46 Today's class
- Accounting for GC and mappability bias in peak calling
- Convex functions
- More ChIP-seq peak calling considerations:
  - Handling sequencing fragment sizes
  - Control experiments
  - Choosing a p-value threshold
- Other functional genomics assays

47 Problem: Different background models result in wildly different false discovery-rate estimates


48 Idea: Control reproducibility of peaks between biological replicates
Irreproducible discovery rate (IDR): Expected fraction of peaks that are not reproducible between biological replicates.

49 IDR can handle varying quality levels

50 Today's class
- Accounting for GC and mappability bias in peak calling
- Convex functions
- More ChIP-seq peak calling considerations:
  - Handling sequencing fragment sizes
  - Control experiments
  - Choosing a p-value threshold
- Other functional genomics assays

51 ChIP-exo has better spatial resolution than ChIP-seq

52 DamID measures TF binding through a fusion protein
Dam+TF fusion protein
Measure methylation at GATCs
DamID vs. ChIP-seq:
- DamID can be easier: ChIP requires a (specific) antibody; DamID requires a fusion protein
- DamID can't query post-translational modifications (histone mods)
- ChIP has better spatial resolution
- ChIP is limited by cross-linking bias
- DamID is limited by GATC content and Dam reactivity
- ChIP has better temporal resolution: Dam acts over ~24 hours

53 DNase-seq and ATAC-seq measure DNA accessibility

54 High-depth DNase-seq (DNase-DGF) measures TF binding

55 Paired-end DNase and ATAC-seq measure nucleosome architecture

56 Administrivia
- Homework 6 due tomorrow.
- Homework 7 is up online. Due Friday.
- Next week: Broad (non-"peak-y") functional genomics assays; chromatin architecture.
- Please write 1-minute responses.
