Discovering Multiple Levels of Regulatory Networks
IAS EXTENDED WORKSHOP ON GENOMES, CELLS, AND MATHEMATICS
Hong Kong, July 25, 2018
Gary D. Stormo, Department of Genetics
Outline of the talk
1. Transcriptional Regulation
   Specificity of Transcription Factors (TFs)
      Models and methods of determination
      Relationship to epigenetics
      Uses: network modeling and interpreting genomic variants
   Cooperativity between TFs
      Measurements, effects on specificity
      Making cis-regulatory elements/modules
2. Post-Transcriptional Regulation
   Opportunities for regulation
   Mechanisms
   Feedback loops
Structure of EGR1-DNA Complex Specificity logo of EGR1-DNA interaction
Representing TF Specificity with a Position Weight Matrix (PWM) Model (aka: Weight Matrix, PSSM)
Position:    1    2    3    4    5    6
A:          -8   10   -1    2    1   -8
C:         -10   -9   -3   -2   -1  -12
G:          -7   -9   -1   -1   -4   -9
T:          10   -6    9    0   -1   11
PWM Model: Score = -24 for the window C T A T A A in the sequence a C T A T A A t g (upper case marks the six positions aligned with the matrix above)
PWM Model: Score = 43 for the window T A T A A T in the sequence a c T A T A A T g t
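The two scores on these slides can be reproduced by summing one matrix entry per position. The following is a minimal sketch, assuming the six-column matrix shown on the slides; the function names `score_site`, `scan`, and `best_site` are illustrative and not from the talk.

```python
# PWM from the slides: one list per base, one entry per position (6 positions).
PWM = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}
WIDTH = 6


def score_site(site):
    """Sum the weight of the observed base at each of the six positions."""
    site = site.upper()
    if len(site) != WIDTH:
        raise ValueError("site must be exactly %d bases long" % WIDTH)
    return sum(PWM[base][pos] for pos, base in enumerate(site))


def scan(sequence):
    """Score every window of a longer sequence: (offset, window, score)."""
    seq = sequence.upper()
    return [
        (i, seq[i:i + WIDTH], score_site(seq[i:i + WIDTH]))
        for i in range(len(seq) - WIDTH + 1)
    ]


def best_site(sequence):
    """Return the highest-scoring window of the sequence."""
    return max(scan(sequence), key=lambda hit: hit[2])


print(score_site("CTATAA"))     # window from the first slide
print(score_site("TATAAT"))     # window from the second slide
print(best_site("acTATAATgt"))  # best window in the second slide's sequence
```

Shifting the window by one base moves the score from -24 to 43, which is why a scan over all offsets is the usual way to apply such a matrix.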
$\mathrm{Score}(S_i \mid W) = W \cdot S_i$
PWM is a linear model:
  $S_i$ encodes the sequence (which base occurs at each position)
  $W$ weights those encoded features to provide the score
  Easy to add more features if they are necessary
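Written out, the linear model is a sum over positions and bases. The block below is a sketch of that expansion, assuming a one-hot encoding in which $S_i(b,j) = 1$ when base $b$ occurs at position $j$ of sequence $S_i$ and $0$ otherwise; the indices $b$, $j$ and the width $L$ are notation introduced here. The worked line uses the slide's matrix and the site TATAAT.

```latex
\mathrm{Score}(S_i \mid W) = W \cdot S_i
  = \sum_{j=1}^{L} \sum_{b \in \{A,C,G,T\}} W(b,j)\, S_i(b,j)

% Worked example for S_i = TATAAT (L = 6):
W(T,1) + W(A,2) + W(T,3) + W(A,4) + W(A,5) + W(T,6)
  = 10 + 10 + 9 + 2 + 1 + 11 = 43
```

Adding features (for example, dinucleotide terms) only lengthens the vectors $W$ and $S_i$; the score stays a dot product.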
George Box: All models are wrong. Some models are useful.
Parameter estimation
Various methods for determining parameters:
  Discriminant learning
  Probabilistic modeling (i.e. log-odds)
  Basis of most motif discovery algorithms
  Regression on quantitative data
  Binding energy models
Stormo (2013) Quantitative Biology 1:115-130
Probabilistic modeling based on known sites
Known sites -> counts $N(b,i)$ -> PFM (PPM, PWM) $F(b,i)$ -> PWM (PSSM) $W(b,i)$
$W(b,i) = \log\left[ F(b,i) / p(b) \right]$
$I(i) = \sum_b F(b,i)\, W(b,i)$
Motif discovery by finding sites with max $I$
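The chain from counts to frequencies to log-odds weights to information content can be sketched in a few lines. This is an illustration under stated assumptions: the logarithm is taken base 2 (the slide only says "log"), the background p(b) defaults to 0.25 per base, and a pseudocount is added to avoid log(0). None of those three choices comes from the slide, and the toy counts are made up.

```python
import math

BASES = "ACGT"


def pwm_from_counts(counts, background=None, pseudocount=1.0):
    """Turn a count matrix N(b,i) into F(b,i), W(b,i) and per-position I(i).

    Assumptions (not from the slide): log base 2, uniform background by
    default, and a pseudocount added to every cell.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    width = len(counts["A"])
    freqs = {b: [] for b in BASES}
    weights = {b: [] for b in BASES}
    info = []
    for i in range(width):
        total = sum(counts[b][i] + pseudocount for b in BASES)
        column_info = 0.0
        for b in BASES:
            f = (counts[b][i] + pseudocount) / total   # F(b,i)
            w = math.log2(f / background[b])           # W(b,i)
            freqs[b].append(f)
            weights[b].append(w)
            column_info += f * w                       # I(i) = sum_b F * W
        info.append(column_info)
    return freqs, weights, info


# Made-up counts: position 0 is strongly A, position 1 is uninformative.
toy_counts = {"A": [8, 2], "C": [0, 2], "G": [0, 2], "T": [0, 2]}
freqs, weights, info = pwm_from_counts(toy_counts)
print(info)
```

A motif-discovery search in this framing would compare candidate alignments by the sum of `info` over positions and keep the alignment that maximizes it.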
Classic Logo (from Tom Schneider): Height of column at each position is Information Content Each base in proportion to its frequency
Binding probabilities depend on the protein concentration. Positions are normalized independently, leading to apparent non-independence and mis-ordering of probabilities. Biophysical (energy) models are preferred.
Measuring Specificity (vs. Affinity)
Modeling Specificity from high-throughput methods Specificity Modeling Stormo and Zhao, Nature Reviews Genetics, 2010
Diverse sets: >100 TFs ~20 TFs ~240 TFs Weirauch et al >1000 TFs
Uses Expectation Maximization (EM) to simultaneously infer the binding site on each sequence and the parameters of the model (PWM). Outperforms all other algorithms on in vitro data; comparable on in vivo data (ChIP-seq).
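HT-SELEX (SELEX-Seq)
$\frac{P(S_i \mid b)}{P(S_i)} \propto \frac{1}{1 + e^{E_i - \mu}}$
Compared to reference sequence with $E = 0$:
$\frac{P(S_i \mid b)}{P(S_i)} = \frac{P(S_{ref} \mid b)}{P(S_{ref})} \cdot \frac{1 + e^{-\mu}}{1 + e^{E_i - \mu}}$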
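The dependence on protein concentration enters through $\mu$. The sketch below evaluates the two expressions for a few values of $\mu$, assuming $E_i$ and $\mu$ are dimensionless (in units of kT); the function names are illustrative. It shows that enrichment relative to the reference approaches $e^{-E_i}$ when $\mu$ is very negative (low protein) and saturates toward 1 when $\mu$ is large (high protein).

```python
import math


def binding_probability(energy, mu):
    """Probability that a site with energy E_i is bound: 1 / (1 + e^(E_i - mu))."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def enrichment_vs_reference(energy, mu):
    """Enrichment of S_i relative to a reference site with E = 0:
    (1 + e^(-mu)) / (1 + e^(E_i - mu))."""
    return (1.0 + math.exp(-mu)) / (1.0 + math.exp(energy - mu))


# Illustrative values only: one site with E_i = 2 at three protein levels.
for mu in (-10.0, 0.0, 10.0):
    print(mu, binding_probability(2.0, mu), enrichment_vs_reference(2.0, mu))
```

In the saturating regime the weaker site looks almost as enriched as the reference, which is one reason energy-based fitting needs the $\mu$ term rather than a plain log-ratio of counts.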
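Spec-seq (specificity by sequencing)
$\frac{P(S_i \mid b)}{P(S_i \mid u)} = e^{\mu - E_i}$
Compared to reference sequence with $E = 0$:
$\frac{P(S_{ref} \mid b)\, P(S_i \mid u)}{P(S_{ref} \mid u)\, P(S_i \mid b)} = e^{E_i}$
$\ln \frac{P(S_{ref} \mid b)\, P(S_i \mid u)}{P(S_{ref} \mid u)\, P(S_i \mid b)} = E_i$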
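Spec-seq: Specificity by sequencing
$P + S_i \rightleftharpoons P \cdot S_i$
$K_A(S_i) = \frac{[P \cdot S_i]}{[P]\,[S_i]}$
$K_A(S_1) : K_A(S_2) : \dots : K_A(S_n) = \frac{[P \cdot S_1]}{[S_1]} : \frac{[P \cdot S_2]}{[S_2]} : \dots : \frac{[P \cdot S_n]}{[S_n]}$
Zuo and Stormo, Genetics, 2014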
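Because the free protein concentration [P] is shared by every sequence in the reaction, it cancels in the ratios, so relative affinities follow from bound and unbound amounts alone. The following is a minimal sketch, assuming sequencing read counts in the bound and unbound fractions are proportional to $[P \cdot S_i]$ and $[S_i]$; the counts and function names are made up for illustration, and every count is assumed to be nonzero.

```python
import math


def relative_affinities(bound, unbound, reference):
    """K_A(S_i) / K_A(S_ref) from bound/unbound ratios; [P] cancels."""
    ratios = {seq: bound[seq] / unbound[seq] for seq in bound}
    ref_ratio = ratios[reference]
    return {seq: ratio / ref_ratio for seq, ratio in ratios.items()}


def relative_energies(bound, unbound, reference):
    """E_i = -ln(relative K_A), so the reference sequence has E = 0."""
    rel = relative_affinities(bound, unbound, reference)
    return {seq: -math.log(k) for seq, k in rel.items()}


# Made-up read counts for three sequences.
bound = {"TATAAT": 800, "TATAAA": 200, "CATAAT": 50}
unbound = {"TATAAT": 100, "TATAAA": 100, "CATAAT": 100}

rel = relative_affinities(bound, unbound, reference="TATAAT")
energies = relative_energies(bound, unbound, reference="TATAAT")
print(rel)
print(energies)
```

Real data would also need handling of zero counts and sequencing noise, which this sketch leaves out.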
Can easily measure effects of methylation: M = mC; W = mC on the opposite strand. Zuo et al, Science Advances, 2017
How well do binding models predict regulatory sites (and networks) in cells?
  In bacteria, pretty well
  In eukaryotes, quite poorly
What information is missing?
  Only a small fraction of the genome is accessible (available for interactions with TFs)
    With that added information, prediction is much improved
  DNA methylation can inhibit or promote binding
    Not sequence alone but epigenetic marks
  Cooperativity between TFs can be very important
    Can also lead to latent specificity
DNase hypersensitivity + a catalog of PWMs can give rise to a GRN. Neph et al., Cell, 2012.
Weinhold et al, Nat Gen 2014
Measuring Cooperativity
Previous method: determine cooperativity factors of Sox-Oct binding using fluorescently labeled DNA targets.
  Oct4 with 11 different Sox TFs
  ~10 different sequences, each in a separate gel lane
Coop-seq for combinatorial binding can get all of the important parameters in one experiment, including cooperativity
Calculating cooperativity: [equation: $\omega_i$ for sequence $S_i$, expressed as a ratio of binding constants $K(S_i)$]
  $\omega_i = 1$: no cooperativity.
  $\omega_i < 1$: anti-cooperativity (if $\omega_i = 0$, the two proteins are mutually exclusive).
  $\omega_i > 1$: binding of the second protein is facilitated by binding of the first.
Stormo, Zuo, Chang, Briefings in Functional Genomics, 2015
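The three cases for the cooperativity factor can be made concrete with a small calculation. This sketch assumes the common four-species definition $\omega = [S \cdot A \cdot B]\,[S] \,/\, ([S \cdot A]\,[S \cdot B])$, which may differ in notation from the expression on the slide, and assumes that measured amounts of the unbound, singly bound, and doubly bound DNA are proportional to those concentrations. The numbers and function names are made up for illustration.

```python
def cooperativity_factor(unbound, bound_a, bound_b, bound_both):
    """omega = [S.A.B][S] / ([S.A][S.B]) for one sequence (assumed definition)."""
    return (bound_both * unbound) / (bound_a * bound_b)


def classify(omega, tol=0.0):
    """Map omega onto the cases listed on the slide.

    tol is a hypothetical tolerance around 1; real data would need an
    error model to decide what counts as "no cooperativity".
    """
    if omega == 0:
        return "mutually exclusive"
    if abs(omega - 1.0) <= tol:
        return "no cooperativity"
    if omega < 1.0:
        return "anti-cooperativity"
    return "cooperative"


# Made-up amounts for one sequence: unbound, A only, B only, both.
omega = cooperativity_factor(unbound=100, bound_a=50, bound_b=40, bound_both=200)
print(omega, classify(omega))
```

Under this definition the protein concentrations cancel, so $\omega$ can be compared across many sequences measured in the same reaction.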
Quantitative profiling of selective Sox/POU pairing on hundreds of sequences in parallel by Coop-seq (NAR, 2017) Chang et al, collaboration with Ralf Jauch lab
Protein pairs tested in China
  Sox2 with POU proteins: Oct4, Brn4, Oct6, Brn2
  Oct4 with Sox proteins: Sox2, Sox17, Sox17EK, Sox5, Sox15, Sox18
[Figure: Sox family-Oct4 cooperativity energy. y-axis: -Omega energy; x-axis: spacer; series: Sox5-Oct4, Sox15-Oct4, Sox17-Oct4, Sox17EK-Oct4, Sox18-Oct4, Sox2-Oct4]
Hu et al. J Mol Biol. 2017 Sites that bind Pax6 and Sox2 cooperatively prefer a non-consensus sequence for the Pax6 site Corresponds to sites observed in vivo with ChIP-seq Regulates neuronal developmental genes.
DNA-dependent formation of transcription factor pairs alters their binding specificity. Jolma, et al, Nature, 2015 Surveyed ~9400 pairs of TFs Found 315 (~3%) with significant co-motifs Many have alterations to consensus sites for TFs
Summary of Part 1
  Good methods/models exist for TF specificity and cooperativity, including epigenetic effects
  Combined with additional data (e.g. accessibility, chromatin architecture, expression changes), these can lead to network inference and causal modeling
Open problems
  More data needed (comprehensive list of motifs; which TFs are cooperative; effects of epigenetics)
  How are epigenetic marks established? Causal, cooperative, feedback?
  Small differences in energy can lead to larger than expected differences in regulation. Kinetic proofreading? Other mechanisms?
Overall Process of Gene Expression
DNA -> transcription -> RNA -> translation -> Protein
Regulation can happen at each step
"Gene regulatory networks" often refers only to transcriptional regulation, which misses much (>50%?) of the total network
Overall Process of Gene Expression
DNA -> transcription -> RNA -> translation -> Protein
Pre-transcriptional regulation:
  Chromatin modifications
  Epigenetics
  Nuclear architecture
Transcriptional initiation regulation:
  TF binding: activators/repressors
  Cis-regulatory modules
  Looping, insulators
  Signal transduction
  TF modifications
Result: on/off (up/down)
Overall Process of Gene Expression
DNA -> transcription -> RNA -> translation -> Protein
Post-transcriptional regulation:
  Modulation of translation (on/off; up/down)
  Splicing/alternative splicing
  Modulation of termination and stability
  Localization
  RNA editing
  Frame-shifting
Mechanisms of mRNA regulation:
  RNA-RNA interactions (miRNAs, lncRNAs, ...)
  Protein-RNA interactions (many examples, such as splicing factors, translational inhibitors, editing enzymes)
  RNA self-regulation (e.g. riboswitches)
Often involve mRNA secondary structures; this makes motif discovery much harder
A compendium of RNA-binding motifs for decoding gene regulation. Debashish Ray et al., Nature 499, 172-177 (11 July 2013). "Here we report a systematic analysis of the RNA motifs recognized by RNA-binding proteins, encompassing 205 distinct genes from 24 diverse eukaryotes."
Prokaryotic riboswitch motifs
  Widely exist in prokaryotic genomes.
  Unique RNA structures.
  Their structures are conserved across species.
http://www.yale.edu/breaker
Prokaryotic mRNA regulatory motifs
Example of riboswitch regulation mechanism:
  Case 1: Metabolite is limited. Transcription is completed.
  Case 2: Metabolite is abundant. Attenuator: transcription is terminated.
  [Diagram labels: 1, 2, 3, 4, UUUUU, AUG, ORF, Attenuator]
http://www.yale.edu/breake
Examples of post-transcriptional autoregulation
  Ribosomal proteins. Primary target: rRNA; secondary: own mRNA
  tRNA synthetases. Primary target: tRNA; secondary: own mRNA
  Translation initiation and release factors
  Some splicing factors
  In polycistronic mRNAs, translational coupling
Russell Betney, Eric de Silva, Jawahar Krishnan, et al. RNA 2010 16: 655-663
[Figure: Gene X. Labels: Activity (high, low); Expression (high, low)]
Summary of the post-transcriptional regulation part
  Many opportunities (steps) for regulation
  Structural motifs are harder to find than sequence motifs
  Phylogenetic conservation can be a key to finding structural motifs
  Autoregulation can provide efficient control of expression levels
  Autoregulation is easy to evolve: an mRNA evolves to bind its own protein and be inactivated when sufficient protein exists
Strategy to identify autoregulated genes
  Use collection of yeast strains with GFP fused to protein (native expression of the protein)
  Make inducible plasmids with CRE-less versions of candidate genes
  [Diagram labels: GFP fusion protein; GFP; Genes; P3; mCherry; CRE-less]
Upon induction, the mCherry version is made.
  Without autoregulation, both proteins accumulate in cells
  With autoregulation, the mCherry protein turns off expression of the GFP protein
  [Diagram labels: GFP fusion protein; GFP; Genes; P3; mCherry; mCherry fusion protein; CRE-less]
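The logic of the assay can be illustrated with a toy steady-state model. Everything in this sketch is an assumption made for illustration and none of it is from the talk: the native GFP-tagged protein G is produced at rate beta / (1 + (G + M) / K) when the gene is autoregulated (M is the induced mCherry-tagged copy, K a repression constant), at constant rate beta when it is not, and is removed at rate gamma * G.

```python
import math


def gfp_steady_state(beta, gamma, mcherry, k_rep=None):
    """Steady-state level of the GFP fusion in a toy model (assumed, not from the talk).

    k_rep=None means no autoregulation: production is constant, so G = beta / gamma.
    Otherwise solve beta / (1 + (G + M) / K) = gamma * G for the positive root.
    """
    if k_rep is None:
        return beta / gamma
    a = gamma / k_rep
    b = gamma * (1.0 + mcherry / k_rep)
    return (-b + math.sqrt(b * b + 4.0 * a * beta)) / (2.0 * a)


# Illustrative parameters: compare GFP before and after mCherry induction.
for label, k_rep in (("no autoregulation", None), ("autoregulation", 1.0)):
    before = gfp_steady_state(beta=10.0, gamma=1.0, mcherry=0.0, k_rep=k_rep)
    after = gfp_steady_state(beta=10.0, gamma=1.0, mcherry=100.0, k_rep=k_rep)
    print(label, round(before, 3), round(after, 3))
```

In this toy model the green signal is unchanged by induction for a non-autoregulated gene and drops sharply for an autoregulated one, which is the readout the red and green channels are meant to distinguish.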
Positive control, known to be autoregulated: PDC1 (Pyruvate Decarboxylase Isozyme) [panels: red channel, green channel]
Negative: RPS28A [panels: red channel, green channel]
New positive: RPL1b [panels: red channel, green channel]
Acknowledgements Stormo Lab Kenny Chang Manishi Pandey Shuxiang Ruan Zheng Zuo David Granas Basab Roy Jonathan Cher Collaborators Josh Swamidass (BME Wash U) Ralf Jauch (Guangzhou Institutes; now at Univ of Hong Kong) Funding: NIH