Large-scale gene family analysis of 76 Arthropods
i5k webinar / September 5, 28
Gregg Thomas @greggwcthomas
Indiana University
The genomic basis of Arthropod diversity
https://www.biorxiv.org/content/early/28/8/4/382945
Why and how did we do it?
https://fivethirtyeight.com/features/the-bugs-of-the-world-could-squish-us-all/
Arthropods exhibit vast phenotypic diversity
Aedes aegypti (Diptera), Bombus terrestris (Hymenoptera), Latrodectus hesperus (Araneae)
Whole-genome sequencing reveals vast molecular differences
Aedes aegypti (Diptera), Bombus terrestris (Hymenoptera), Latrodectus hesperus (Araneae)
Nucleotide substitutions: AAGGCCA, AAGTCCA, AAGTCCA
Gene copy number variation: 4, 4, 2
Protein domain arrangements
How can we begin to understand what changes are interesting or important?
Phylogenies act as a framework for asking these types of questions
How can we begin to understand what changes are interesting or important?
[Figure: gene copy numbers 4, 4, 2 at the tips of a phylogeny, with ancestral counts of 4 and a -2 change on one branch]
Which molecular changes lead to interesting phenotypic differences?
1. Sequence and annotate many genomes
2. Determine orthology of sequences
3. Determine phylogeny of species
4. Map orthology onto phylogeny to reconstruct the evolutionary history of all loci
Contributors: Stephen Richards, Monica Poelchau, Rob Waterhouse, Evgeny Zdobnov, Panagiotis Ioannidis, Elias Dohmen, Karl Glastad, Yiyuan Li
Today's topics
1. Determining the Arthropod phylogeny
2. Reconstructing ancestral gene counts
3. Using the i5k gene family web site
What are the details of these 6 steps in the context of the i5k project?
1) Predict orthogroups
OrthoDB (Rob Waterhouse, Evgeny Zdobnov, Panagiotis Ioannidis)
38,95 ortho-groups across 76 arthropod species
(See i5k webinar from Feb., 27: http://i5k.github.io/webinar)
2) Select single-copy orthogroups
How many single-copy orthologs in our 38,95 groups?
1 family is single-copy in all but one species (2 copies in Plutella xylostella)
EOG8DFS3J: single-copy in all but one species (2 copies in Plutella xylostella)
Problem: Hemiptera not monophyletic
Problem: Lepidoptera and Trichoptera nested within Diptera
How can we turn our species-rich data into sequence-rich data?
Construct a backbone tree among orders rather than species
"Single-copy" orthologs are now those that:
1. Are single-copy in ALL the orders represented by a single species.
2. Have at least ONE species that is single-copy in each of the 6 multi-species orders.
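The two relaxed criteria above can be sketched as a filter over per-species copy numbers. This is a minimal illustration, not the pipeline used in the project: the function name, the input layout (a dict of copy numbers per species and a dict mapping species to order), and the toy species labels are all assumptions made for the example.

```python
from collections import defaultdict


def is_backbone_single_copy(counts, species_to_order):
    """Return True if one orthogroup passes both relaxed criteria.

    counts: dict mapping species -> gene copy number in this orthogroup
            (hypothetical layout; missing species count as 0 copies).
    species_to_order: dict mapping every species -> its order.
    """
    copies_by_order = defaultdict(list)
    for species, order in species_to_order.items():
        copies_by_order[order].append(counts.get(species, 0))

    for copies in copies_by_order.values():
        if len(copies) == 1:
            # Criterion 1: an order represented by a single species
            # must have exactly one copy in that species.
            if copies[0] != 1:
                return False
        elif 1 not in copies:
            # Criterion 2: a multi-species order needs at least one
            # species with exactly one copy.
            return False
    return True


# Toy data with made-up species labels.
toy_orders = {
    "spider_a": "Araneae",
    "spider_b": "Araneae",
    "mayfly": "Ephemeroptera",
}
passes = is_backbone_single_copy(
    {"spider_a": 2, "spider_b": 1, "mayfly": 1}, toy_orders
)
```

For a multi-species order, any species with exactly one copy can then stand in for the order when the backbone alignment is built.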
Construct a backbone tree among orders rather than species

Phylum       # Orders   # single-copy orthologs
Arthropoda   2          5

Then use single-copy orthologs from the 6 multi-species orders to construct order-level trees

Order         # Species   # single-copy orthologs
Araneae       4           627
Hemiptera     7           253
Hymenoptera   24          22
Coleoptera    6           388
Lepidoptera   5           366
Diptera       4           324
3) Align each group
Two alignment programs:
1. MUSCLE
2. PASTA
4) Infer gene trees
RAxML with the PROTGAMMAJTTF amino acid substitution model
Topologies largely insensitive to substitution model
5) Infer species tree (backbone tree among orders)
Three species tree methods:
1. Average consensus
2. Concatenation
3. ASTRAL
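Of the three methods, concatenation is the simplest to illustrate: the per-gene alignments are joined end to end into one supermatrix, and a single tree is inferred from it. The sketch below shows only the joining step, under assumed inputs (each alignment is a dict of taxon name to aligned sequence). The function name and toy sequences are hypothetical, and taxa missing from a gene are padded with gaps.

```python
def concatenate_alignments(alignments, taxa, gap="-"):
    """Join per-gene alignments into one supermatrix.

    alignments: list of dicts, each mapping taxon -> aligned sequence
                (all sequences within one dict must share a length).
    taxa: list of all taxon names to include in the supermatrix.
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon missing from this gene contributes only gaps.
            parts[taxon].append(alignment.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example with made-up taxa and sequences.
toy_supermatrix = concatenate_alignments(
    [{"A": "MK", "B": "MR"}, {"A": "LLV", "C": "LIV"}],
    taxa=["A", "B", "C"],
)
```

Every row of the resulting supermatrix has the same length, which is what tree inference on a concatenated alignment expects.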
The backbone phylogeny
Monophyletic Crustacea??
5) Infer species tree (order-level trees for the 6 multi-species orders)
Three species tree methods:
1. Average consensus
2. Concatenation
3. ASTRAL
The Arthropod phylogeny
Araneae: 627 genes; Hemiptera: 253 genes; Hymenoptera: 22 genes; Coleoptera: 388 genes; Lepidoptera: 366 genes; Diptera: 324 genes
Figure annotations: All methods agree; Disagreement between methods
6) Scale branch lengths with fossil calibrations

Crown group / Node / Min time / Max time
Euarthropoda 75 54 636.
Arachnida 74 432.6 636.
Parasitiformes 72 98.7 54
Mandibulata 67 54 636.
Multicrustacea 64 487 636.
Pterygota 62 322.83 52
Paleoptera 39.9 52
Neoptera 6 39.9 4
Blattodea 2 3.3 4
Eumetabola 6 39.9 4
Condylognatha 58 36.9 4
Hemiptera 57 36.9 4
Holometabola 5 33.7 4
Hymenoptera 25 226.4 4
Aparaglossata 5 33.7 4
Coleoptera 3 28.5 4
Mecopterida 49 27.8 4
Amphiesmenoptera 35 95.3 4
Lepidoptera 34 29.4 4
Diptera 48 24.5 4

Order / Node / Min time / Max time
Hymenoptera HY25 89.9 93.9
Hymenoptera HY3 23 28.4

Use r8s to smooth the tree: penalized likelihood method to correlate rates of evolution among branches
Arthropod Time Tree
LICA 35 mya; Holometabola 3 mya
The branches of the ML tree can be scaled by time to infer substitution rates
Rates are mostly consistent across arthropods
Burst of rapid evolution in ancestral Holometabola
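Scaling an ML branch by time amounts to dividing its length (substitutions per site) by its duration on the time tree. The sketch below assumes the two trees share a topology and that branches are keyed by matching labels. The function name, branch labels, and numbers are invented for illustration and are not values from the study.

```python
def substitution_rates(ml_branch_lengths, branch_durations):
    """Per-branch rate = ML branch length / branch duration.

    ml_branch_lengths: dict branch label -> substitutions per site.
    branch_durations:  dict branch label -> duration in millions of years.
    Returns dict branch label -> substitutions per site per million years.
    """
    rates = {}
    for branch, substitutions in ml_branch_lengths.items():
        duration = branch_durations[branch]
        if duration <= 0:
            raise ValueError(f"branch {branch!r} has non-positive duration")
        rates[branch] = substitutions / duration
    return rates


# Made-up branches: the second is shorter in time but faster per unit time.
toy_rates = substitution_rates(
    {"branch_1": 0.20, "branch_2": 0.15},
    {"branch_1": 100.0, "branch_2": 25.0},
)
```

A burst of rapid evolution shows up as a branch whose rate is well above that of its neighbours, even when its raw ML length is unremarkable.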
Today's topics
1. Determining the Arthropod phylogeny
2. Reconstructing ancestral gene counts
3. Using the i5k gene family web site
Ancestral gene counts inferred with:
1. Maximum likelihood (CAFE) for the 6 multi-species orders
2. Parsimony (Dupliphy) for all nodes
Ancestral gene counts: Example
Tips: observed variables
x_i: hidden variables
Our goal is to infer the states of the internal nodes of the tree
Then we can count changes along each lineage
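To make the idea concrete, here is a small parsimony sketch that infers internal-node counts from tip counts and then reports the total number of gains and losses. It is a generic Sankoff-style dynamic program with an assumed cost of |parent - child| per branch. It is not the CAFE likelihood model and does not reproduce Dupliphy. The tree encoding (nested tuples of tip names) and the function name are assumptions for the example.

```python
INF = float("inf")


def reconstruct_counts(tree, tip_counts):
    """Parsimony reconstruction of gene counts on a rooted tree.

    tree: nested tuples of tip names, e.g. (("A", "B"), "C").
    tip_counts: dict tip name -> observed gene count.
    Returns (assigned, total_change): the inferred count for every node
    and the minimum total number of gains plus losses.
    """
    states = range(max(tip_counts.values()) + 1)
    cost = {}  # node -> list of subtree costs, indexed by state

    def fill(node):
        if isinstance(node, str):
            # Tips are observed: only the observed count has finite cost.
            cost[node] = [0 if s == tip_counts[node] else INF for s in states]
            return
        for child in node:
            fill(child)
        cost[node] = [
            sum(min(cost[child][t] + abs(s - t) for t in states) for child in node)
            for s in states
        ]

    assigned = {}

    def assign(node, parent_state):
        if parent_state is None:
            best = min(states, key=lambda s: cost[node][s])
        else:
            best = min(states, key=lambda s: cost[node][s] + abs(parent_state - s))
        assigned[node] = best  # ties resolve to the smallest count
        if not isinstance(node, str):
            for child in node:
                assign(child, best)

    fill(tree)
    assign(tree, None)
    return assigned, min(cost[tree])


toy_tree = (("A", "B"), "C")
toy_assigned, toy_total = reconstruct_counts(toy_tree, {"A": 4, "B": 4, "C": 2})
```

With tip counts of 4, 4, and 2, the ancestor of A and B is assigned 4 copies and the whole tree needs a total change of 2. The root is ambiguous under parsimony (2, 3, and 4 are equally good), and this sketch arbitrarily picks the smallest.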
With ancestral gene counts we can:
1. Infer rates of gene gain/loss
2. Count gene gains and losses and check for rapid changes on every lineage
3. Estimate gene counts in extinct ancestors
Rates of gene gain/loss between orders are largely consistent
Araneae:.7 Hemiptera:. Hymenoptera:.9 Coleoptera:. Lepidoptera:.4 Diptera:.
Arthropod Time Tree
The branches of the ML tree can be scaled by time to infer lineage-specific gain/loss rates
Rates are mostly consistent across arthropods
Accelerated rates in leafcutter ants
Elias Dohmen
Gene gain and loss rates are correlated with protein domain rearrangements
Neither is correlated with substitution rate
What specific gene family changes are interesting or important?
[Figure: gene copy number variation (4, 4, 2) mapped onto a phylogeny]
Common gene family changes among eusocial insects
Termites; Bees & ants
4 enriched functional terms in BOTH groups
Olfactory reception and odorant binding
German cockroach
Largest number of gene family changes
Most RAPID gene family changes
Major expansion of chemosensory genes
Most protein domain rearrangements
Spider silk and venom gene families (Jessica Garb)
Rapidly expanding gene families within Araneae related to silk or venom
High rate of protein domain emergences, including some related to venom
What did the ancestral insect (LICA) look like?
How can we infer characteristics of the genome of LICA?
[Figure: phylogeny with observed gene counts at the tips and hidden ancestral states x_i at the internal nodes, including LICA]
How many genes were present in the LICA genome?
Ancestral genome sizes are underestimated due to extinctions
LICA estimated: 9,6 genes
LICA corrected: 4,965 genes
How can we infer characteristics of the genome of LICA?
Which families were born during the transition to insects?
47 emergent insect families
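One simple way to think about an "emergent" family is as one that is present in the clade of interest and absent everywhere outside it. The sketch below applies that simplified presence/absence rule to tip counts. It is an assumption made for illustration: the analysis in the talk relies on reconstructed ancestral counts, and the function name, species labels, and family identifiers here are made up.

```python
def emergent_families(family_counts, ingroup, outgroup):
    """List families present in the ingroup and absent from the outgroup.

    family_counts: dict family id -> dict species -> gene count
                   (hypothetical layout; missing species count as 0).
    ingroup:  species inside the clade of interest (e.g. insects).
    outgroup: species outside that clade.
    """
    emergent = []
    for family, counts in family_counts.items():
        in_ingroup = any(counts.get(species, 0) > 0 for species in ingroup)
        in_outgroup = any(counts.get(species, 0) > 0 for species in outgroup)
        if in_ingroup and not in_outgroup:
            emergent.append(family)
    return sorted(emergent)


# Toy data with made-up identifiers.
toy_emergent = emergent_families(
    {
        "fam_wing": {"fly": 2, "bee": 1},
        "fam_shared": {"fly": 1, "spider": 3},
        "fam_spider_only": {"spider": 4},
    },
    ingroup=["fly", "bee"],
    outgroup=["spider"],
)
```

Here only `fam_wing` qualifies. A rule based on tips alone cannot distinguish a family born in the ingroup ancestor from one lost in every sampled outgroup species, which is one reason to work from ancestral reconstructions instead.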
Emergent insect families correspond to insect lifestyle adaptations
Changes in exoskeleton development
  7 chitin and cuticle production families
Ability to sense in a terrestrial environment
  1 visual learning and behavior family
  2 odorant binding families
  5 families involved in neural activity
Unique development
  1 larval behavior family
  4 imaginal disk development families
Flight
  3 wing morphogenesis families
How can we infer characteristics of the genome of LICA?
Which families were born during the transition to Holometabola?
Only emergent Holometabola gene families
Which families were born during the transition to Lepidoptera?
,38 emergent Lepidopteran gene families
Lepidoptera has the most emergent gene families
Today's topics
1. Determining the Arthropod phylogeny
2. Reconstructing ancestral gene counts
3. Using the i5k gene family web site
All data have been made available in our online tool: https://i5k.gitlab.io/arthrofam/
Working search functions temporarily available at: https://cgi.soic.indiana.edu/~grthomas/i5k-web/main.html
Demo
Acknowledgements
Matthew Hahn, Stephen Richards, Rob Waterhouse, Jessica Garb, Elias Dohmen, Ariel Chipman
The i5k community
The Hahn lab + Clara Boothby
i5k website: http://i5k.github.io/
Gene family website: https://i5k.gitlab.io/arthrofam/