Feature Based Gene Summary Extraction with Re-ranking


Samir Gupta
Computer and Information Sciences
University of Delaware
Newark, DE, USA

Abstract

Due to the vast and growing volume of biomedical literature, searching medical databases for information about genes is becoming cumbersome. Searching PubMed with a gene name as the query returns thousands of results, many of them irrelevant. The Gene Ontology (GO) and UniProtKB databases indicate relevant terms associated with a gene, but these are not enough for a quick understanding of the gene's different properties. Moreover, they are manually written and curated, which is both labor-intensive and time-consuming. Automatically generated gene summaries would help biologists get an overall picture of a gene quickly. In this paper we adapt generic feature-based extractive summarization techniques and augment them with biomedical domain-specific features. We also use the concept of novelty to reduce redundancy in the extracted summary. Our results show that including domain-specific features and removing redundancy improve the content of the summary significantly.

1 Introduction

Biomedical databases like PubMed (McEntyre and Lipman, 2001) and BioMed Central are expanding rapidly and contain millions of articles. Because of this vast amount of information, biologists spend a large amount of time searching and reading articles to find relevant information. One common information need of life scientists is gene-specific information: a quick overview of the different properties, functions and other aspects of a gene would be very useful. Efforts have been made to construct databases such as Entrez Gene (Maglott et al., 2005), Gene Ontology and UniProtKB, which provide important information about a gene. But these databases are manually created and require curation and regular updates, which is labor-intensive. This necessitates the development of an automatic gene summary extractor.
In this paper we describe an approach that expands on the generic features used in summary extraction by including domain-specific features. Feature-based summary extraction techniques were explored for generic domains by Edmundson (1969) and Kupiec et al. (1995). We augment these features with domain-specific features such as the presence of the gene name and certain biological cue phrases. We also use a variant of Maximal Marginal Relevance (Carbonell and Goldstein, 1998) to reduce redundancy in the final summary. We use different modules of eGIFT (Tudor et al., 2010), a gene information mining tool, to extract the set of abstracts relating to a gene, compute descriptive words, and extract gene name variations.

The major contributions of this paper are: (1) applying the generic features used by Edmundson (1969) to the biomedical domain; (2) augmenting the generic features with biomedical domain-specific features; and (3) using terms provided by the Gene Ontology and UniProtKB databases to re-rank sentences based on information novelty.

2 Approach

In this section we discuss the details of the gene summarization system. The input to this system is a gene identifier, the same as the one used in the eGIFT system (Tudor et al., 2010). Given a gene identifier, we first extract a set of abstracts from Medline. The retrieval of relevant abstracts for a given gene is done using the eGRAB (Extractor of Gene-Relevant ABstracts) module of eGIFT. The eGRAB module considers all gene names, synonyms, and aliases to query the Medline database and return a set of abstracts for the given gene.

Each sentence in the set of abstracts is scored based on a number of features. A subset of these features, namely term frequency, sentence position, presence of title words and sentence length, are similar to the ones used by Edmundson (1969) and Kupiec et al. (1995). In addition to these, we use features such as the presence of the gene name and certain biological phrases to adapt the generic techniques to the biological domain. After several iterations on some test genes, we manually assigned a weight to each feature; the final score of a sentence is computed from its weighted feature scores. The top-ranking sentences are then selected for inclusion in the summary.

We have also explored the notion of information novelty to reduce redundancy across the selected sentences. This approach is based on the Maximal Marginal Relevance (MMR) model of Carbonell and Goldstein (1998); the difference lies in how novelty is computed and how it is used. Based on the MMR model we re-rank a subset of the sentences returned by the feature-based system. In the next two subsections we discuss the features and the re-ranking system in detail.

2.1 Computing Sentence Importance

The set of abstracts returned by the eGRAB module is preprocessed and segmented into sentences. A set of features is used to score the importance of each sentence. Based on the weighted score, the sentences are ranked and the top-ranking sentences are included in the summary. The first four features are those used by generic extractive summarizers. We have added two new features that are more specific to the biomedical domain. We call the system that uses only the first four features System A.
System A will help us understand whether, and how well, generic approaches adapt to a particular domain. We hope to see a significant improvement when the last two domain-specific features are added (System B).

Sentence Position Feature
This feature encodes positional information about a sentence in an abstract. The sentence position can be one of the following: title, first, last or middle sentence. As argued in the early work on extractive summarization by Edmundson (1969), first and last sentences are typically more important than other sentences. Thus higher scores are assigned to first and last sentence positions than to middle or title sentence positions.

Title Words Feature
This feature assigns a score between 0 and 1 to a sentence based on the presence of title words in the sentence. The title of the abstract is decomposed into words, and the words are stemmed. These words are regarded as descriptive words, and each sentence is scored based on the frequency of occurrence of title words in it. The score is divided by the length of the sentence and then normalized.

Sentence Length Feature
Kupiec et al. (1995) used sentence length as one of the features for summarization. In their implementation the feature was true if the sentence length was above a certain threshold, thereby giving less importance to very short sentences. In our system we use both a low and a high threshold to assign low scores to very short and very long sentences. Very long sentences, along with some relevant information, contain unnecessary information (noise) and, we argue, should also be given a low score. This helps us select shorter sentences in which noise is minimal and which are thus more informative to the user. It also helps in the second phase, the re-ranking step, by allowing more relevant and novel sentences to be selected.
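A minimal sketch of the three generic features described so far, together with a weighted combination of feature scores, is given below. The numeric position scores, the length thresholds `low` and `high`, and any weights passed to `weighted_score` are hypothetical placeholders: the paper states that weights were tuned manually but does not list them. Stemming and normalization across sentences are omitted for brevity.

```python
import re

# Hypothetical position scores: the text only states that first and last
# sentences receive higher scores than middle and title sentences.
POSITION_SCORES = {"first": 1.0, "last": 1.0, "middle": 0.5, "title": 0.5}


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def position_feature(position):
    """Score a sentence by its position label in the abstract."""
    return POSITION_SCORES[position]


def title_feature(sentence, title):
    """Fraction of sentence tokens that also occur in the title (no stemming here)."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    title_words = set(tokenize(title))
    hits = sum(1 for word in words if word in title_words)
    return hits / len(words)


def length_feature(sentence, low=8, high=40):
    """Return 1.0 inside the [low, high] token range, 0.0 otherwise.

    The thresholds are assumptions chosen only for illustration.
    """
    n = len(tokenize(sentence))
    return 1.0 if low <= n <= high else 0.0


def weighted_score(features, weights):
    """Combine named feature scores with manually chosen weights."""
    return sum(weights[name] * value for name, value in features.items())
```

Sentences would then be sorted by `weighted_score` in descending order; the frequency-based feature would enter the same `features` dictionary as a fourth generic entry.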
Frequency Based Feature
This feature assigns a score between 0 and 1 to the sentence, indicating the presence of descriptive words in it. Most early work on summarization used term frequency and its variations to identify the most descriptive words of a document. Term Frequency*Inverse Document Frequency (TF*IDF) has been used in Information Retrieval (Salton and Buckley, 1988; Jones, 1972) as a measure for identifying descriptive words in a document. We use eGIFT's (Tudor et al., 2010) iTerm scores, a variant of TF*IDF weights, to extract descriptive words from the set of abstracts relating to a gene. eGIFT automatically computes informative terms (iTerms) and associates them with a gene based on frequency information from the set of abstracts returned by the eGRAB module, which is called the About Set for the gene. It assigns scores to unigrams and bigrams, excluding stop words, as well as to a set of biomedical terms extracted from different knowledge bases, including Entrez Gene, Gene Ontology, NCBI Taxonomy, UMLS and MeSH, that match in the text. The terms are converted to base form for scoring purposes. Each term is assigned a score depending on its frequency in the About Set contrasted with its frequency in the Background Set. The Background Set is the set of all abstracts in the biomedical database. For each term $t$, a score $s(t)$ is assigned as follows:

$$ s(t) = \left( \frac{df_a(t)}{N_a} - \frac{df_b(t)}{N_b} \right) \ln\left( \frac{N_b}{df_b(t)} \right) $$

where $df_a(t)$ and $df_b(t)$ are the number of abstracts containing term $t$ in the About Set for the gene and in the Background Set, respectively, and $N_a$ and $N_b$ are the total numbers of abstracts in these two sets. The difference between the normalized document frequencies, $\frac{df_a(t)}{N_a} - \frac{df_b(t)}{N_b}$, rewards terms occurring more frequently in the About Set, and $\ln\left(\frac{N_b}{df_b(t)}\right)$ penalizes terms that are very frequent across all documents. Note that eGIFT considers document frequency rather than term frequency within a specific document. This is because iTerms describe a set of abstracts rather than a single document, so document frequency better reflects the relevance of a term to a gene. Given the score for each term, a set of top-ranking informative terms, or iTerms, is computed for the gene. We score each sentence in the About Set of a gene by considering the occurrences of the iTerms and their scores. The final score is divided by the number of words in the sentence and normalized.

Gene Feature
The abstracts returned by the eGRAB module are related to the gene whose summary is to be extracted. This feature indicates the presence of the gene name in the sentence.
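Returning to the frequency-based feature, the term score s(t) and the per-sentence score can be sketched as follows. This is an illustrative reading of the formula, not eGIFT's implementation: the function names are hypothetical, terms are assumed to be already extracted and converted to base form, and the final normalization across sentences is left out.

```python
import math


def iterm_score(df_about, n_about, df_background, n_background):
    """s(t) = (df_a/N_a - df_b/N_b) * ln(N_b / df_b).

    df_about / df_background: number of abstracts containing the term in the
    About Set / Background Set; n_about / n_background: sizes of the two sets.
    """
    if df_background <= 0:
        raise ValueError("the term must occur in at least one background abstract")
    contrast = df_about / n_about - df_background / n_background
    rarity = math.log(n_background / df_background)
    return contrast * rarity


def frequency_feature(sentence_terms, iterm_scores):
    """Sum the scores of the iTerms in a sentence, divided by its length.

    `iterm_scores` maps each top-ranking iTerm to its s(t); other terms
    contribute nothing.
    """
    if not sentence_terms:
        return 0.0
    total = sum(iterm_scores.get(term, 0.0) for term in sentence_terms)
    return total / len(sentence_terms)
```

A term that is as frequent in the Background Set as in the About Set gets a score of zero, as does a term that appears in every background abstract, which matches the reward and penalty roles of the two factors.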
The sentences may or may not contain the gene name, so its presence can be used as an indicator of a sentence's importance. This feature assigns a score of 1 to sentences that contain the gene name and 0 otherwise, which boosts the score of sentences mentioning the gene. A gene is referred to in the biomedical literature by several names and abbreviations. For example, SMAD2 has variations such as Smad family member 2, smad-2, madr2 and xsmad2. eGIFT provides APIs that, given a gene identifier, return all the variations of the gene name. It uses the official gene names provided by Entrez Gene (Maglott et al., 2005), synonyms, and word sense disambiguation techniques to return the different variations.

Biological Cue Phrase Feature
This feature assigns a score between 0 and 1 depending on the presence of certain phrases in the sentence. The approach is based on the observation that certain phrases in a document indicate sentence importance. Authors of technical documents follow particular writing styles, using certain phrases to indicate important relations between different entities in the text. These writing styles are domain dependent, and identifying them requires a study of the documents. We argue that some phrases are more important than others as indicators of sentence importance because they convey very strong relations between the entities in the text. Entrez Gene (Maglott et al., 2005) contains manually created summaries for some genes. We carried out a preliminary study of the human-written summaries from Entrez in order to understand what types of information are typically conveyed in a summary. We identified several aspects that are covered in almost every summary:

ATTRIBUTE: The different properties or attributes associated with a gene.
FAMILY: The gene family the gene belongs to.
FUNCTION: The various biological functions or processes the gene is involved in.
DOMAIN: The domains the gene contains.
INTERACTION: The interactions of this gene with other genes or proteins.
DISEASE: Diseases caused by this gene.

An aspect may span multiple sentences, and several aspects may be mentioned in a single sentence. For the purposes of this paper we explored the first three aspects. In the following paragraphs we examine these three aspects in some detail and discuss the biological phrases associated with each.

ATTRIBUTE: A gene typically has some well-known properties that need to be captured in a summary. These are typically isa relations between a gene and a noun phrase. For example, sentence fragments like ".. groucho proteins are transcriptional corepressors.." and ".. groucho homolog tle-4, a corepressor.." both indicate that the gene groucho is a corepressor. Thus for this aspect we look for phrases like "is a", appositives and relative clauses. The pattern should be immediately preceded by the gene in question for this feature to be considered.

FAMILY: Almost all genes belong to a family of genes that share certain common characteristics. Including the family information helps biologists ascertain certain important attributes of the gene. For example, sentence fragments like "The Drosophila Groucho (Gro) protein is the defining member of a family of metazoan corepressors.." and "Groucho (Gro) is the founding member of a family of transcriptional co-repressor.." indicate that Groucho belongs to a family of genes that are corepressors. For this aspect we look for phrases like "belongs to" and "member of". As with the ATTRIBUTE patterns, the pattern should be immediately preceded by the gene in question for this feature to be considered.

FUNCTION: Most of the sentences in the human-written summaries contain this aspect. These sentences indicate the different biological processes and functions that the gene is involved in, is required for, and so on. They are typically mentioned together with other aspects, for example following an INTERACTION aspect. Identifying the different functions of a gene is very important, and sentences that mention such relations should be included in a summary. From the following sentence fragments we can easily determine that groucho is related to biological functions such as notch signaling, segmentation and neural development: "Groucho is a transcriptional repressor implicated in notch signaling..", ".. Groucho.. involved in neural development and segmentation in drosophila", "Groucho is required for Drosophila neurogenesis, segmentation.." and "that Gro/TLE proteins play a role in the repression of target genes". We look for the highlighted phrases in the above sentences when assigning this bio-feature.
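The gene feature and the cue-phrase matching for the three aspects above can be sketched as follows. Several choices here are assumptions made for illustration: the FUNCTION phrase list is read off the example fragments, appositives and relative clauses are not handled, the per-aspect gap limits are invented, and a 1/(1 + gap) decay stands in for whatever distance weighting the system actually uses.

```python
import re

# aspect -> (cue phrases, maximum number of tokens allowed between the gene
# mention and the phrase). Phrase lists and gap limits are hypothetical.
CUES = {
    "ATTRIBUTE": (["is a", "is an"], 2),
    "FAMILY": (["belongs to", "member of"], 4),
    "FUNCTION": (["involved in", "required for", "implicated in", "role in"], 10),
}


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def occurrences(tokens, phrase_tokens):
    """Start indices at which phrase_tokens occurs as a contiguous run."""
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]


def gene_spans(tokens, gene_names):
    """(start, end) token spans of every gene-name variation in the sentence."""
    spans = []
    for name in gene_names:
        name_tokens = tokenize(name)
        if name_tokens:
            spans += [(i, i + len(name_tokens)) for i in occurrences(tokens, name_tokens)]
    return spans


def gene_feature(sentence, gene_names):
    """1.0 if any variation of the gene name occurs in the sentence, else 0.0."""
    return 1.0 if gene_spans(tokenize(sentence), gene_names) else 0.0


def cue_feature(sentence, gene_names, cues=CUES):
    """Average over aspects of the best distance-weighted cue-phrase match."""
    tokens = tokenize(sentence)
    spans = gene_spans(tokens, gene_names)
    total = 0.0
    for phrases, max_gap in cues.values():
        best = 0.0
        for phrase in phrases:
            for start in occurrences(tokens, tokenize(phrase)):
                for _, gene_end in spans:
                    gap = start - gene_end  # tokens between gene and phrase
                    if 0 <= gap <= max_gap:
                        best = max(best, 1.0 / (1 + gap))
        total += best
    return total / len(cues)
```

A sentence that does not mention the gene scores zero on both features, since every cue match is measured relative to a gene mention.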
For the FUNCTION aspect the gene need not immediately precede the pattern, but the further the gene is from the phrase, the lower the score. Each sentence in the About Set for a gene is searched for the patterns described above. The sentence must also contain the gene name. The lexical distance between the gene mention and the pattern or phrase is taken into account when assigning the score for this feature. The distance should be small for the FAMILY and ATTRIBUTE aspects and may be longer for the FUNCTION aspect. The scores for the individual bio-features in a sentence are added and the result is normalized.

2.2 Re-Ranking Based on Novelty

A gene summary should contain as much diverse information as possible, reducing redundancy while maintaining maximal relevance to the gene. Because the number of abstracts in the About Set for a gene is very large, sentences extracted on the basis of feature scores alone may contain a large amount of redundant information. Redundancy removal is therefore necessary: redundant sentences should not be selected when producing the final summary. The main intuition behind our method comes from Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998): a sentence that is similar to a sentence already selected should be penalized. A weighted combination of the feature score and a novelty score is used so that the selected sentences are both maximally diverse and maximally relevant to the gene. Algorithm 1 provides the pseudo-code for the re-ranking system.

Our re-ranking system takes as input the set of ranked sentences returned by the feature-based method discussed in Section 2.1. For every selected sentence a set of important terms is computed. These include GO terms and UniProtKB keywords. The Gene Ontology (GO) project is a major bioinformatics initiative that aims to standardize the representation of gene and gene product attributes across species and databases.
The project provides a controlled vocabulary of terms for describing gene product characteristics, together with gene product annotation data from GO Consortium members. The ontology covers three domains: cellular component, molecular function and biological process. The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. UniProtKB gene entries are tagged with keywords relating to the gene.

Instead of computing and minimizing the similarity between two sentences, as in MMR, we compute a novelty score for each sentence. When a sentence is selected, its GO terms and UniProt keywords are added to the set selectedTerms. The novelty score of a sentence is assigned based on the number of new GO terms and UniProt keywords contained in the sentence. The final score is a weighted combination of the feature score and the novelty score, controlled by a user-tunable parameter λ. The sentence with the highest final score is added to the set of re-ranked sentences and deleted from the original ranking. Finally, its GO terms and UniProt keywords are added to the set selectedTerms. A λ value closer to 1 yields a relevance-based ranking, while a λ value closer to 0 yields a novelty-based ranking. When the initial ranked set of sentences is empty, the algorithm stops and yields the new ranking of sentences.

3 Results

In this section we present the results of our evaluation. We used six genes for evaluation purposes. The Entrez Gene summaries for these genes were used as the gold set. We measured the number of phrases in the extracted sentences that matched phrases in the gold summary. While matching phrases we also considered the relation between the phrase and the gene: a phrase in an extracted summary sentence was said to match if it matched a phrase in the gold set and had the same relation with the gene as in the gold set. For example, for the gene KAT2A a gold summary sentence is: "KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator.."

Table 1: Feature-Based Ranking: Summary Phrase Matches

Gene      System A   System B   Improvement
SMAD2                           %
VPS                             %
BRI1                            %
BAG3      0          3          NA
LTBP                            %
KAT2A                           %

Algorithm 1: Novelty-Based Re-Ranking

Input: set of ranked sentences D; tuning parameter λ
Output: set of re-ranked sentences R

selectedTerms ← empty; R ← empty
while D is not empty do
    foreach sentence s in D do
        fScore_s ← feature score for s
        extract GO terms for s
        extract UniProt keywords for s
        add extracted terms to currTerms_s
        newTerms_s ← diff(currTerms_s, selectedTerms)
        nScore_s ← novelScore(newTerms_s)
        score_s ← λ · fScore_s + (1 − λ) · nScore_s
    end
    determine the sentence s for which score_s is maximal
    delete s from D; add s to R
    add newTerms_s to selectedTerms
end
return R
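The loop of Algorithm 1 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the system's implementation: GO terms and UniProt keywords are taken as already extracted for each sentence, and novelScore is assumed to be the number of new terms divided by the largest such count among the remaining candidates, since the paper does not specify how the count is normalized.

```python
def rerank(candidates, lam):
    """Greedy novelty-based re-ranking.

    candidates: list of (sentence, feature_score, terms) tuples, where `terms`
        is the set of GO terms and UniProt keywords found in the sentence.
    lam: trade-off parameter; 1.0 keeps the feature-based order, 0.0 ranks
        purely by novelty.
    Returns the sentences in re-ranked order.
    """
    remaining = list(candidates)
    selected_terms = set()
    reranked = []
    while remaining:
        # Terms each remaining sentence would contribute beyond those selected.
        new_terms = [set(terms) - selected_terms for _, _, terms in remaining]
        # Assumed normalization: scale counts by the best candidate this round.
        max_new = max(len(terms) for terms in new_terms) or 1
        scores = [
            lam * fscore + (1 - lam) * len(terms) / max_new
            for (_, fscore, _), terms in zip(remaining, new_terms)
        ]
        # Ties are broken in favour of the earlier (higher-ranked) sentence.
        best = max(range(len(remaining)), key=scores.__getitem__)
        sentence, _, _ = remaining.pop(best)
        reranked.append(sentence)
        selected_terms |= new_terms[best]
    return reranked
```

With two sentences that share the same single term and a third that introduces a new one, `lam = 1.0` keeps the feature-based order, while lower values move the third sentence ahead of the redundant one.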
For KAT2A, the re-ranking system with λ = 0 extracted the following sentence: "histone acetyltransferases ( hats ) such as gcn5 play a role in transcriptional activation." The phrase "transcriptional activation" is marked as matched because it has the same relation with the gene, i.e. the same function.

Figure 1 shows the matching phrases for the gene SMAD2 in the summary extracted by the feature-based system. System A refers to the output generated using only generic features, while System B refers to the output generated after adding the bio-domain specific features. The matched phrases are shown as bold text. Figure 2 shows the matching phrases in the summaries extracted by the re-ranking system with λ = 0, 0.3 and 0.7. A λ value closer to 0 gives more importance to information novelty.

Table 1 compares System A and System B with respect to the number of phrase matches each system achieved. The last column indicates the improvement of System B over System A, i.e. the improvement after adding the bio-domain specific features. The results indicate that adding domain-specific features increases the number of phrase matches and thus improves the summary content.

Table 2 shows the number of matched phrases for the re-ranking system over different values of λ. The first column, with λ = 1, is the same as System B in Table 1. In the evaluation of the re-ranking system we used only the set of ranked sentences returned by System B. The results indicate that a λ value closer to 0 yields the best results for most of the genes. For example, for the gene BRI1 the set of summary sentences "BRI1 ligand is brassinolide which binds at the extracellular domain. Binding results in phosphorylation of the kinase domain which activates the BRI1 protein leading to BR responses." is accurately captured by the re-ranking system (with λ = 0) in the sentence: "brassinosteroids ( brs ) bind to the extracellular domain of the receptor kinase bri1 to activate a signal transduction cascade that regulates nuclear gene expression and plant development." A similar example occurs for the gene SMAD2 with the extracted sentence "activated tbetari phosphorylates smad2, which then heterodimerizes with smad4, translocates into the nucleus, and subsequently effects gene transcription.", which accurately captures a set of the summary sentences (see Figure 1).

Table 2: Novelty-Based Re-Ranking: Summary Phrase Matches

Gene      λ = 1   λ = 0.9   λ = 0.7   λ = 0.3   λ = 0   Max Improvement over System B
SMAD2                                                   %
VPS                                                     %
BRI1                                                    %
BAG3                                                    %
LTBP                                                    %
KAT2A                                                   %

4 Conclusion

We combine generic features for computing sentence importance with biomedical domain-specific features such as the presence of the gene name and biological cue phrases. We also use GO terms and UniProt keywords as a novelty measure to re-rank sentences and remove redundant information. Our evaluation suggests that a system augmented with biomedical features and redundancy removal extracts much more informative summaries.

One problem with these extractive approaches is the presence of noise alongside the relevant information in the extracted sentences. For example, consider an extracted summary sentence for SMAD2: "second, the role of smad 2, an intracellular mediator of activin and tgf-beta, in oocyte maturation was investigated." Only the highlighted fragment is relevant, and there is no need to include the entire sentence. In future work, we hope that the biological relation patterns discussed in Section 2.1 will help us identify only the relevant portions of a sentence. These patterns will help us create an intermediate representation of the set of sentences, such as smad2 [isa] intracellular mediator OF(activin). Instead of just extracting representative sentences from the About Set, these relations will help us generate phrases and move toward abstractive summarization. We could combine different relations in a single sentence depending on certain causal links, such as an INTERACTION aspect followed by a FUNCTION aspect.
References

[Carbonell and Goldstein 1998] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

[Edmundson 1969] H. P. Edmundson. 1969. New methods in automatic extracting. J. ACM, 16(2), April.

[Jones 1972] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1).

[Kupiec et al. 1995] Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '95, pages 68-73, New York, NY, USA. ACM.

[Maglott et al. 2005] Donna Maglott, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. 2005. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 33(suppl 1):D54-D58.

[McEntyre and Lipman 2001] Johanna McEntyre and David Lipman. 2001. PubMed: bridging the information gap. Canadian Medical Association Journal, 164(9).

[Radev et al. 2002] Dragomir R. Radev, Eduard Hovy, and Kathleen McKeown. 2002. Introduction to the special issue on summarization. Comput. Linguist., 28(4), December.

[Salton and Buckley 1988] Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5).

[Tudor et al. 2010] Catalina O. Tudor, Carl J. Schmidt, and K. Vijay-Shanker. 2010. eGIFT: Mining gene information from the literature. BMC Bioinformatics, 11(1):418.

Gene: SMAD2

Entrez Summary: The protein encoded by this gene belongs to the SMAD, a family of proteins similar to the gene products of the Drosophila gene 'mothers against decapentaplegic' (Mad) and the C. elegans gene Sma. SMAD proteins are signal transducers and transcriptional modulators that mediate multiple signaling pathways. This protein mediates the signal of the transforming growth factor (TGF)-beta, and thus regulates multiple cellular processes, such as cell proliferation, apoptosis, and differentiation. This protein is recruited to the TGF-beta receptors through its interaction with the SMAD anchor for receptor activation (SARA) protein. In response to TGF-beta signal, this protein is phosphorylated by the TGF-beta receptors. The phosphorylation induces the dissociation of this protein with SARA and the association with the family member SMAD4. The association with SMAD4 is important for the translocation of this protein into the nucleus, where it binds to target promoters and forms a transcription repressor complex with other cofactors. This protein can also be phosphorylated by activin type 1 receptor kinase, and mediates the signal from the activin.

System A (without Bio-Features)
smad2 overexpression suppressed osteocalcin mrna expression in ros17/2.8 cells.
tgfbeta signaling is initiated when the type i receptor phosphorylates the mad-related protein, smad2, on c-terminal serine residues.
mad-related genes on chromosome 18q21.1 are altered infrequently in escc.
activation of transforming growth factor-beta ( tgf-beta ) receptors triggers phosphorylation of smad2 and smad3.
cells that lack smad2 may escape from tgf-beta-mediated growth inhibition and promote cancer progression.
phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
furthermore, we observed a strong correlation between sustained smad2 phosphorylation and resistance to tgf-beta1-mediated growth inhibition.

System B (with Bio-Features)
phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
we report that smad2, a transcription factor activated by tgfbeta, mediates tgf-beta induction of enos in endothelial cells.
identification of smad2, a human mad-related protein in the transforming growth factor beta signaling pathway.
conclusions : the results suggest that mutation of smad2 does not play a key role in human stomach carcinogenesis.
second, the role of smad 2, an intracellular mediator of activin and tgf-beta, in oocyte maturation was investigated.
thus, heteromeric complex formation of smad2 with smad4 is required for nuclear translocation of smad4.
evidence that smad2 is a tumor suppressor implicated in the control of cellular invasion.

Figure 1: Feature-Based Ranked Summaries for SMAD2 for System A and B

λ = 0
1. phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
2. second, the role of smad 2, an intracellular mediator of activin and tgf-beta, in oocyte maturation was investigated.
3. smad2 and smad3 are signalling proteins that are involved in mediating the transcriptional regulation of target genes downstream of transforming growth factor-beta and activin receptors.
4. activated tbetari phosphorylates smad2, which then heterodimerizes with smad4, translocates into the nucleus, and subsequently effects gene transcription.
5. identification of smad2, a human mad-related protein in the transforming growth factor beta signaling pathway.
6. xmad2, a recently identified tgf-beta signal transducer, forms a complex with the transcription factor in an activin-dependent fashion to generate an activated are-binding complex.
7. ligation of the t cell receptor complex results in phosphorylation of smad2 in t lymphocytes.

λ = 0.3
1. phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
2. second, the role of smad 2, an intracellular mediator of activin and tgf-beta, in oocyte maturation was investigated.
3. smad2 and smad3 are signalling proteins that are involved in mediating the transcriptional regulation of target genes downstream of transforming growth factor-beta and activin receptors.
4. identification of smad2, a human mad-related protein in the transforming growth factor beta signaling pathway.
5. thus, heteromeric complex formation of smad2 with smad4 is required for nuclear translocation of smad4.
6. ubiquitination of smad2 is a consequence of its accumulation in the nucleus.
7. xmad2, a recently identified tgf-beta signal transducer, forms a complex with the transcription factor in an activin-dependent fashion to generate an activated are-binding complex.

λ = 0.7
1. phosphorylation-dependent activation of the transcription factors smad2 and smad3 plays an important role in tgfbeta-dependent
2. second, the role of smad 2, an intracellular mediator of activin and tgf-beta, in oocyte maturation was investigated.
3. identification of smad2, a human mad-related protein in the transforming growth factor beta signaling pathway.
4. thus, heteromeric complex formation of smad2 with smad4 is required for nuclear translocation of smad4.
5. we report that smad2, a transcription factor activated by tgf-beta, mediates tgf-beta induction of enos in endothelial cells.
6. conclusions : the results suggest that mutation of smad2 does not play a key role in human stomach carcinogenesis.
7. evidence that smad2 is a tumor suppressor implicated in the control of cellular invasion.

Figure 2: Re-Ranked Summaries for SMAD2 with λ = 0, 0.3, 0.7
