1. Challenges
The development of a fast, dynamic and efficient technique to identify the nearest neighbor/topically/contextually similar (i.e., similar to the given text or text collection) DBpedia extended abstracts and deploying them for calculation of the semantic relatedness is the major research challenge. The major points are:
- The presence of a large number of DBpedia Extended Abstracts (with continuously evolving nature) creates the requirements of a fast and dynamic method.
- Extracting nearest neighbor / contextually similar DBpedia abstracts for the given text/text-collection by using cosine similarity (or other similarity measures) makes the task a very time consuming task.
- Indexing dynamically extracted documents for the calculation of corpus based semantics is again a very time consuming task (as for each text/text-collection we may have different set of documents).
2. Some Justifications
The following contains some justifications behind (1) the use of DBpedia extended abstracts, (2) use of nearest neighbor/topically/contextually similar (i.e., similar to the given text or text collection) DBpedia extended abstracts and (3) the use of Normalized Pointwise Mutual Information Score to calculate the corpus based semantics.
- Use of DBpedia extended corpus: Actually, DBpedia extended abstracts are the extract summaries of the corresponding Wikipedia articles, thus, it contains only topically related information (i.e., topically related to the corresponding Wikipedia title) and hence, relatively free from noisy entries.
- Why using nearest neighbor/topically/contextually similar (i.e., similar to the given text or text collection) DBpedia extended abstracts?: Actually, the use of nearest neighbor/topically/contextually similar, DBpedia extended abstracts not only improve the speed of the calculation (i.e., here we are using a fraction of DBpedia extended abstracts, instead of using the entire DBpedia extended abstracts), but also improve the quality of the result. Kaur and Hornof, (2005); Kang, Zhou and Zhang (2014), has also indicated that the presence of the domain specific corpus can effectively improve the quality of corpus based semantic relatedness.
- Use of NPMI: Bouma, G., (2009), successfully used the NPMI score in calculation of co-location strength of word pairs. The value of NPMI lies in the range of [-1, 1]. Liu et al. (2009) also reported the relatively better performance of some of the variants of PMI based measures with Wikipedia corpus (i.e, complete Wikipedia document collection). This is the main reason behind the selection of NPMI score for calculation of semantic and topical relatedness.
Uses: We use this technique in entire thesis for the calculation of semantic relatedness between any two word pairs.
3. Approach
This approach is based on our work (Kumar et al. 2015). To achieve this goal, we categorize the DBpedia extended abstracts in tight clusters. To extract the contextually similar DBpedia extended abstracts for the given candidate document, we use Wikipedia anchor text communities. Here, each Wikipedia anchor text community contains set of DBpedia extended abstracts, whose title match with any anchor text in the given community. Next, we index the documents in each of the identified anchor text community by using LUCENE[1] package. We use these indexed sets in the calculation of the normalized pointwise mutual information score of bigrams. The following contains the necessary steps to calculate the normalized pointwise mutual information score of bigrams.
- Creating Wikipedia anchor text community: In the community detection scheme, we deploy the well-organized anchor text link structure of Wikipedia. Most of the Wikipedia anchor texts have a corresponding descriptive article, which in turn contains other anchor texts (related to the context of the topic) in its body text. We (Kumar et al. 2010) use this link structure in the preparation of the graph. In this scheme, we consider every anchor text as a node of the graph. Finally, we apply the edge betweenness strategy, as applied in (Newman and Girvan, 2004) to calculate the anchor text community. [Source code for community detection can be downloaded from this link]
- Creating document sets: We uniquely map DBpedia extended abstracts w.r.t., each of the identified anchor text community. For this, we use the title based matching. We also check the category information for any category-title related disambiguation. Next, we apply the LUCENE based indexing for each of the identified document categories. We use these indexes in the calculation of normalized pointwise mutual information score of bigrams.
- Calculating NPMI: We extract all Wikipedia anchor text communities, which show high cosine similarity with the given candidate document. Next, we select the related DBpedia extended abstract’s category and merge the indexes of all extracted document categories by using LUCENE (if the given candidate document matches with more than one Wikipedia anchor text communities). As, we use pre-calculated indexes of DBpedia extended abstract collections related to each of the identified Wikipedia anchor text community, so merging the required number of identified indexes is a lightweight process. Now, the final step is to calculate the strict semantic relatedness score of bigrams. Here, strict semantic relatedness mean semantic relatedness score calculated by using a text window of size 20 words (1-2 sentences approx.). The equation for normalized pointwise mutual information can be given as:
4. Analysis
As, NPMI gives the values in the range of [-1, +1]. The word-pairs having NPMI score greater than zero, generally show tight and observable semantic strength between them. After a lot of observations, we have found that word pair’s having NPMI score greater than zero can be considered as semantically important and useful for the calculation. We use this as an important constraint in the entire calculation.
NOTE: Here, we can normalize the NPMI score from [-1, +1] to [-10, +10] or any other range. This can be adjusted according to the requirements of the semantic weight in the entire calculation.
5. Reference
- Bouma, G. (2009). Normalized (Pointwise) Mutual Information in Collocation Extraction. In: Proceedings of The International Conference of the German Society for Computational Linguistics and Language Technology, pp. 31–40.
- Kaur, I., & Hornof, A. J. (2005). A comparison of LSA, WordNet and PMI-IR for predicting user click behavior. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 51-60). ACM.
- Kang, Y., Zhou, L., & Zhang, D. (2014). An integrated method for hierarchy construction of domain-specific terms. In Computer and Information Science (ICIS), 2014 IEEE/ACIS 13th International Conference on (pp. 485-490). IEEE.
- Niraj Kumar, Venkata Vinay Babu Vemula, Kannan Srinathan, Vasudeva Varma: Exploiting N-gram Importance and Wikipedia based Additional Knowledge for Improvements in GAAC based Document Clustering. KDIR 2010: 182-187.
- Liu, Z., Li, P., Zheng, Y., & Sun, M. (2009). Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1 (pp. 257-266). Association for Computational Linguistics.
- Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
- Kumar N., Srinathan K. & Varma V. (2015). ‘Unsupervised Deep Semantic and Logical Analysis for Identification of Solution Posts from Community Answers’, Vol-7, Issue-3, International Journal of Information and Decision Sciences[2].