Friday, May 29, 2015

A Fast and Clever Method to Extract Nearest-Neighbor / Contextually Similar DBpedia Extended Abstracts for a Given Text/Text Collection and to Calculate the Normalized Pointwise Mutual Information Score

1. Challenges

The main research challenge is to develop a fast, dynamic and efficient technique that identifies the nearest-neighbor (i.e., topically/contextually similar to the given text or text collection) DBpedia extended abstracts and deploys them for the calculation of semantic relatedness. The major points are:

  • The large number of DBpedia extended abstracts, and their continuously evolving nature, requires a fast and dynamic method.
  • Extracting the nearest-neighbor / contextually similar DBpedia abstracts for a given text/text collection using cosine similarity (or other similarity measures) is very time consuming.
  • Indexing dynamically extracted documents for the calculation of corpus-based semantics is also very time consuming, as each text/text collection may yield a different set of documents.

2. Some Justifications

The following are some justifications for (1) the use of DBpedia extended abstracts, (2) the use of nearest-neighbor (i.e., topically/contextually similar to the given text or text collection) DBpedia extended abstracts, and (3) the use of the normalized pointwise mutual information (NPMI) score to calculate corpus-based semantics.

  1. Use of the DBpedia extended corpus: DBpedia extended abstracts are extracted summaries of the corresponding Wikipedia articles; they therefore contain only information topically related to the corresponding Wikipedia title and are relatively free from noisy entries.
  2. Why use nearest-neighbor (i.e., topically/contextually similar to the given text or text collection) DBpedia extended abstracts?: Using only the nearest-neighbor DBpedia extended abstracts not only speeds up the calculation (we use a fraction of the abstracts instead of the entire collection) but also improves the quality of the result. Kaur and Hornof (2005) and Kang, Zhou and Zhang (2014) have also indicated that a domain-specific corpus can effectively improve the quality of corpus-based semantic relatedness.
  3. Use of NPMI: Bouma (2009) successfully used the NPMI score to calculate the collocation strength of word pairs. The value of NPMI lies in the range [-1, 1]. Liu et al. (2009) also reported relatively better performance of some PMI-based variants on the Wikipedia corpus (i.e., the complete Wikipedia document collection). This is the main reason for selecting the NPMI score to calculate semantic and topical relatedness.

Uses: We use this technique throughout the thesis to calculate the semantic relatedness of any word pair.

3. Approach

This approach is based on our work (Kumar et al., 2015). To achieve this goal, we categorize the DBpedia extended abstracts into tight clusters. To extract the contextually similar DBpedia extended abstracts for a given candidate document, we use Wikipedia anchor-text communities. Each Wikipedia anchor-text community contains the set of DBpedia extended abstracts whose titles match an anchor text in that community. Next, we index the documents of each identified anchor-text community using the Lucene[1] package. We use these indexed sets in the calculation of the normalized pointwise mutual information score of bigrams. The following are the necessary steps.

  1. Creating Wikipedia anchor-text communities: In the community detection scheme, we exploit the well-organized anchor-text link structure of Wikipedia. Most Wikipedia anchor texts have a corresponding descriptive article, which in turn contains other anchor texts (related to the context of the topic) in its body text. We (Kumar et al., 2010) use this link structure to prepare a graph in which every anchor text is a node. Finally, we apply the edge-betweenness strategy of Newman and Girvan (2004) to compute the anchor-text communities; a minimal sketch is given after this list. [Source code for community detection can be downloaded from this link]
  2. Creating document sets: We uniquely map DBpedia extended abstracts to each identified anchor-text community. For this, we use title-based matching, and we also check the category information to resolve any category/title disambiguation. Next, we build a Lucene index for each identified document category. We use these indexes in the calculation of the normalized pointwise mutual information score of bigrams.
  3. Calculating NPMI: We extract all Wikipedia anchor-text communities that show high cosine similarity with the given candidate document. Next, we select the related DBpedia extended abstract categories and, if the candidate document matches more than one Wikipedia anchor-text community, merge the indexes of all extracted document categories using Lucene. Since we use pre-calculated indexes of the DBpedia extended abstract collections related to each identified Wikipedia anchor-text community, merging the required indexes is a lightweight process. The final step is to calculate the strict semantic relatedness score of bigrams; here, strict semantic relatedness means a semantic relatedness score calculated using a text window of 20 words (approximately 1-2 sentences). Following Bouma (2009), the normalized pointwise mutual information of a word pair (x, y) is NPMI(x, y) = log( p(x, y) / (p(x) p(y)) ) / ( -log p(x, y) ), where p(x) and p(y) are the probabilities of observing x and y in a text window and p(x, y) is the probability of observing them in the same window. The second sketch below illustrates this window-based calculation.
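The following is a minimal sketch (not the thesis code linked above) of the edge-betweenness community detection step, using the girvan_newman routine from the networkx library on a small, made-up anchor-text graph; the edge list and the anchor texts in it are purely illustrative.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Hypothetical edge list: (anchor_text_a, anchor_text_b) whenever the
# descriptive article of anchor_text_a links to anchor_text_b in its body text.
edges = [("machine learning", "statistics"),
         ("machine learning", "data mining"),
         ("data mining", "statistics"),
         ("statistics", "probability"),
         ("espresso", "coffee"),
         ("coffee", "caffeine"),
         ("espresso", "caffeine"),
         ("data mining", "coffee")]   # a single weak cross-topic link

G = nx.Graph(edges)

# Girvan-Newman repeatedly removes the edge with the highest betweenness
# (Newman and Girvan, 2004); the first split separates the two topical groups.
first_split = next(girvan_newman(G))
for community in first_split:
    print(sorted(community))
```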
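And here is a minimal, self-contained sketch of the window-based NPMI computation itself (pure Python, no Lucene), assuming the retrieved abstracts have already been tokenized; co-occurrence is counted over a sliding window of 20 tokens, matching the strict setting described in step 3.

```python
import math
from collections import Counter
from itertools import combinations

def npmi_scores(documents, window=20):
    """NPMI of word pairs co-occurring within a sliding window of `window` tokens.

    `documents` is a list of token lists (e.g. the nearest-neighbor DBpedia
    extended abstracts retrieved for the candidate text).
    """
    word_counts, pair_counts, n_windows = Counter(), Counter(), 0
    for tokens in documents:
        for start in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[start:start + window])
            n_windows += 1
            word_counts.update(win)
            pair_counts.update(combinations(sorted(win), 2))
    scores = {}
    for (x, y), c_xy in pair_counts.items():
        p_xy = c_xy / n_windows
        p_x, p_y = word_counts[x] / n_windows, word_counts[y] / n_windows
        denom = -math.log(p_xy)
        if denom > 0:                   # p(x, y) == 1 would make NPMI undefined
            scores[(x, y)] = math.log(p_xy / (p_x * p_y)) / denom
    return scores

# Word pairs with a score above zero are the ones retained,
# as argued in the Analysis section below.
```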

4. Analysis

NPMI takes values in the range [-1, +1]. Word pairs with an NPMI score greater than zero generally show a tight, observable semantic association. After many observations, we found that word pairs with an NPMI score greater than zero can be treated as semantically important and useful for the calculation. We use this as an important constraint throughout the calculation.

NOTE: The NPMI score can be rescaled from [-1, +1] to [-10, +10] or any other range, according to the semantic weight required in the overall calculation; a small sketch of this rescaling is given below.
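A tiny sketch of the linear rescaling mentioned in the note, with the target range left as a parameter:

```python
def rescale_npmi(score, new_min=-10.0, new_max=10.0):
    """Linearly map an NPMI score from [-1, +1] to [new_min, new_max]."""
    return new_min + (score + 1.0) * (new_max - new_min) / 2.0

print(rescale_npmi(0.25))   # 2.5 on the [-10, +10] scale
```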

5. Reference


  1. Bouma, G. (2009). Normalized (Pointwise) Mutual Information in Collocation Extraction. In: Proceedings of The International Conference of the German Society for Computational Linguistics and Language Technology, pp. 31–40.
  2. Kaur, I., & Hornof, A. J. (2005). A comparison of LSA, WordNet and PMI-IR for predicting user click behavior. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 51-60). ACM.
  3. Kang, Y., Zhou, L., & Zhang, D. (2014). An integrated method for hierarchy construction of domain-specific terms. In Computer and Information Science (ICIS), 2014 IEEE/ACIS 13th International Conference on (pp. 485-490). IEEE.
  4. Kumar, N., Vemula, V. V. B., Srinathan, K., & Varma, V. (2010). Exploiting N-gram Importance and Wikipedia based Additional Knowledge for Improvements in GAAC based Document Clustering. In KDIR 2010 (pp. 182-187).
  5.  Liu, Z., Li, P., Zheng, Y., & Sun, M. (2009). Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1 (pp. 257-266). Association for Computational Linguistics.
  6. Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
  7. Kumar, N., Srinathan, K., & Varma, V. (2015). Unsupervised Deep Semantic and Logical Analysis for Identification of Solution Posts from Community Answers. International Journal of Information and Decision Sciences, 7(3)[2].




[1]http://lucene.apache.org/
[2] http://www.inderscience.com/info/ingeneral/forthcoming.php?jcode=ijids

Monday, April 22, 2013

Some observations "Multi Class classification with binary classifiers" and " Traditional Multi Class classification"


After experimenting with different datasets and analyzing different algorithms for "multi-class classification with binary classifiers" and "traditional multi-class classification", we arrived at the following points:

  1. Point-1: If you use a large number of features or multiple criteria at once, then as the number of features increases, the quality of the decision process degrades. This happens not only in "multi-criteria decision making" but also in classification systems. However, some results show that direct multi-class classification gives better results than multi-class classification achieved with binary classifiers. This is possible if the dataset contains features with at least a minimal but noticeable margin, which do not lose their importance in the presence of features of other classes. Algorithmic steps that widen this margin between features of different classes also increase the quality of classification.
  2. Point-2: Multi-class classification achieved with binary classifiers is based on mutual comparison (especially in the "one-vs-one" mode, where we compare each class with every other class one by one, and in the error-correcting-code strategy). Thus, the chance that the margin of a feature loses its effect is smaller here than when the features of all classes are present together.
  3. Point-3: The nature of the data therefore becomes more important here. According to my observations, "one-vs-one" classifiers perform slightly better than or on par with the error-correcting-code strategy. However, feature-boosting strategies like those used with direct multi-class classification will also be helpful here.

Thursday, April 11, 2013

Multi Class classification with Binary Classifiers Part-1


In this discussion, I try to explain techniques for "multi-class classification with binary classifiers" in the simplest possible way and finally add some observations (based on my own experience).

The popular methods to achieve multi-class classification with binary classifiers are: (1) the "one-versus-all" method [1], [3], using a winner-takes-all strategy; (2) the "one-versus-one" method, implemented by max-wins voting [4]; and (3) error-correcting codes [2].

(1) "One-versus-all": Generally such methods uses hierarchical division of the output space. The method uses K−1 binary classifiers (represented as node of binary tree, see Figure-1) for a K-class problem.  At each node of the tree, a simple classifier, usually a binary classifier, makes the discrimination between the different child classes  Following a path from the root node to a leaf node leads to a classification of a new pattern. Several methods have been proposed for this hierarchical classification.

Figure-1: A tree for a 5-class problem
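For concreteness, here is a minimal scikit-learn sketch of the flat one-versus-all / winner-takes-all variant (not the hierarchical tree of Figure-1): K binary classifiers, one per class, with the most confident one winning. The iris data is used only as a stand-in for a K-class problem.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary classifier per class: class k versus all the remaining classes.
binary_clfs = [LinearSVC(max_iter=5000).fit(X, (y == k).astype(int))
               for k in classes]

# Winner-takes-all: pick the class whose binary classifier is most confident.
scores = np.column_stack([clf.decision_function(X) for clf in binary_clfs])
y_pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", (y_pred == y).mean())
```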

(2) One-versus-one method: Hastie and Tibshirani [4] proposed a good general strategy called pairwise coupling for combining the posterior probabilities provided by individual binary classifiers in order to do multiclass classification. Since SVMs do not naturally produce posterior probabilities, they suggested a particular way of generating these probabilities from the binary SVM outputs and then used these probabilities, together with pairwise coupling, to do multiclass classification. A minimal sketch of the simpler max-wins voting variant is given below.
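Below is a minimal sketch of the max-wins voting flavor of one-versus-one (not the pairwise probability coupling of Hastie and Tibshirani [4]): one binary classifier per class pair, each casting a vote.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)        # iris labels are 0..K-1, which simplifies voting

# One binary classifier per unordered pair of classes: K*(K-1)/2 in total.
pair_clfs = {(a, b): LinearSVC(max_iter=5000).fit(X[np.isin(y, [a, b])],
                                                  y[np.isin(y, [a, b])])
             for a, b in combinations(classes, 2)}

# Max-wins voting: every pairwise classifier votes for one of its two classes.
votes = np.zeros((len(X), len(classes)), dtype=int)
for clf in pair_clfs.values():
    pred = clf.predict(X)                 # predicted labels index the columns
    votes[np.arange(len(X)), pred] += 1

y_pred = classes[np.argmax(votes, axis=1)]
print("training accuracy:", (y_pred == y).mean())
```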

My observations: I focus on some important issues that can be explored to improve the performance of both techniques for achieving multi-class classification (based on my own experience):

  • Point-1: The major problem with hierarchical methods is that once we assign the classes at the initial stage, we cannot change them at later stages. So misclassification(s) at an early stage will persist until the end of the classification. Such problems may occur with data having weak or messy class boundaries.
  • Point-2: We may face the same problem with probabilistic pairwise merging of binary classes if the system does not support revision of previously merged classes.


(3) Error-correcting codes [2]: Each class is assigned a unique binary string of length n; we refer to these strings as codewords. Then n binary functions are learned, one for each bit position in these binary strings. During training, for an example from class i, the desired outputs of these n binary functions are specified by the codeword for class i. With artificial neural networks, these n functions can be implemented by the n output units of a single network. New values of x are classified by evaluating each of the n binary functions to generate an n-bit string s. This string is then compared to each of the k codewords, and x is assigned to the class whose codeword is closest, according to some distance measure, to the generated string s. A minimal sketch of this scheme is given below.
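A minimal sketch of the codeword scheme described above, with hypothetical 5-bit codewords for a 3-class problem and Hamming-distance decoding; any binary learner could replace the linear SVM.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Hypothetical 5-bit codewords, one row per class (n = 5, k = 3).
codewords = np.array([[0, 0, 1, 1, 0],
                      [0, 1, 0, 1, 1],
                      [1, 0, 0, 0, 1]])

# Learn one binary function per bit position; the target for an example of
# class i is bit j of class i's codeword.
bit_clfs = [LinearSVC(max_iter=5000).fit(X, codewords[y, j])
            for j in range(codewords.shape[1])]

# Predict the n-bit string s for every example, then assign the class whose
# codeword is closest to s in Hamming distance.
s = np.column_stack([clf.predict(X) for clf in bit_clfs])
hamming = (s[:, None, :] != codewords[None, :, :]).sum(axis=2)
y_pred = np.argmin(hamming, axis=1)
print("training accuracy:", (y_pred == y).mean())
```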

My observations: Beyond the robustness issues discussed in Subsection "3.3 Robustness" of [2], I found some issues which can be explored (i.e., which may lead to an improvement in the score):

  • Point-1: If it becomes possible to connect the features used in classification with the codewords of the error-correcting codes, then I think the construction of the codewords will be more realistic, and I hope it will improve the performance.


References

  1. Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
  2. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, Vol. 2 (1995) 263–286.
  3. Shailesh Kumar, Joydeep Ghosh, and Melba M. Crawford. Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5:210–220, 2002.
  4. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. In: Jordan, M.I., Kearns, M.J., Solla, A.S. (eds.): Advances in Neural Information Processing Systems 10. MIT Press (1998).

Tuesday, April 19, 2011

Sentiment Analysis: Current Issues


Abstract. In this text I present a report on current issues in automated sentiment analysis. The report covers (1) problems in the area of sentiment analysis (both solved and unsolved), (2) data sources for sentiment analysis, (3) current techniques and tools, and (4) limitations of these techniques and tools.

1. Introduction to Sentiment Analysis

Sentiment analysis deals with the computational treatment of opinion, sentiment and subjectivity in text. It starts with a small question, “What do other people think?”, and has grown into billions of dollars of commercial activity. After the great success of Web 2.0, sentiment analysis became an in-demand and commercially supported research field.
A Web 2.0 site gives its users the free choice to interact or collaborate with each other in a social-media dialogue, as creators of user-generated content in a virtual community. This has resulted in social-networking sites, blogs, wikis, video-sharing sites, hosted services, web applications, mashups, folksonomies, etc. The huge increase in internet users (see http://www.internetworldstats.com/stats.htm) has in turn increased e-commerce activity.

1.1 Data Sources for Sentiment Analysis
Data used in sentiment analysis generally consists of unstructured text from (1) blog posts, (2) user reviews (about any product), (3) chat records, (4) opinion polls, etc. It may contain noisy symbols, casual language and emoticons. For example, if you search "hungry" with an arbitrary number of u's in the middle (e.g. huuuungry, huuuuuuungry, huuuuuuuuuungry) on Twitter, there will most likely be a non-empty result set.
Dataset: Standard datasets for sentiment analysis can be downloaded from the following sources:
  • Wiki Blog Lists: It contains web links to a large number of well-known English blogs and can be obtained from http://en.wikipedia.org/wiki/List_of_blogs
  • BLOGS06 (Macdonald and Ounis, 2006) collection: It contains a 148 GB crawl of approximately 100,000 blogs and their respective RSS feeds. The collection has been used for 3 consecutive years by the Text REtrieval Conference (TREC). Participants of the conference are given the task of finding documents (i.e. web pages) expressing an opinion about specific entities X, which may be people, companies, films, etc. The results are given to human assessors, who then judge the content of the web pages (i.e. blog posts and comments) and assign each web page a score: “1” if the document contains relevant, factual information about the entity but no expression of opinion, “2” if the document contains an explicit negative opinion towards the entity, and “4” if the document contains an explicit positive opinion towards the entity. The dataset can be found at http://www.trec.nist.gov
  • Multi-Domain Sentiment Dataset (version 2.0) (http://www.cs.jhu.edu/~mdredze/datasets/sentiment/): It contains product reviews taken from Amazon.com across many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews; others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed. It has been used in many recent publications.
1.2 Problem Definition
In this section, I explore the problem statements related to sentiment analysis. I start with problems of a very basic nature and finish with some unsolved problems.
Analyzing sentiment in clear reviews: Such reviews contain either a negative or a positive opinion about a product or topic(s).
Here it is very simple to identify the positive or negative sentiment. For example:
Product Reviews: Inspiron 1525
Title: Where has customer service gone
Review:I have an inspirion 1525-it was not listed in the models to review. DO NOT BUY THIS COMPUTER!!! The LCD has cracked after less than 9 months and Dell refuses to fix it under warranty. They will not tell me why they will not fix it and now after sending it all the way to Ontario to find out why there was lines on my screen- the service depot has returned it to me and the keyboard no longer functions. I can not use it all all now-lines or not!!!
Title: Insprion 1525
Review:--I rec'd my Inspiron 1525 about 1 month ago, and I LOVE it!! It is quicker than my Dell desktop, very portable and I love Windows 7. I opted for the 6 cell battery and am so thankful that I did - I almost wish I would have got the 9 cell. So, if you are looking for an everyday computer - this one is a great deal and a great computer...but I would reccommend upgrading the battery!
RESULT: 1 out of 2(50%) customers would recommend this product to a friend.
Analyzing sentiment in multi-theme documents:
In this type of document, the problem statement does not always remain so clear. It can be split into several different problems, and successful sentiment analysis depends on many issues, including (but not limited to):
  • Sometimes such texts contain multiple sentiments related to two or more issues.
  • Sometimes such documents contain both kinds of sentiment, i.e. negative and positive; here, identifying the more influential one is a major issue.
  • In some cases the problem can be converted into multi-subjective sentiment analysis.


Example: “(1) I bought an iPhone a few days ago. (2) It was such a nice phone. (3) The touch screen was really cool. (4) The voice quality was clear too. (5) Although the battery life was not long, that is ok for me. (6) However, my mother was mad with me as I did not tell her before I bought it. (7) She also thought the phone was too expensive, and wanted me to return it to the shop. … ”
Description: The above text contains seven sentences and both kinds of sentiment: positive sentiment w.r.t. the buyer and negative sentiment w.r.t. his mother. It also covers two issues, i.e. the quality of the product (with a positive sentiment attached) and its cost (with a negative sentiment attached), so deciding which sentiment is more important is also a problem.

2 Current Trends and Techniques
Based on the problems and issues discussed above, the techniques used in sentiment analysis can be categorized as follows:
  • Document-level sentiment classification: In this technique, we identify whether the given document carries a positive or negative sentiment about a topic. Generally, classification techniques are used to solve this problem. The typical features are: (1) terms and their occurrence frequency (for example, Tf-Idf weights) [3], [4], [5], (2) POS tags [2], (3) opinion words and phrases, (4) syntactic dependencies and (5) negative and positive words. Examples: [2], [3], [4], [5] and [6]. A minimal sketch appears after this list.
  • Unsupervised learning: For example, [7] uses a POS tagger to identify two-word phrases and estimates the orientation of the extracted phrases using pointwise mutual information (PMI).
  • Sentence-level sentiment analysis: Techniques using this approach consider each sentence as the source of a single opinion [8], [9]. For a given sentence s, they apply two sub-tasks: (a) subjectivity classification: determine whether s is a subjective or an objective sentence; and (b) sentence-level sentiment classification: if s is subjective, determine whether it expresses a positive or negative opinion.
  • Some other approaches: [10] presents a unified framework in which one can use background lexical information in terms of word-class associations and refine this information for specific domains using any available training examples. [11] analyzes sentiment in financial documents and raises the issue of topic shift. Its analysis of the annotated corpus shows a significant level of topic shift within the collection and illustrates the difficulty that human annotators have when annotating certain sentiment categories. To deal with topic shift within blog articles, it proposes text-extraction techniques to create topic-specific sub-documents, which are used to train a sentiment classifier. It shows that such approaches provide a substantial improvement over full-document classification and that word-based approaches perform better than sentence-based or paragraph-based ones.
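As promised in the first bullet, here is a minimal sketch of document-level sentiment classification with Tf-Idf term features and an off-the-shelf classifier; the four toy reviews and their labels are made up, and a real experiment would use a corpus such as the Multi-Domain Sentiment Dataset mentioned earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: one sentiment label per document.
docs = ["I love this phone, the screen is great",
        "terrible battery, it broke after a week",
        "excellent value and fast delivery",
        "awful customer service, do not buy"]
labels = ["positive", "negative", "positive", "negative"]

# Tf-Idf term weights feed a standard classifier, as in the first bullet above.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["the screen broke and support was awful"]))
```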
2.1 Some Freely Available Tools with Brief Technical Details
Twitter Sentiment[1] (http://twittersentiment.appspot.com/): It is a freely available, simple sentiment analysis tool. It provides the following facilities: 1. brand management (e.g. windows 7), 2. polling (e.g. obama), 3. purchase planning (e.g. kindle), 4. technology planning (e.g. streaming api), 5. discovery (e.g. iphone app). Basic techniques used [2]: it uses N-grams (N = 1, 2 and 3) to identify the emotions attached to Twitter statements, using the Stanford POS tagger. It removes emoticons (icons that express emotion, e.g. smileys), as they can misguide the final solution. Finally, it applies several classifiers, i.e. (1) keyword based, (2) Naive Bayes, (3) maximum entropy based and (4) support vector based, to classify the sentiment of Twitter statements. A minimal sketch of the n-gram pipeline is given below.
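A rough sketch of that pipeline, assuming labeled tweets are available: strip emoticons, extract 1-3 word n-grams and train one of the classifiers (Naive Bayes here). The emoticon pattern and the four example tweets are my own simplifications, not taken from [2].

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

EMOTICONS = re.compile(r"[:;=]-?[)(DPp]")       # crude stand-in pattern

def strip_emoticons(text):
    """Remove emoticons so they cannot misguide the classifier."""
    return EMOTICONS.sub("", text.lower())

tweets = ["I love my new kindle :)", "windows 7 keeps crashing :(",
          "the iphone app is great", "the streaming api is down again"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(
    CountVectorizer(preprocessor=strip_emoticons, ngram_range=(1, 3)),
    MultinomialNB())
model.fit(tweets, labels)
print(model.predict(["my kindle app keeps crashing"]))
```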


LingPipe (http://alias-i.com/lingpipe/index.html) ([3], [4], [5]): LingPipe is a computational-linguistics-based text processing toolkit. It treats sentiment analysis as a classification problem and categorizes the entire problem into two classes:
  • Subjective (opinion) vs. objective (fact) sentences.
  • Positive (favorable) vs. negative (unfavorable) movie reviews.
Method used: It uses the concept of sentence polarity. To determine sentiment polarity, it proposes a machine-learning method that applies text-categorization techniques to just the subjective portions of the document. For this, it uses a subjectivity detector that determines whether each sentence is subjective or not: discarding the objective sentences creates an extract that should better represent a review's subjective content to a default polarity classifier.
Finally, a graph-cut (basically min-cut) algorithm is applied to partition the negative and positive sentiments; a minimal sketch of the min-cut step appears below.
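A minimal sketch of the min-cut step using networkx, with made-up per-sentence subjectivity/objectivity scores and association weights (in the actual method these would come from the subjectivity detector and sentence proximity):

```python
import networkx as nx

def mincut_subjective(ind_subj, ind_obj, assoc):
    """Split sentences into subjective vs. objective with a minimum s-t cut.

    ind_subj[i] / ind_obj[i]: individual scores for sentence i (assumed to come
    from a subjectivity detector); assoc[(i, j)]: association weight that
    penalizes putting sentences i and j on different sides of the cut.
    """
    G = nx.DiGraph()
    for i, (s, o) in enumerate(zip(ind_subj, ind_obj)):
        G.add_edge("SRC", i, capacity=s)      # cost of labeling i objective
        G.add_edge(i, "SNK", capacity=o)      # cost of labeling i subjective
    for (i, j), w in assoc.items():
        G.add_edge(i, j, capacity=w)          # cost of separating i and j
        G.add_edge(j, i, capacity=w)
    _, (src_side, _) = nx.minimum_cut(G, "SRC", "SNK")
    return sorted(n for n in src_side if n != "SRC")

# Three sentences: 0 and 1 look subjective; 2 looks objective but is pulled
# to the subjective side by its strong association with sentence 1.
print(mincut_subjective([0.9, 0.8, 0.3], [0.1, 0.2, 0.7],
                        {(0, 1): 0.5, (1, 2): 0.6}))
```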

3 Automated Sentiment Analysis: Reality

The current report on automated sentiment analysis tools says: “Automated sentiment analysis is less accurate than flipping a coin when it comes to determining whether brand mentions in social media are positive or negative, according to a white paper from FreshMinds [1]”.
Tests of a range of different social media monitoring tools conducted by the research consultancy found that comments were, on average, correctly categorized only 30% of the time.
FreshMinds’ experiment involved tools from Alterian (http://www.alterian.com/), Biz360 (http://www.biz360.com/), Brandwatch (http://www.brandwatch.com/), Nielsen (http://www.nielsen.com/), Radian6 (http://www.radian6.com/), Scoutlabs (http://www.scoutlabs.com/) and Sysomos (http://www.sysomos.com/). The products were tested on how well they assessed comments made about the coffee chain Starbucks, with the comments also having been manually coded.
On aggregate the results look good, said FreshMinds. Accuracy levels were between 60% and 80% when the automated tools were reporting whether a brand mention was either positive, negative or neutral.
“However, this masks what is really going on here,” writes Matt Rhodes, a director of sister company FreshNetworks, in a blog post. “In our test case on the Starbucks brand, approximately 80% of all comments we found were neutral in nature.”

3.1 Some Other Limitations
Sentiment analysis depends on many underlying techniques, including (1) data-mining techniques (classification, clustering, etc., none of which are 100% accurate), (2) linguistic techniques (POS taggers, dictionaries, lexical analyzers, etc., none of which are 100% accurate) and (3) predefined opinion words or tags (which can often misguide us), so its accuracy is bounded by theirs.
Similarly, due to the presence of many noisy statements in the datasets, it becomes harder to achieve highly reliable results.

4 Current Research Problems
In this section I present some problems and issues which require more attention to achieve better results in the field of sentiment analysis.
1. Instead of concentrating only on (a) document-level, (b) paragraph-level, (c) sentence-level or (d) feature-based approaches, can a better combination of all of the techniques discussed above give better results?
2. Topic shift within sentences is still little studied for sentiment analysis.
3. We generally use sentiment labels ranging from (1) very negative to very positive (very negative, negative, neutral, positive, very positive), (2) negative to positive, (3) negative to neutral, (4) positive to neutral, etc. Most of the papers I have read use this kind of label shift in classification. Such shifting should have some effect, and modeling it may give more effective results.
References
1. Turning Conversations into Insights: A Comparison of Social Media Monitoring Tools. White paper, FreshMinds Research, 14 May 2010. www.freshminds.co.uk.
2. Alec Go; Richa Bhayani; Lei Huang; Twitter Sentiment Classification using Distant Supervision; Technical report, Stanford University. 
3. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP Proceedings. 
4. Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL Proceedings. 
5. Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. ACL Proceedings. 
6. Chenghua Lin, Yulan He;Joint Sentiment/Topic Model for Sentiment Analysis; CIKM’09, November 2–6, 2009, Hong Kong, China.Copyright 2009 ACM 978-1-60558-512-3/09/11. 
7. P. Turney, “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews,” Proceedings of the Association for Computational Linguistics (ACL), pp. 417–424, 2002.
8. R. Ghani, K. Probst, Y. Liu, M. Krema, and A. Fano, “Text mining for product attribute extraction,” SIGKDD Explorations Newsletter, vol. 8, pp. 41–48, 2006. 
9. E. Riloff, S. Patwardhan, and J. Wiebe, “Feature subsumption for opinion analysis,” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006.
10. Prem Melville, Wojciech Gryc, Richard D. Lawrence; Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification;KDD’09, June 28–July 1, 2009, Paris, France.Copyright 2009 ACM 978-1-60558-495-9/09/06. 
11. Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Páraic Sheridan, Cathal Gurrin, Alan F. Smeaton; Topic-Dependent Sentiment Analysis of Financial Blogs; TSA’09, November 6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-805-6/09/11.

[1] http://twittersentiment.appspot.com/