Mining meaning from Twitter Hashtags or do you know what #swr means?
Author: Claudia Wagner | Published: 16th December 2009
Twitter messages (so-called tweets) contain up to 140 characters. Because the messages are so short, it is often difficult to estimate their topics.
Hashtags are a powerful way of assigning additional information (e.g. context information) to tweets. I think that hashtags provide a rich source of information and should therefore (if available) be exploited for estimating the topic of a tweet.
Using hashtags to estimate the topics of tweets presupposes that the topics or concepts denoted by those hashtags have already been estimated. Humans can often guess what a hashtag means simply by looking at the label or at the Twitter stream that belongs to it. For machines, however, computing the meaning of a hashtag is very difficult.
Different approaches exist to mine meaning from text in general:
(1) Net-based measures
- use lexical-semantic nets like WordNet
- map labels to WordNet concepts
- use systematic semantic relations such as hyponymy
- Problem: unsystematic connections (i.e. associations) cannot be computed directly
(2) Distributional measures
- consider semantics on the basis of similar distributional properties of words in large text corpora
- deduce semantic relations on the basis of co-occurrences of features from text or web data
- simple word co-occurrence and word-context co-occurrence can be measured
- Problem: results depend heavily on the text corpus
(3) Wikipedia based approaches
- comprise statistics from the entire text collection
- induce the category taxonomy using classical graph algorithms
- Problem: the vocabulary of one community (in this case Wikipedia) is used to classify data from another community (e.g. Twitter), so the overlap between the two community vocabularies is important
(4) Ontology based approaches
- use domain ontologies
- map labels to ontology concepts
- use semantic relations between concepts such as isA or sameAs
- Problem: results depend on the ontology and the overlap between ontology and concepts
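To make the distributional idea from (2) concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method used for the analysis in this post: the whitespace tokenizer, the window size of 2, the toy tweets and the function names `context_vectors` and `cosine` are all hypothetical choices. Each token is described by the words that appear near it, and two tokens are compared by the cosine of their context vectors.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tweets, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tweet in tweets:
        tokens = tweet.lower().split()  # naive whitespace tokenizer (assumption)
        for i, token in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy data, invented for illustration only.
toy_tweets = [
    "reading about #semanticweb and rdf today",
    "reading about #linkeddata and rdf today",
    "coffee with friends this morning",
]

vectors = context_vectors(toy_tweets)
similarity = cosine(vectors["#semanticweb"], vectors["#linkeddata"])
unrelated = cosine(vectors["#semanticweb"], vectors["coffee"])
print(similarity, unrelated)
```

In the toy data the two hashtags occur in identical contexts, so they come out as maximally similar, while "coffee" shares no context with "#semanticweb". With only a few hundred short tweets the vectors will be very sparse, which is exactly the corpus dependence noted as the problem of this family of measures.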
To gain some insight into the quality of hashtag streams, I started analyzing the Twitter stream for the hashtag “#semanticweb”. 697 tweets were collected during November 2009.
The following diagrams show the frequency distribution over the 40 most frequent words, hashtags, links and users in the “semanticweb” hashtag stream.
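Frequency tables of this kind could be computed roughly as sketched below. This is a hedged example rather than the script behind the diagrams: the regular expressions, the function name `frequency_tables` and the toy tweets are assumptions, and "users" is interpreted here as @-mentions inside the tweet text, which may differ from how users were counted for the plots.

```python
import re
from collections import Counter

# Simple patterns, chosen for illustration; real tweets need more careful handling.
LINK = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
WORD = re.compile(r"[a-z']+")


def frequency_tables(tweets, top_n=40):
    """Return the top_n most frequent words, hashtags, links and mentioned users."""
    words, hashtags, links, users = Counter(), Counter(), Counter(), Counter()
    for tweet in tweets:
        # Extract links first, with their original casing, then drop them from the text.
        links.update(LINK.findall(tweet))
        text = LINK.sub(" ", tweet).lower()
        hashtags.update(HASHTAG.findall(text))
        users.update(MENTION.findall(text))
        # Remove hashtags and mentions so they are not counted again as plain words.
        text = MENTION.sub(" ", HASHTAG.sub(" ", text))
        words.update(WORD.findall(text))
    return {
        "words": words.most_common(top_n),
        "hashtags": hashtags.most_common(top_n),
        "links": links.most_common(top_n),
        "users": users.most_common(top_n),
    }


# Toy data, invented for illustration only.
toy_tweets = [
    "Great intro to RDF http://example.org/rdf #semanticweb",
    "@alice RDF and OWL basics #semanticweb #rdf",
    "@alice thanks! http://example.org/rdf",
]

tables = frequency_tables(toy_tweets)
for name, rows in tables.items():
    print(name, rows[:3])
```

Counting the four kinds of token separately keeps hashtags and links from dominating the word list, and no stop-word filtering is applied in this sketch.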
I also computed relations between these words, but I have not yet had time to create nice plots.
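One simple kind of relation between words is how often two of them appear in the same tweet. The sketch below counts such pairs. It is an assumption about what a "relation" could look like, not necessarily the relations computed for this post, and the name `cooccurrence_counts` and the toy tweets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tweets):
    """Count, for every unordered token pair, the number of tweets containing both."""
    pairs = Counter()
    for tweet in tweets:
        # Sort the unique tokens so each pair is always stored in the same order.
        tokens = sorted(set(tweet.lower().split()))
        pairs.update(combinations(tokens, 2))
    return pairs


# Toy data, invented for illustration only.
toy_tweets = [
    "rdf owl #semanticweb",
    "rdf sparql #semanticweb",
    "owl reasoning",
]

pairs = cooccurrence_counts(toy_tweets)
for pair, count in pairs.most_common(3):
    print(pair, count)
```

The resulting pair counts can be read as a weighted graph over the vocabulary, which is one possible starting point for plotting relations.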
The question now is: how can we learn ontologies from Twitter streams, and how do existing approaches perform on sparse tweets?





