Author: Claudia Wagner | Published: 16th December 2009

Twitter messages (so-called tweets) contain up to 140 characters. Messages are therefore very short, which often makes it difficult to estimate their topics.

Hashtags are a powerful way of assigning additional information (e.g. context information) to tweets. I think that hashtags provide a rich source of information and should therefore, where available, be exploited for estimating the topic of a tweet.

Using hashtags to estimate the topics of tweets requires that the topics or concepts denoted by these hashtags have been estimated beforehand. For humans it is often possible to guess what a hashtag label means by simply looking at the label or at the Twitter stream belonging to that hashtag. For machines, however, it is very difficult to compute the meaning of hashtags.

Different approaches exist to mine meaning from text in general:

(1) Net-based measures

  • use lexical-semantic nets like WordNet
  • map labels to WordNet concepts
  • use systematic semantic relations such as hyponymy
  • Problem: unsystematic connections (i.e. associations) cannot be computed directly

(2) Distributional measures

  • consider semantics on the basis of similar distributional properties of words in large text corpora
  • deduce semantic relations on the basis of co-occurrences of features from text or web data
  • simple word co-occurrence and word-context co-occurrence can be measured
  • Problem: depends heavily on text corpus

(3) Wikipedia based approaches

  • comprise statistics from the entire text collection
  • induce the category taxonomy using classical graph algorithms
  • Problem: vocabulary from one community (in this case Wikipedia) is used to classify data from another community (e.g. Twitter), so the overlap between the community vocabularies is important

(4) Ontology based approaches

  • use domain ontologies
  • map labels to ontology concepts
  • use semantic relations between concepts such as isA or sameAs
  • Problem: results depend on the ontology and the overlap between ontology and concepts
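
To make the distributional idea in (2) a bit more concrete, here is a minimal Python sketch of simple word co-occurrence counting within single messages. The function name `cooccurrence_counts` and the sample tweets are made up for illustration; a real analysis would need proper tokenisation, stop-word handling and a much larger corpus.

```python
from collections import Counter
from itertools import combinations
import re

def cooccurrence_counts(messages):
    """Count how often two distinct tokens appear in the same message."""
    counts = Counter()
    for message in messages:
        # Lowercase, keep words and hashtags, and deduplicate per message.
        tokens = sorted(set(re.findall(r"#?\w+", message.lower())))
        # Each unordered token pair is stored as an alphabetically sorted tuple.
        for pair in combinations(tokens, 2):
            counts[pair] += 1
    return counts

# Hypothetical sample data, not taken from any collected stream.
tweets = [
    "Linked data talk at #semanticweb meetup",
    "New #semanticweb paper on linked data",
    "Coffee first",
]
counts = cooccurrence_counts(tweets)
print(counts[("data", "linked")])
print(counts[("#semanticweb", "linked")])
```

Pairs that co-occur often (here "linked" and "data", or "#semanticweb" and "linked") are candidates for a semantic relation, which also shows why the result depends so heavily on the corpus.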

To gain some insight into the quality of hashtag streams, I started analyzing the Twitter stream for the hashtag “#semanticweb”. 697 tweets were collected during November 2009.

The following diagrams show the word frequency distribution over the 40 most frequent words, hashtags, links and users in the “semanticweb” hashtag stream:

I also computed relations between these words, but I have not yet had time to create nice plots.
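
The counts behind such frequency diagrams could be computed roughly as in the following Python sketch. The regular expressions, the function name `frequency_tables` and the sample tweets are assumptions for illustration only and do not reproduce the original analysis.

```python
from collections import Counter
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
USER_RE = re.compile(r"@\w+")
WORD_RE = re.compile(r"[a-z]+")

def frequency_tables(tweets, top_n=40):
    """Return the top_n most frequent words, hashtags, links and users."""
    words, hashtags, links, users = Counter(), Counter(), Counter(), Counter()
    for tweet in tweets:
        links.update(URL_RE.findall(tweet))
        # Drop links first so URL fragments are not counted as hashtags.
        text = URL_RE.sub(" ", tweet).lower()
        hashtags.update(HASHTAG_RE.findall(text))
        users.update(USER_RE.findall(text))
        # Remove hashtags and mentions before counting plain words.
        text = USER_RE.sub(" ", HASHTAG_RE.sub(" ", text))
        words.update(WORD_RE.findall(text))
    return {
        "words": words.most_common(top_n),
        "hashtags": hashtags.most_common(top_n),
        "links": links.most_common(top_n),
        "users": users.most_common(top_n),
    }

# Hypothetical sample data.
sample = [
    "RT @alice: new #semanticweb slides http://example.org/slides",
    "@bob reading about #semanticweb and #linkeddata",
    "new slides on #linkeddata by @alice",
]
tables = frequency_tables(sample, top_n=3)
print(tables["users"])
```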

However, the question now is how we can learn ontologies from Twitter streams, and how existing approaches perform on sparse tweets.

Author: Claudia Wagner | Published: 12th December 2009

The first LinkedData camp took place in Vienna on Monday, Nov 30, and Tuesday, Dec 1, 2009. Thanks to the organizers and sponsors it was a great event: interesting keynotes, a nice bunch of people and enough Punsch to warm up :)

I was mainly working with Jelena Jovanovic and Nikola Milikic on the topic of expert search and profiling on the Social Semantic Web. We thought this topic was worth discussing, because an author’s expertise is a good criterion for estimating the quality of Social Web content, and estimating that quality is an important problem, because everyone can easily publish content on the Social Web (which increases the information overload problem). The fact that content published on the Social Web tends to become shorter (see microblog posts or status updates) makes content-based quality estimation difficult. The expertise of an author, however, is an indicator which can also be applied successfully to very short content.

During the first day of the camp we discussed the role of Linked Data in the context of expert search on the Social Web, i.e. how Linked Data can help to estimate the expertise of a user.

One potential of Linked Data in this context is of course that it allows the distributed content of a user to be interlinked. That means expertise mining tools can benefit from exploiting Linked Data in order to get a more complete picture of a user, his content and his activities on various platforms.

However, we also discussed how Linked Data can help to mine topics from Social Web content. Expertise is topic-sensitive. That means, if we want to mine expertise from all items generated by a user (or items about a user), we need to estimate the topics of these items (if they are not explicitly defined). Linked Data is exploited by various Named Entity Recognition services such as OpenCalais, Zemanta or Muddy Boots. However, we are not aware of any approach using Linked Data as background knowledge to mine topics from text (via computing semantic relatedness between Linked Data concepts and any input text).

On the second day we thought about expertise heuristics for the Social Web and we came up with the following expertise heuristic categories:

(1) Connection-based heuristics:

These heuristics are based on analysing the explicit/implicit social network of an expert candidate. The assumption of this class of heuristics is that experts tend to be connected with other experts.

For example, one concrete heuristic belonging to this class would be: Experts about topic X may have more conversations/discussions with other experts about topic X than non-experts have.

(2) Heuristics based on user pragmatics:

These heuristics are based on analysing the usage behaviour of an expert candidate. The assumption of this class of heuristics is that experts tend to behave similarly (in certain situations or certain time intervals).

For example, one concrete heuristic belonging to this class would be: Experts tend to receive (and/or answer) more questions than non-experts.

(3) Heuristics based on the semantics of content:

These heuristics analyse the semantics of content about expert candidates or content published by expert candidates. The assumption of this class of heuristics is that experts about topic X will use a similar vocabulary or writing style if they talk about X.

For example, two concrete heuristics belonging to this class would be: Experts tend to use a higher level of detail when they talk about their topic of expertise, and will therefore use a bigger vocabulary than non-experts when talking about X. Experts may also tend to publish “informative content” about their topic of expertise rather than questions or advertisements.
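
As a rough illustration of the two content-based heuristics, the following Python sketch computes a vocabulary size and a share of question posts for one author's posts about a topic. The function name `content_features`, the feature definitions and the sample posts are assumptions, not something we worked out at the camp; in particular, raw vocabulary size grows with the amount of text and would need normalising in practice.

```python
import re

def content_features(posts):
    """Compute two simple content-based signals for one author's posts on a topic."""
    vocabulary = set()
    questions = 0
    for post in posts:
        # Distinct lowercase words approximate the "size of the vocabulary".
        vocabulary.update(re.findall(r"[a-z]+", post.lower()))
        # A post ending in "?" is treated as a question, not informative content.
        if post.rstrip().endswith("?"):
            questions += 1
    return {
        "vocabulary_size": len(vocabulary),
        "question_ratio": questions / len(posts) if posts else 0.0,
    }

# Hypothetical posts by two expert candidates.
candidate_a = [
    "SPARQL endpoints expose RDF graphs via HTTP",
    "Use named graphs to track provenance of triples",
]
candidate_b = ["What is RDF?", "What is SPARQL?"]

features_a = content_features(candidate_a)
features_b = content_features(candidate_b)
print(features_a, features_b)
```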

Our conclusion after these two days was that we see a clear benefit in using Linked Data for mining expertise from Social Web data. Nevertheless, to the best of our knowledge, no systems or tools exist which exploit Linked Data for expert search and explore the potential benefits and pitfalls. If anyone is aware of work in this area, please drop me a line.

Author: Claudia Wagner | Published: 07th June 2009

After a very exciting week in Crete I am back in Graz. Interesting people, interesting talks and a great atmosphere!

I put my pics on flickr: http://www.flickr.com/photos/31208015@N04/

Olaf Hartig won the award for the best paper in the research track with his paper “Querying Trust in RDF Data with tSPARQL”.

CONGRATULATIONS :)

Georgi Kobilarov et al. won the award for the best paper in the use track with their paper “Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections”. I really enjoyed the presentation Georgi gave together with Silver Oliver! Slides can be found here.

Unfortunately there is no award for the most entertaining talk.

In this category I would definitely nominate Harry Halpin for his talk about the Identity Crisis which he gave together with Valentina Presutti.

I also really enjoyed the lightning talks organized by Michael Hausenblas. Slides can be found here.

Author: Claudia Wagner | Published: 05th March 2009

Last weekend I left KMi, and I am looking back on 5 very interesting months. KMi is a wonderful workplace with very nice people! It was really a good experience and I plan to come back for a visit soon :)

Author: Claudia Wagner | Published: 05th March 2009

I just extended the SIOC WP Exporter a bit more.

Now it pings Sindice. You can enable or disable this via the admin settings. Get it here.

There is a new ARC version from 2009-03-05, which is used by the exporter.

Author: Claudia Wagner | Published: 19th February 2009

Somehow I have some problems with the sioc:content property, because the fact that it is limited to plain text seems to limit me as well :)

How should I model, for example, a blog post which consists only (or mainly) of images? I can describe each image itself with the FOAF or Exif ontology, ok. But how do I express that these images are the content of the post? sioc:embeds would be a possibility, and I am using it at the moment in my extended SIOC WP exporter.

But why am I not happy with this solution?

First, I have no way to model relations between different content items (text blocks, images, videos), and positional information (which I might need to serialize the RDF model of a post in XHTML/RDFa, see also the posting from Uldis) gets lost when a post is exported as RDF. The triples of a post only expose that some multimedia content and some unstructured text are contained in the post. If we look at the HTML representation of a post, we get a richer picture: we have sections, text belonging to images and so on…

I suggest changing the range of the sioc:content property from rdfs:Literal to rdfs:Resource (or sioc:Item). This would allow the content of sioc:Items to be modelled at a finer level of granularity. For example, the relations between individual content items (positional relations such as sioc:next and sioc:before, and argumentative relations such as sioc:arises_from and sioc:agrees_with) could then be exposed as well.

My second point is that the name sioc:content seems confusing to me. A sioc:Item can be anything stored in a sioc:Container, and the SIOC types module introduces containers such as ImageGallery, Video Channel and so on. I wonder if a property sioc:content which can only hold plain text can ever expose the content of the items stored in these containers?

Maybe it would be better to rename the current sioc:content property (which has rdfs:Literal as its range) to sioc:text_content or sioc:text, and to have another property sioc:content whose range is rdfs:Resource.
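
A rough sketch of how such a model could be consumed, with triples written as plain Python tuples. The `ex:` URIs, the use of sioc:content with resources as objects, sioc:text_content and sioc:next between content items all follow the proposal above; none of them is claimed to be part of the published SIOC vocabulary, and the helper `ordered_content` is purely illustrative.

```python
# Triples as (subject, predicate, object) tuples; every ex: name is made up.
triples = [
    ("ex:post1", "rdf:type", "sioc:Post"),
    ("ex:post1", "sioc:content", "ex:post1#text1"),
    ("ex:post1", "sioc:content", "ex:post1#img1"),
    ("ex:post1", "sioc:content", "ex:post1#text2"),
    ("ex:post1#text1", "sioc:text_content", "Our hike started here."),
    ("ex:post1#img1", "rdf:type", "foaf:Image"),
    ("ex:post1#text2", "sioc:text_content", "And this was the summit."),
    ("ex:post1#text1", "sioc:next", "ex:post1#img1"),
    ("ex:post1#img1", "sioc:next", "ex:post1#text2"),
]

def ordered_content(post, triples):
    """Return the post's content items in the order given by sioc:next links."""
    items = {o for s, p, o in triples if s == post and p == "sioc:content"}
    next_of = {s: o for s, p, o in triples if p == "sioc:next" and s in items}
    # The first item is the one that no other item points to.
    first = (items - set(next_of.values())).pop()
    order = [first]
    # The length guard stops the walk if the links accidentally form a cycle.
    while order[-1] in next_of and len(order) < len(items):
        order.append(next_of[order[-1]])
    return order

print(ordered_content("ex:post1", triples))
```

With content items as resources, the positional information that is lost in a plain-text sioc:content literal can be recovered from the triples alone.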

In the end I would like to be able to express something like this:


Author: Claudia Wagner | Published: 16th February 2009

I extended Uldis Bojars’ WP SIOC exporter to also export semantic metadata which is embedded in the HTML content of a posting.

Why is this useful?

Semantic metadata embedded in the HTML of a posting’s content can reveal more information about the topic of a posting (i.e. what a posting is about). Tools such as Structured Blogging (http://structuredblogging.org), or the Semantic Reblog prototype I am working on, embed semantic metadata directly into the HTML content of postings.

The WP SIOC Exporter currently relates the whole plain text of a post’s content to the resource representing the post via the sioc:content property. The HTML representation of the post’s content is related to the resource representing the post via the content:encoded property. Additionally, links are extracted from the post’s content and related to the post via the sioc:links_to property.

My extended version of the exporter also extracts images from a post’s content and relates them to the resource representing the post via the sioc:embeds property. If the image is a flickr image, an rdfs:seeAlso link is generated that points to the RDF description of the image obtained via Masahide Kanzaki’s wrapper. Furthermore, semantic metadata which is embedded in the HTML content of a post is extracted and related to the post via a sioc:embeds property. I am not sure if sioc:embeds is the best property to relate the embedded entities to their container post. Maybe something like sioc:topic would be better. However, the URIs of the embedded resources are related to the post URI, and the parts of the resource description which have been embedded are also exposed. If only parts of a resource’s description are reused or embedded in a post’s content, it might be interesting for machines to know which parts these are, and if the reused or embedded resource is described via microformats, it might not have a URI which identifies it.

I use the ARC2 library (version from 2009-02-12; it is important to use this version or higher), which provides a parser to extract different embedded semantic metadata formats such as RDFa, eRDF and microformats. I modified the method declaration of the toRDFXML method in the ARC2_Class.php file. That’s why, at the moment, “my” version of ARC2_Class.php must be included in the SIOC Exporter arc folder. But Benjamin already told me that the modification will be included in the next ARC2 version.
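
The extraction step described above can be sketched in a few lines. The exporter itself is a Wordpress (PHP) plugin, so the following Python version is only an illustration of the idea, not its actual code; the class name `PostContentParser`, the function `post_triples` and the example URIs are made up.

```python
from html.parser import HTMLParser

class PostContentParser(HTMLParser):
    """Collect image sources and link targets from a post's HTML content."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

def post_triples(post_uri, html):
    """Relate embedded images and outgoing links to the post resource."""
    parser = PostContentParser()
    parser.feed(html)
    parser.close()
    triples = [(post_uri, "sioc:embeds", src) for src in parser.images]
    triples += [(post_uri, "sioc:links_to", href) for href in parser.links]
    return triples

# Hypothetical post content.
html = (
    '<p>See <a href="http://example.org/a">this</a> '
    '<img src="http://example.org/pic.jpg" alt="" /></p>'
)
result = post_triples("http://example.org/post/1", html)
for triple in result:
    print(triple)
```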

If you fancy testing this version of the WP SIOC Exporter, download it here.

Any thoughts are of course welcome!

Author: Claudia Wagner | Published: 15th January 2009

A few days ago I decided to write a simple Wordpress Plugin which annotates the content of the current page with RDFa and does not require the theme to be changed manually (except maybe the header.php).

At first I thought the plugin would consist of a few filter functions which filter the current theme and enrich some template tags (such as the_content(), the_title()) with RDFa annotations (e.g. the the_content() tag should be substituted with <span about="<?php the_permalink() ?>" property="sioc:content"><?php the_content() ?></span>).

In theory that was a good idea, but I didn’t consider that I must check in which context the template tags appear :). For example, the the_title() tag is often used inside the <a> HTML tag to expose the link title. These titles should of course not be enriched with RDFa.
It turned out to be impossible (or at least very difficult) to implement such a context-aware filter function within a Wordpress Plugin.

So I had to change the strategy. As automatic generation of RDFa was not possible, I considered doing it manually. Gautier Poupeau describes here how to manually create RDFa annotations for WP themes. Well, the first disadvantage of manually creating RDFa is of course that most people will not do it, and those who do need a certain knowledge of RDFa. Furthermore, making all these annotations manually seems to be a very error-prone process. I also wondered how much information should be annotated with RDFa. Only displayed items? Or should theme developers also include hidden elements to make more information available to machines? Some themes do not include certain information items (such as the author’s full name, email and so on) simply because of the layout. Is it desirable in this case that this information is also unavailable to machines?

I think that by default (i.e. if the data owner does not explicitly define something else) all data should be available in a machine-readable way. The application which renders the data then decides, based on whatever criteria it likes (e.g. layout), which data is finally displayed.

That is why I finally decided to manually add only a few RDFa annotations to my blog, which simply expose which DOM nodes belong to which externally described RDF resource:

<div class="post" id="post-<?php the_ID(); ?>" about="<?php the_permalink(); ?>" typeof="sioc:Post">
<a rel="rdfs:seeAlso" href="<?php echo $post_uri; ?>" type="application/rdf+xml" />

This solution has the advantage that the RDFa annotation is added only at those positions where a Post or a Comment is described. This very sparse RDFa information is, on the one hand, enough to get the positional information someone might need (e.g. if something is selected and should be stored to a web clipboard) and, on the other hand, can be added quickly without cluttering the theme.

Author: Claudia Wagner | Published: 12th January 2009

I will quickly try to summarize the state of my work for my supervision meeting today.


Done:

*) Identified and described Data Portability problems in the Social Web –> argued that Semantic Web Technologies provide the required mechanisms to overcome the identified limitations

*) Developed Use Cases and derived requirements

*) Checked if existing Portability Tools can answer my requirements


To Do now:

Develop demonstrator which shows how Semantic Web technologies can be used to overcome the identified limitations

Demonstrator consists of:

*) Browser Extension which allows content and resources of a source application site to be extracted and reused on a specific target application site. The target application site must be a Wordpress Blog, because the Browser Extension implements the WP API. The source application site must embed the semantic metadata via RDFa. External semantic metadata is not used by the Browser Extension, because it is difficult to find out to which resource of an external RDF file a selected content part belongs. The positional information provided via the RDFa serialization makes it easy to find out to which resource a selected content part belongs. For Wordpress only an RDF/XML exporter exists, but no RDFa Exporter or Embedder.

*) Wordpress Plugin which embeds RDFa into the selected theme. If external semantic metadata also exists in an RDF file, the RDFa metadata can simply link to it.

*) Modify the Wordpress Press-This component to allow selected content to be reused by value and resources to be reused by embedding references. If content is reused by value, the Press-This component should expose from which resource (not page!) the content originates. If the content is reused by reference, the Press-This component embeds the URI of the external resource in the post content and marks it via a sioc:embeds attribute.

*) Wordpress Plugin (simple Filter) which allows to reuse embedded resources by reference. Whenever a sioc:embeds RDFa attribute is found inside the content of a posting, the description of the posting is fetched and at least the content of the external post is displayed.

Author: Claudia Wagner | Published: 09th January 2009

I just read this interesting blog posting from Uldis, in which he talks about automatically creating connections between an HTML page and its associated external RDF file via RDFa. A lot of web sites expose their data via HTML pages and external RDF files, which are only connected via one link.

For example, the HTML pages of this blog are connected with their external RDF data (generated by the SIOC Wordpress Exporter) via this link:

The basic question is how can the DOM nodes of an HTML page be automatically connected (e.g. via RDFa attributes) with the external description of the resource to which the nodes belong?

I like this idea of creating additional connections between HTML sites and related RDF files, because

1) they enrich the external RDF data with positional information (layout information) -> an agent can then look up, for example, in which order some embedded resources appear in the content of a certain blog post

2) they make it easier to find the semantic metadata belonging to a certain piece of HTML content -> for example, think about a situation in which a user selects some content of an HTML page and a client-side application wants to get the semantic metadata of the resource to which the content belongs. The client-side application knows to which DOM node the selected content belongs and could exploit the additional information which relates certain DOM nodes to external RDF information.

3) they make it possible to have different amounts of machine-readable and human-readable data which are still connected. This can be desirable if, for example, some data is simply not displayed on the HTML pages in order not to break the layout and design of a page. In this case there is no reason to withhold this data from machines as well. If all semantic metadata is embedded directly in the HTML page via RDFa, then the amounts of machine-readable and human-readable data must be the same, or the HTML page must be cluttered with a lot of hidden HTML tags.
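
To illustrate point 2, here is a minimal Python sketch of how an application could map a selected piece of text to the resource it belongs to, by walking up to the nearest enclosing element that carries an RDFa `about` attribute. The class `ResourceLocator`, the function `resource_for` and the sample page are assumptions for illustration; the sketch expects reasonably well-formed markup and matches the selection against single text nodes only.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class ResourceLocator(HTMLParser):
    """Map text nodes to the nearest enclosing element with an `about` attribute."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open element: its `about` value or None
        self.text_to_resource = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(dict(attrs).get("about"))

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing elements never contain text

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Walk up from the innermost open element to the first `about`.
            resource = next((a for a in reversed(self.stack) if a), None)
            self.text_to_resource.append((text, resource))

def resource_for(html, selected_text):
    """Return the URI of the resource whose markup contains selected_text."""
    locator = ResourceLocator()
    locator.feed(html)
    locator.close()
    for text, resource in locator.text_to_resource:
        if selected_text in text:
            return resource
    return None

# Hypothetical page using the sparse annotations discussed in these posts.
page = """
<div class="post" about="http://example.org/post/1" typeof="sioc:Post">
  <h2>First post</h2>
  <p>Hello <em>semantic</em> world</p>
</div>
<div class="post" about="http://example.org/post/2" typeof="sioc:Post">
  <p>Second entry</p>
</div>
<p>Footer text</p>
"""
print(resource_for(page, "semantic"))
```

Once the resource URI is known, the application can follow the link to the external RDF file to fetch the full description, including data that is not displayed on the page.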