At the 17th International World Wide Web Conference (organized by the World Wide Web Conferences Steering Committee and held in Beijing, April 21-25 this year), Ivan Herman, the Semantic Web Activity Lead at the World Wide Web Consortium (W3C), gave an introductory presentation on the Semantic Web, built around a worked example.

About the current web
The present web (call it Web 1.0 or Web 2.0) represents information using natural language (English, Hungarian, …), graphics, multimedia, and page layout, all of which we humans can process easily. Many tasks on the internet require combining data from across the Web (e.g. hotel and travel information may come from different sites, or a search may span different digital libraries). Humans combine this information easily, even when different terminologies are used! But machines are ignorant! Partial information is unusable; it is difficult to make sense of, e.g., an image; drawing analogies automatically is difficult; it is also difficult to combine information automatically (is one term the same as another?). How should different XML hierarchies be combined?

What is needed to enable the Semantic Web?
(Some) data should be available to machines for further processing. It should be possible to combine and merge data on a Web scale. Sometimes data describes other data, and sometimes the data is to be exchanged by itself. Machines may also need to reason about that data.

Ivan Herman gives a simplistic example (data integration) to introduce the main Semantic Web concepts.

  1. Map the various data onto an abstract data representation: make the data independent of its internal representation.
  2. Merge the resulting representations.
  3. Start making queries on the whole! Queries that could not have been done on the individual data sets.

1. Export your data as a set of relations:

Relations form a graph: the nodes refer to the “real” data or contain some literal; how the graph is represented in machine is immaterial for now. Data export does not necessarily mean physical conversion of the data: relations can be generated on-the-fly at query time via SQL “bridges”, scraping HTML pages, extracting data from Excel sheets, etc. One can export only part of the data.
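As a minimal sketch of this export step, each relation can be modelled as a (subject, relation, value) triple produced on demand from an existing record. The row contents, the `urn:isbn:` identifier, the `a:title` and `a:name` relation names, and the `export_book` helper are illustrative assumptions, not taken from the presentation.

```python
# A hypothetical record, as it might sit in a SQL table or a spreadsheet row.
ROW = {
    "isbn": "000651409X",  # illustrative identifier
    "title": "The Glass Palace",
    "year": 2000,
    "author_name": "Amitav Ghosh",
}


def export_book(row):
    """Yield the row as (subject, relation, value) triples.

    Nothing is converted or stored: the relations are generated on the fly
    each time the function is called, and only part of the row is exported
    (the year is deliberately left out).
    """
    book = "urn:isbn:" + row["isbn"]       # node standing for the "real" book
    author = book + "#author"              # node standing for the author
    yield (book, "a:title", row["title"])  # edge from a node to a literal
    yield (book, "a:author", author)       # edge from a node to another node
    yield (author, "a:name", row["author_name"])


# The graph is simply the set of exported relations.
graph_a = set(export_book(ROW))
```

How the graph is stored (here, a Python set) is incidental; what matters is that every relation is named and every node has an identifier.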

2. Export your second set of data:

3. Start merging your data:

… and merge identical resources:

Now you can start making queries: a user of the data for the book “Le Palais des miroirs” (yellow graph) can now ask queries like “donne-moi le titre de l’original” (“give me the title of the original”). This information is not in the dataset for “Le Palais des miroirs” (yellow graph), but it can be retrieved by merging with the dataset for “The Glass Palace” (blue graph).
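The merge and the query can be sketched in a few lines, assuming both datasets are sets of (subject, relation, value) triples. The node names (`f:livre1`, `a:person1`), the relations `f:titre`, `f:original` and `a:title`, and the `urn:isbn:` identifier are hypothetical stand-ins for whatever the real datasets use.

```python
BOOK = "urn:isbn:000651409X"  # illustrative identifier shared by both datasets

# Dataset "A" (blue graph): the English original.
graph_a = {
    (BOOK, "a:title", "The Glass Palace"),
    (BOOK, "a:author", "a:person1"),
}

# Dataset "F" (yellow graph): the French translation, pointing at the original.
graph_f = {
    ("f:livre1", "f:titre", "Le Palais des miroirs"),
    ("f:livre1", "f:original", BOOK),
}

# Merging is a set union. Because BOOK is the same identifier in both
# datasets, the two occurrences become one node of the merged graph.
merged = graph_a | graph_f


def objects(graph, subject, relation):
    """Return every value reachable from `subject` along `relation`."""
    return {o for s, p, o in graph if s == subject and p == relation}


def original_titles(graph, french_title):
    """Answer "give me the title of the original" for a French title."""
    titles = set()
    for s, p, o in graph:
        if p == "f:titre" and o == french_title:
            for original in objects(graph, s, "f:original"):
                titles |= objects(graph, original, "a:title")
    return titles
```

Run against `graph_f` alone, `original_titles` finds nothing, because the title of the original lives only in dataset "A"; run against `merged`, it follows the shared identifier across the two datasets.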

But even more can be achieved. We “feel” that a:author and f:auteur should be the same, but an automatic merge does not know that! Let us add some extra information to the merged data: a:author is the same as f:auteur, and both identify a “Person”. “Person” is a term that a community may have already defined: a “Person” is uniquely identified by his/her name and, say, homepage, and the term can be used as a “category” for certain types of resources.

Use the extra knowledge…

… and start making even richer queries:
A user of dataset “F” can now query: “donne-moi la page d’accueil de l’auteur de l’original” (“give me the home page of the original’s author”). The information is not in the dataset for “Le Palais des miroirs” (yellow graph) alone, nor in the dataset for “The Glass Palace” (blue graph) alone, but it was made available by merging the two datasets and adding three simple extra statements as “glue”.
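A rough sketch of how such glue might work follows. It models only one of the extra statements (a:author is the same as f:auteur) as a table of equivalent relation names and leaves the “Person” category out. The node names, the `a:homepage` relation and the `example.org` URL are assumptions made for illustration.

```python
BOOK = "urn:isbn:000651409X"  # illustrative shared identifier

# The merged graph from both datasets (hypothetical node names and values).
merged = {
    (BOOK, "a:title", "The Glass Palace"),
    (BOOK, "a:author", "a:person1"),
    ("a:person1", "a:homepage", "http://example.org/author-home"),
    ("f:livre1", "f:titre", "Le Palais des miroirs"),
    ("f:livre1", "f:original", BOOK),
    ("f:livre1", "f:auteur", "f:personne1"),
}

# Glue: pairs of relation names declared to mean the same thing.
SAME_AS = {("a:author", "f:auteur")}


def equivalents(relation):
    """Return the relation together with every name declared equivalent."""
    names = {relation}
    for left, right in SAME_AS:
        if left == relation:
            names.add(right)
        if right == relation:
            names.add(left)
    return names


def objects(graph, subject, relation):
    """Follow `relation`, or any equivalent relation, from `subject`."""
    names = equivalents(relation)
    return {o for s, p, o in graph if s == subject and p in names}


def homepage_of_original_author(graph, french_book):
    """Answer "give me the home page of the original's author"."""
    pages = set()
    for original in objects(graph, french_book, "f:original"):
        # The user of dataset "F" asks with the French term; the glue lets
        # it match the a:author edge recorded in dataset "A".
        for author in objects(graph, original, "f:auteur"):
            pages |= objects(graph, author, "a:homepage")
    return pages
```

Without the entry in `SAME_AS`, the query stops at the original book, since that node has an `a:author` edge but no `f:auteur` edge.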

You can combine this data with other datasets. Using, e.g., the “Person” term, the dataset can be combined with other sources: for example, data in Wikipedia can be extracted using dedicated tools (e.g., DBpedia extracts the “infobox” information from Wikipedia).

What happened in the picture above via automatic means is done all the time, every day, by the users of the Web! The difference: a bit of extra rigor (e.g., naming the relationships), which is necessary so that machines can do this, too!

What was done?
We combined different datasets that are somewhere on the web, that are in different formats (MySQL, Excel sheets, XHTML, etc.), and that have different names for relations. We could combine the data because some URIs were identical (the ISBN code of the books in this case). We could add some simple additional information, using common terminologies that a community has produced. As a result, new relations could be found and retrieved. This could become even more powerful if we could add extra knowledge to the merged datasets: e.g., a full classification of various types of library data, geographical information, etc. This is where ontologies, extra rules, etc., come in (ontologies and rule sets can be relatively simple and small, or huge, or anything in between). Even more powerful queries can be asked as a result!

The Semantic Web provides technologies to make such integration possible! For example: an abstract model for the relational graphs: RDF (with different “serializations” in XML or text); extract RDF information from XML data: GRDDL; a query language adapted for the relational graphs: SPARQL; characterize the relationships, categorize resources: RDFS, OWL, SKOS, Rules (applications may choose among the different technologies); reuse of existing “ontologies” that others have produced (FOAF in our case).
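To give a feel for what a query language over relational graphs does, here is a toy pattern matcher. It is not SPARQL and does not use any RDF library; it only illustrates the idea of matching patterns with variables (names starting with `?`) against a set of triples. The graph contents and the `match` and `query` helpers are hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            # A variable: bind it, or check it against its earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant that does not match this triple.
            return None
    return extended


def query(graph, patterns):
    """Return every variable binding that satisfies all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# A small merged graph with illustrative identifiers.
graph = {
    ("urn:isbn:000651409X", "a:title", "The Glass Palace"),
    ("f:livre1", "f:titre", "Le Palais des miroirs"),
    ("f:livre1", "f:original", "urn:isbn:000651409X"),
}

# "Give me the title of the original" expressed as two linked patterns.
results = query(graph, [
    ("?book", "f:original", "?original"),
    ("?original", "a:title", "?title"),
])
```

The shared variable `?original` is what joins the two patterns, in the same way that the shared ISBN identifier joined the two datasets.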

If you are interested in Herman’s other public presentations, visit his page!
