October 8, 2008

Large Scale Uses of RDF

In a recent post ReadWriteWeb laments how little RDF is used in commercial applications. While the general point is valid, they miss quite a few large scale uses of RDF that I wanted to share with you:

1) The largest use of RDF in a real web setting: FOAF, and in particular its support by Google's Social Graph API.

2) XMP, the format Adobe uses to embed metadata in PDF (and other) files. It is most commonly stored as a subset of RDF. With all the Adobe tools, this is deployed on more than a hundred million computers. (A sketch of such a packet follows after this list.)

3) The use of RDF in Firefox, e.g. for the description and management of extensions. Just take a look at your profile directory, you'll see.
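
To make the XMP point concrete: an XMP packet is RDF/XML wrapped in a small XML envelope, roughly like this (a minimal sketch; real packets carry more properties and the xpacket processing instructions):

    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>
            <rdf:Alt>
              <rdf:li xml:lang="x-default">An Example Document</rdf:li>
            </rdf:Alt>
          </dc:title>
        </rdf:Description>
      </rdf:RDF>
    </x:xmpmeta>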

Labels:

October 2, 2008

Tackling the Curse Of Prepayment - Collaborative Knowledge Formalization Beyond Lightweight

We finally came round to writing up our ideas on how to overcome the motivation and incentive problems for collaborative heavyweight knowledge formalization:

This paper argues for collaborative incremental augmentation of text retrieval as an approach that can immediately show the benefits of relatively heavyweight knowledge formalization in the context of Web 2.0 style collaborative knowledge formalization. Such an approach helps to overcome the "Curse of Prepayment", i.e. the hitherto necessary very large initial investment in formalization tasks before any benefit of Semantic Web technologies becomes visible. Some initial ideas about the architecture of such a system are presented, and it is placed within the overall emerging trend of "people powered search".

You can read the entire paper here. I will present it at the INSEMTIVE workshop at this year's ISWC; if you're in Karlsruhe, it would be great to see you there!

Labels: , ,

August 3, 2008

1st Workshop on Incentives for the Semantic Web

[Photo: Schloss Gottesaue] At this year's ISWC there is the first workshop on incentives for the Semantic Web, about the very important question of how people can be motivated to create semantic data. You can still submit papers until the 8th of August*.

Program and organizing committee include a lot of cool people, e.g. Katharina Siorpaes (the creator of OntoGame and MyOntology), Denny Vrandecic (one of the people behind Semantic MediaWiki and the project leader of the Active IP), Andreas Schmidt (project leader of the Mature IP) and me :)

The picture to the left shows 'Schloss Gottesaue' in Karlsruhe, the location of this year's ISWC and hence of this workshop.

*: Sorry for writing about this so late, but I'm rather busy trying to finally finish my PhD thesis.

Labels:

May 12, 2008

Score One For Explicit Semantics

Powerset, the most hyped 'Semantic Search' engine of recent times (e.g. here, here and here), can now finally be tried out by us mere mortals at http://www.powerset.com/ (it only searches Wikipedia and Freebase so far).

The interesting thing is that while Powerset has always focused on their know-how in natural language processing and entity recognition (that 'other kind of semantic'), the top results for almost all queries I tried (e.g. 'China size', 'Rudi Studer', 'Germany Population') are sourced from Freebase - score one for (collaboratively evolved) explicit semantics, I'd say ;)

Labels:

March 5, 2008

How Much Is That Ontology?

Ontologies are expensive to build. By now that's known to everyone, and we have lots of people thinking about how they can justify the cost of building an ontology for their enterprise. Entirely the wrong question - most companies don't need an ontology at all; they should go and bug the data warehousing companies instead. Another misconception is that when people think of 'expensive ontologies' they think it's the formalization that makes them costly - no, for all meaningful ontologies it's creating the shared model of the domain that is expensive; writing it down doesn't add that much and might even help.

And I just realized that the 'machine' in 'shared machine understandable model of a domain' can mean so much more than just being able to use reasoners with it - just have a look at the project to create the International Barcode of Life (here at Wikipedia, or watch the TechTalk embedded below).

Labels:

March 2, 2008

Collaborative Knowledge Formalization Beyond Lightweight - Tackling the Curse of Prepayment; Part II

This is the second in a series of three posts - you may wish to start with the first.

'Knowledge' Does Not Equal 'Knowledge'

When the collaborative knowledge formalization community talks about 'knowledge', they mean something quite different from what most of the uppercase Semantic Web community or the knowledge based systems community mean. The collaborative knowledge formalization community thinks of taxonomies, thesauri, SKOS or structured data; the other communities think of Logic Programs, Description Logics, OWL or First Order Logic. Current collaborative knowledge formalization approaches just don't support the formalisms that are commonly associated with knowledge formalization.
Now you might argue that this must be so - that highly formal representations are just not well suited to being edited in the Web 2.0 style collaboration that is the topic of the collaborative knowledge formalization community. Indeed this may be the case, but it's surely worth trying. There is no definitive argument proving that highly formal representations cannot be edited in this way, and I believe that trying to bring knowledge formalization with more powerful and more complex formalisms to the crowd will at the very least bring advances in robust reasoning and usable knowledge formalization interfaces.

The Challenges Of Using More Heavyweight Formalisms

There are, however, many challenges entailed in moving to more heavyweight formalisms. Challenges such as:

  • Usability / Debuggability: Formalisms such as OWL or First Order Logic are harder to understand; in particular, errors are much harder to find.
  • Robustness: A single faulty statement added to a knowledge base with a million axioms may break everything (see the sketch after this list). Unless this problem is tackled, open collaborative knowledge formalization is impossible.
  • Performance and the Language Expressivity / Performance tradeoff: Current reasoners for representation languages such as OWL or FOL could not dream of supporting a continuously updated knowledge base of even a fraction of the size of Wikipedia; hence something would have to give: there would have to be restrictions on language expressivity, reasoning algorithms that do not achieve soundness and/or completeness, or languages that are not purely declarative would have to be used.
  • Mixed Formality: The kind of collaborative knowledge formalization approaches discussed here rely on incremental and partial formalization - hence the data store is never fully formalized and contains data at different levels of formality. Current reasoning approaches are not well suited to tackle this.
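
To make the robustness point concrete, here is a minimal sketch (in Turtle, with made-up example URIs) of how one careless contribution can poison a classical knowledge base:

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .

    # Long-standing, correct knowledge:
    ex:Penguin rdfs:subClassOf ex:Bird .
    ex:tux a ex:Penguin .

    # One careless addition: penguins both do and do not fly.
    ex:Penguin rdfs:subClassOf ex:FlyingAnimal .
    ex:Penguin rdfs:subClassOf [ owl:complementOf ex:FlyingAnimal ] .

    # The ontology is now inconsistent; under classical semantics a reasoner
    # may derive anything from it, so every query over the whole knowledge
    # base becomes unreliable - not just queries about penguins.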

The Curse of Prepayment - Again

All of the problems in the previous section are real and important - but there is one that trumps them all: what is the immediate benefit of formalizing even small parts of a data store? What do I get from spending time and/or money on bringing a part of my data store to a more formal level? Only once this question is answered can I decide on the tradeoffs needed to address the challenges described in the previous section.

Here the collaborative knowledge formalization community has the same problem as the wider Semantic Web community: "what exactly do I get in extra benefit from using OWL? And is this worth the effort?". I believe there is an answer to that question - but I'll describe it in the next installment of this series*.

* The first ever cliffhanger on this blog ;)

Labels: ,

February 20, 2008

Collaborative Knowledge Formalization Beyond Lightweight - Tackling the Curse of Prepayment; Part I

The Curse Of Prepayment
The Curse of Prepayment is also often referred to as the chicken-and-egg problem of Semantic Technologies: Semantic Technologies promise great functionality once a great amount of knowledge is formalized. And because knowledge formalization is difficult, often not well supported and cumbersome, you need to make a great up-front investment before you see any functionality. Now this insight is not new at all; there are already numerous approaches that try to address it. Of particular interest here are approaches that try to harness Web 2.0 ideas for this task. These Web 2.0 approaches to knowledge formalization can be roughly separated into two groups:

  1. The first group is based on the observation that lots of people are successfully creating structured data with tagging applications. These approaches then try to extend such systems with a bit more structure, a bit more formality. Our own SOBOLEO system, GroupMe, Int.ere.st, Bibsonomy and gnizr are examples of these kinds of systems.
  2. The second group of systems starts from the observation that people are spending large amounts of time creating semi-structured data in wikis. These systems then try to give people the tools and the support to create data with more structure, more formality. Semantic MediaWiki, Freebase, IkeWiki and MyOntology are examples of these kinds of systems.

Making Every Penny Count, Immediately
What makes these systems interesting, what gives them a chance to tackle the Curse of Prepayment are five closely related properties:

  • Simple: Formalization is simple; it can be done with little training, little effort and by more than just logic experts.
  • Collaborative: Formalization can be done jointly in a group - in this way the cost is spread over multiple people; the prepayment needed from each person is reduced.
  • Incremental: Not everything needs to be formalized at once; formalization can be done incrementally.
  • Partial: The tools can work with data stores that are only partly formalized, that contain data at different levels of formality.
  • Immediate: Formalized data can be used immediately and immediately brings some benefit to the user.

Together these five properties can be summed up as: "Making Every Penny Count, Immediately". There is an immediate benefit for formalizing even small parts; and because these systems are simple and collaborative, formalizing these small parts is relatively cheap.

The exact nature of this 'immediate benefit' differs between the systems mentioned above, for example it is:

  • Tables and less redundant data: The unique selling point of Semantic MediaWiki: as soon as just a few attribute values have been specified, these can be used to create tables and overview pages that before had to be maintained manually (see the sketch after this list).
  • Hierarchical Organization: In systems like SOBOLEO or Bibsonomy tags can be organized hierarchically; this allows for more effective maintenance of the tag repository as well as for more effective navigation and retrieval. This already works once a single such relation has been added.
  • Advanced Search: For example, in the SOBOLEO system adding just one synonym for a tag/concept will already improve the search experience: searching for this synonym will then also return the documents annotated with the topic.
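
As an illustration of the Semantic MediaWiki point, annotation and querying look roughly like this (wiki syntax of the SMW versions of that time; details vary between releases):

    On the page "Berlin":
    Berlin is located in [[located in::Germany]] and has
    [[population::3431700]] inhabitants.

    On any other page, an automatically maintained table:
    {{#ask: [[Category:City]] [[located in::Germany]]
     | ?population
    }}

As soon as a handful of pages carry such annotations, the query keeps the overview table up to date without any manual editing.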

This post is the first in a series of three posts, the next will focus on the challenges for collaborative knowledge formalization we encounter when moving beyond the very lightweight formalisms currently employed in the tools mentioned above. 

Labels: ,

December 26, 2007

Accessing SPARQL Endpoints from within Yahoo Pipes

Well, at least until the 'Semantic Web Pipes' are ready for prime time: a webservice that allows you to query SPARQL endpoints from within Yahoo Pipes. Look at the example below: it shows a simple pipe that takes a name as input, uses it to query the DBLP SPARQL endpoint and returns the result as a web page, JSON and RSS. You can try the pipe here. Surely getting an RSS feed for the publications from DBLP could have been achieved without RDF-SPARQL-Pipes; however, we can now access all kinds of SPARQL endpoints and have the entire functionality of Yahoo Pipes at hand to combine them with other (possibly non-SemWeb) content.

[Screenshot of the example pipe in the Yahoo Pipes editor]

Let me quickly explain the pipe: the 'Please enter the name' element defines the 'name' input to the pipe. The 'String Builder' block uses this name to build a SPARQL query, and the 'Item Builder' combines the query and the endpoint URL (http://www4.wiwiss.fu-berlin.de/dblp/sparql, in this case) into an item that will be sent to the web service. The web service (which lives at http://soboleo.fzi.de:8080/PipesSparqlr/sparql [1]) takes the query and endpoint URL, sends the query to the endpoint and translates the answer to a simpler JSON format[2]. Any error encountered is simply returned instead of a result - so you are able to see it in the debugger view of Yahoo Pipes. The last operator, the Regex element, removes everything but plain characters from the item's title - sadly that's necessary because somewhere along the line the character encodings get mixed up, and this trips up Yahoo Pipes so badly that no result is returned as soon as one of the titles contains something like a German 'ä' or 'ö'. I'll try to fix this someday. The source code for the webservice (all ~100 lines of it ;) ) is available here - feel free to use it any way you like. You'll need the JSON library and Java 1.5+ to compile, and some servlet container (I use Tomcat 5.5.something) to run it.
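
For illustration, the query the String Builder assembles looks roughly like this (a sketch only - the exact predicates depend on the vocabulary the DBLP endpoint exposes; FOAF and Dublin Core terms are assumed here):

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>

    SELECT ?title WHERE {
      ?author foaf:name "Rudi Studer" .   # the 'name' input is spliced in here
      ?pub    dc:creator ?author ;
              dc:title   ?title .
    }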

[1]: Feel free to use this webservice but don't count on it staying there forever.
[2]: Just passing through the SPARQL query result XML caused problems with Yahoo Pipes which expects either JSON or RSS.

Labels: , ,

December 9, 2007

Defining How An Application Can Be Semantic

There seems to be quite a bit of confusion about the different meanings "semantics" can have in computer science - as you can see, for example, from ReadWriteWeb's 10 Semantic Apps to Watch; an interesting article, but one that starts with a nonsensical classification of semantic applications into 'top down' and 'bottom up'. So - an attempt to give a better classification of the different ways in which an application can be 'Semantic'.

A Semantic application is one that tries to improve some computing task by explicitly considering the meaning and context of the symbols it is manipulating. This is still very nonspecific, but will become clearer when we consider the four ways in which this can be instantiated:

  1. Semantics as in "The Semantic search engine Powerset". These approaches use natural language processing techniques to give context to words in texts; e.g. to understand that the string "SAP" in a document refers to the company and not to, say, tree sap.
  2. Semantics as in "The lowercase semantic web". These approaches try to build the web of data by using machine understandable markup and establishing information interchange formats; e.g. by embedding <a href="http://technorati.com/tag/SemanticWeb" rel="tag"> in this page I can associate it with the topic Semantic Web and in this way give this document some context. I've used microformats for this example, but many applications of RDF are semantic in this sense.
  3. Semantics as in "The Semantics of OWL 1.1". These approaches define the meaning of symbols by associating them with a mathematical theory that exactly defines what follows from any collection of symbols.
  4. Semantics as in "Semantic Portal". These approaches use technologies that allow data to be represented flexibly, without a fixed schema; technologies such as RDF that make it easy to represent diverse data that is interconnected in myriad ways. Twine is an example of an application that's semantic (mainly) in this sense, as are TripIt and Freebase.

(and yes, many applications are semantic in more than one sense).

Labels:

December 8, 2007

CfP - Social Aspects of the Web

[Photo: Grossglockner Hochalpenstrasse] I thought this might interest readers of this blog: the 2nd workshop on social aspects of the web - SAW 2008 (for which I happen to be on the PC) is looking for contributions, to be submitted by the 12th of January. Topics include privacy in the social web, communities on the web, large scale social web mining and empirical studies, social software on the Semantic Web... the full CfP is available here. The workshop is held in conjunction with the 11th International Conference on Business Information Systems (BIS2008), which - colleagues who've been there tell me - is a good conference.

The picture to the right is from the scenic Grossglockner Hochalpenstrasse - which is close to Innsbruck and hence to the location of the workshop and conference. The most beautiful mountain road I've ever driven.

Labels:

December 2, 2007

We Need A New RDF Schema Language

More or less all people with less than a year of Semantic Web experience misunderstand RDF(S). They try to say that an animal can have an attribute number_of_legs of type (positive) integer, and end up saying that everything that has a positive number of legs is of type animal. The common response to such mistakes is to lecture them about logics, the open world assumption and open architecture - when it should be to go and design a schema language that conforms to their expectations. I'm not saying that RDF(S) should be discarded, but that there is a clear need for another language, an RDF DTD, that allows one to restrict what an RDF document should look like. In a future of automatic interoperability through formalized background knowledge this may not be needed; but with the current state of the Semantic Web - where almost all applications rely on RDF data conforming to some schema in this DTD sense - such a language seems to be urgently needed.
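
To spell the misunderstanding out, here is a minimal Turtle sketch (example URIs made up):

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    ex:number_of_legs rdfs:domain ex:Animal ;
                      rdfs:range  xsd:positiveInteger .

    # What the newcomer intends: "an Animal MAY have a number_of_legs,
    # and its value MUST be a positive integer" - a constraint.
    # What RDFS actually says: anything with an ex:number_of_legs is
    # INFERRED to be an ex:Animal, and values are inferred to be
    # positive integers - no document is ever rejected.
    ex:table ex:number_of_legs 4 .   # quietly makes ex:table an ex:Animal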

Labels:

October 27, 2007

Mind the Web

This paper argues that a significant part of today's Semantic Web research is still dominated by ideas from centralized databases. Furthermore, the main thread of reasoning research focusses on approaches that can never scale to anything similar to the Web. Starting from these negative observations we argue that emergent semantics and ontology maturing are more suitable approaches for dealing with ontologies on the Web. Similarly, a few approaches for more Semantic Web appropriate reasoning exist, but are in dire need of realistic use cases.

A paper by me, Andreas Abecker, Denny Vrandecic, Imen Borgi, Simone Braun and Andreas Schmidt; Denny will present it at the workshop "New forms of reasoning for the Semantic Web: scalable, tolerant and dynamic". The entire paper is here.

The paper reflects my frustration with the fact that a large part of Semantic Web research (in particular in Europe) is concerned with something like "Semantic Databases" and is not even trying to tackle the challenges unique to the Semantic Web. The paper had a bit of a strange creation process and ended up being not very controversial, but I think it's still useful, particularly since it collects and orders references to innovative work related to the Semantic Web (as opposed to Semantic Databases).

Labels: ,

October 3, 2007

Semantic Web Agents A Reality?

Even without the Semantic Web? From the Los Angeles Times:

But I was surprised at how much they could do. Once I had registered at the website, I uploaded some personal data, such as my frequent-flier account numbers, and the names and phone numbers of my dentist, hairdresser and doctor. If I wanted an assistant to make purchases on my behalf, I could also load credit-card information in encrypted form.
Sitting on my couch at 1 a.m., I dashed off a flurry of requests via e-mail:
* Contact all my frequent-flier airlines and inform them that I had recently changed my last name and wanted my accounts updated.
* Schedule a teeth cleaning for sometime in the next few weeks, any time before 9 a.m.
* Make an appointment for a haircut.
* Find out how much an airline ticket to Las Vegas would cost on Labor Day weekend.
Within 30 minutes, there was an e-mail in my in box saying that my requests were being processed. By noon the next day, [they] had sent a list of flight options, a confirmed dental appointment and a date for my haircut.

It is all yours for less than $1 per task ... alas, all tasks are performed by people in low income countries and not by inference engines ... ahh well, at least these "agents" are less dependent on perfect markup.

Labels:

August 16, 2007

Metrification Matters

A nice presentation on the value and difficulties of having a common exchange language (for measurements in this case). Well presented, including such interesting things as: why the metric system is actually British, its relation to the search for a universal language, and what the first "metric car" was.

The video is embedded below or at Google Video here.

Labels:

August 4, 2007

Slides of ESTC Keynotes

Had this lying around for a while: Richard Benjamins' blog has the slides of the invited speakers at the first European Semantic Technologies Conference - the speakers were Frank van Harmelen, Mark Greaves, Benjamin Grosof, Ora Lassila, Dave Pearson (Oracle) and Dr Susie Stephens - so quite a lineup. I particularly enjoyed browsing the slides from the first three - all slides can be found here. Below is a nice graph from Frank van Harmelen's slides (he attributes it to Dieter Fensel):

Labels:

July 29, 2007

Business Rules Management vs. Semantic Web Rules

Last week I attended a one day introductory seminar by ILOG about their business rule management system (BRMS). I did it to get a better feel for the differences and similarities between these systems and logic programming (LP) / Semantic Web rule (SWR) systems - beyond the name and the idea of somehow representing knowledge as "if x then y" structures. So here's what I learned:

  • [Please see update below] Atomic Rules: The biggest difference really is that rules in BRMSs do not interact automatically. If you have an LP or SWR system with the knowledge base:
          Rule A: IF q(?A) THEN r(?A)
          Rule B: IF p(?A) THEN q(?A)
          p(x)
    and you ask the query r(?A), you get the answer ?A=x: an LP/SWR system will automatically realize that it has to first apply rule B and then rule A. A BRMS will evaluate the rules in isolation - rule B will fire but rule A will not, because rule A has no access to the results of rule B. A BRMS can replicate the behavior of the LP/SWR rule base through the use of rule flows. A rule flow is a description of a sequence of rule sets that have to be applied. In the example above we could define that first rule B is applied and then rule A - but note that, unlike in LP/SWR systems, we have to do this manually. In real life you normally wouldn't specify single rules in rule flows, but rather something like: first apply syntactic input check rules, then apply rules to create a credit rating, then find the appropriate department to route the document to, etc.
    It's easy to say that LP/SWR systems are superior because of their more powerful formalisms, but remember the downside: they are harder to understand and implement, slower to evaluate and harder to debug. There is also a certain beauty to having procedural problem solving knowledge explicitly represented and not hidden in rules. I'm hesitant to claim that a BRMS would become better by using a more powerful SWR system - probably it would, but it would require hard work on the user interfaces.
     
  • Multi Paradigm: The textbook way to use a BRMS seems to be to start with a business process and to externalize complex decision points from that into decision services that are realized with rules organized in rule flows. And the rules in the rule flow can easily access other programs, webservices and objects. Hence rules are embedded in a multi-paradigm context and restricted to only the subset of tasks they are good at - much more so than in the usual discussion of LP/SWR systems.
  • Maintenance vs. sharing: The main advantage of BRMSs cited in the ILOG marketing material seems to be ease of maintenance - the assumption that business logic made explicit in rules is easier and cheaper to adapt and maintain. So far there seems to be relatively little interest in sharing or selling these rules. This stands in contrast to the SWR developments that focus on sharing.
  • Semantics: Well, a formal semantics is a central topic for the LP and even more so for the SWR community; in contrast, the BRMS people seem to be not too concerned with that and just utilize some kind of simplified forward chaining.

And the ILOG system is quite a bit more mature than any LP or SWR system I know - but then, that is to be expected from a relatively large company that has been building these systems for years.

Update: I've removed the entire first part of this post - it represented what the ILOG presenter said; however, upon reading a bit more I realized that it just isn't true: ILOG (like other BRMS manufacturers) is quite proud of their Rete-based forward chaining inference engine. So, that's still a different approach (forward chaining vs. declarative semantics + different inference algorithms in the SW world) but not as different as I initially claimed. Sorry.

Labels:

July 19, 2007

On The Inevitability Of The Semantic Web

The Semantic Web, by whatever name it comes to be called, is inevitable.

This is Michael K. Bergman's statement in the article "Structure Paves the Way to the Semantic Web". And I'm only using his statement as an easily accessible example here - similar statements have been repeated thousands of times. So - is this so? Is the Semantic Web inevitable?

Well, it's easy to be certain if only you are sufficiently vague. In general, people making this statement do not give a definition of the Semantic Web - Michael K. Bergman being no exception. This way they can generalize the term until it's almost without meaning and are not really making any prediction. Humankind is going to continue to develop better tools to organize information, and these tools will somehow grow on/out of the current Internet? Well, duh, I'm not going to argue with that.

But: isn't it part of the Semantic Web vision that this will be achieved by publishing machine understandable data in a distributed fashion similar to the current WWW? Still a very vague statement, but one that allows for alternative visions: we could also see better information organization based on natural language processing technologies or on a centralized everything-is-stored-at-Google model. So, no; this Semantic Web seems likely, but is not inevitable.

And the traditional uppercase Semantic Web vision even states that this will be achieved by building one distributed global knowledge based system (KBS) based on traditional knowledge representation techniques - by far eclipsing all previous KBSs in scale and diversity of content. Stated this way, the only thing inevitable about the Semantic Web is its failure.

So then, is the Semantic Web inevitable? Depends, please define Semantic Web ;)

Labels:

July 18, 2007

John F. Sowa on Fads and Fallacies about Logic

In a recent IEEE Intelligent Systems John F. Sowa wrote an interesting article that should be read by people interested in the logical side of the Semantic Web. Two of the quotes I particularly liked:

[...] computational complexity is important. But complexity is a property of algorithms, and only indirectly a property of problems, since most problems can be solved by different algorithms with different complexity. The language in which a problem is stated has no effect on complexity. Reducing the expressive power of a logic does not solve any problems faster; its only effect is to make some problems impossible to state.

and on Language and Logic:

What makes formal logic hard to use is its rigidity and its limited set of operators. Natural languages are richer, more expressive, and much more flexible. That flexibility permits vagueness, which some logicians consider a serious flaw, but a precise statement on any topic is impossible until all the details are determined. As a result, formal logic can only express the final result of a lengthy process of analysis and design. Natural language, however, can express every step from the earliest hunch or tentative suggestion to the finished specification.

In short, there are two equal and opposite fallacies about language and logic:  at one extreme, logic is unnatural and irrelevant; at the opposite extreme, language is incurably vague. A more balanced view must recognize the virtues of both:  logic is the basis for precise reasoning in every natural language; but without vagueness in the early stages of a project, it would be impossible to explore all the design options.

The entire article is available for free as a "preprint" here.

Labels: ,

June 29, 2007

Rudi Studer on the Semantic Web

Another Google TechTalk about the Semantic Web, this time by Rudi Studer. In this talk he introduces Semantic MediaWiki, talks about pattern based information extraction and presents the ORAKEL system that lets users ask complex queries in natural language.

It's a good talk, and (unless you are very familiar with these systems already) you'll not regret watching it.

Labels:

June 28, 2007

Defining Folksonomy

Recently I skimmed the (interesting) proceedings of the "Bridging the Gap between Semantic Web and Web 2.0" workshop. What surprised me was that, for all the papers talking about Folksonomies, there was no convincing definition of 'Folksonomy'. So I thought I'd share one with you:

A Folksonomy is the computer stored record of the use of labels by many people.

This might be a bit surprising at first, but I think it'll become clearer when I discuss two often used candidate definitions:

First, Wikipedia: "A Folksonomy is a user generated taxonomy used to categorize and retrieve web content such as Web pages, photographs and Web links, using open-ended labels called tags". The problem with this definition (and all that argue a similarity to taxonomies) is that the most salient feature of a taxonomy is the explicit representation of a hierarchical structure - and that's something a Folksonomy lacks. So in this sense a Folksonomy is more like a controlled vocabulary - except that it isn't controlled ... so that leads nowhere.

Second, based on Peter Mika's groundbreaking "Ontologies are us" paper, some people say that a Folksonomy is a tripartite graph of persons, concepts and documents. There's nothing wrong with that, but for me it's still an incomplete definition because it does not try to capture what is represented by this graph; it only talks about the very basic structure.

In the end a Folksonomy really is only a computer accessible sample of the use of language to name things. But then - naming things is at the core of language, conceptualizations and ontologies, and having a simple way to observe it (as imperfect as it may be) is no small thing!

Labels: ,

June 25, 2007

The Limits Of SPARQL

SPARQL is an important step forward, a valuable tool for the handling of RDF stores - I'll not dispute that. However, SPARQL has also been hailed as the query language for the Semantic Web, the solution to the problems of accessing Semantic Web data - and that it isn't. I'll tell you why:

(1) SPARQL Interfaces and Computational Cost - Lots of websites on the web today are offering some kind of interface to access their data; almost none*, however, are offering SQL interfaces. The most important reason for this is probably the fact that SQL query evaluation can impose a serious and hard to control computational burden on the servers of the company supplying the data. SPARQL isn't changing this - in fact, SPARQL even encourages more complex queries (assuming the queries are evaluated against a relational database). So it is hard to see why companies that aren't offering an SQL interface will start doing so with SPARQL.

(2) The Problems of Large Scale Federated Search - A Semantic Web search engine gets a query, does a bit of query processing, sends queries to the SPARQL endpoints it knows, aggregates and reasons over the answers it gets and finally returns a result for the initial query. That's federated search in a nutshell - and it isn't going to work; not at web scale and not simply. The problems with this approach are response time and query routing. Response time, because this Semantic Web search engine is going to be SLOW - its speed limited by the slowest SPARQL endpoint it has to access (plus the fact that it has to do a lot of network access). Query routing is a big challenge because the Semantic Web search engine would need to be very specific about which SPARQL endpoints it asks for an answer to a particular query - if it isn't, it is going to overwhelm the endpoints with traffic very quickly. Or what would you say if your site's SPARQL interface suddenly got requests for 1% of all Google searches?** - possibly without reimbursement.

(3) Not all triples are equal - SPARQL knows two kinds of triples: those that exist and those that don't. Answering queries over diverse RDF data created in an uncontrolled, distributed way, however, will also need some weights on the triples, based on how often they have been stated by whom. Assume that there are 5000 sources stating (USA is_adjacent Canada) and 4 stating (USA  is_adjacent Uzbekistan) - do you then really want to treat these two triples equally?
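
Standard RDF reification would at least let such counts be attached to statements, even though no current query language exploits them; a sketch in Turtle (ex:supportCount is a made-up property):

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ex:  <http://example.org/> .

    ex:stmt1 a rdf:Statement ;
        rdf:subject   ex:USA ;
        rdf:predicate ex:is_adjacent ;
        rdf:object    ex:Canada ;
        ex:supportCount 5000 .     # hypothetical: number of sources asserting this

    ex:stmt2 a rdf:Statement ;
        rdf:subject   ex:USA ;
        rdf:predicate ex:is_adjacent ;
        rdf:object    ex:Uzbekistan ;
        ex:supportCount 4 .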

(4) Pedigree matters*** - As I understand it, SPARQL assumes one global graph of all RDF statements. This is problematic because it allows even just one malicious file to "infect" everything. In traditional retrieval, when you have one malicious file you'll have one bullshit result and n-1 normal results. Assuming one global graph, just one file can make all results unreliable.

Conclusion - So in the end there is reason to doubt that many websites will offer SPARQL interfaces (1), and even if they do, it will be difficult to use them to answer queries (2). Assuming these problems could be overcome, SPARQL still has a model that is purely boolean (3) and that assumes one global graph (4) - both notions inappropriate for web scale query answering.

And yes - none of this is entirely new, and none of it is a 100% certain showstopper; it's possible that all these problems could be overcome. But all this should serve as a reminder that SPARQL isn't the Semantic Web query language, at least not yet.


*: Facebook being the one notable exception, it offers an interface using a powerful query language (although it isn't SQL).

**: In fact publicly available data tells us that this would be only roughly 24 queries/second, but that number is almost surely much too small.

***: Yea, I know, a more appropriate title could be provenance or lineage, but I wanted to emphasize a slight difference in the concepts - that I'm not interested in where each statement came from as much as which statements stood together.

Labels:

June 19, 2007

Google Tech Talk on The Semantic Web

Stephan Decker, Eyal Oren and Sebastian Kruk give a Google Tech Talk on the Semantic Web (or actually on something like an "open structured data web" or the lowercase semantic web - they ignore reasoning and logics completely*). Stephan talks about general stuff and SIOC, Eyal about ActiveRDF and faceted browsing, and Sebastian Kruk about digital libraries. Quite interesting, but if you've been to Semantic Web conferences recently, you'll already know most of what is presented. Sadly a bit of the most interesting part - Stephan Decker's talk in the beginning - is missing :(

*: intentionally, as Stephan Decker details during the questions.

Labels:

June 11, 2007

The Semantic Web Programming Service Provider

(some thoughts while doing a mental retrospective of the European Semantic Web Conference)

  • It seems obvious that there is an increasing trend towards the global integration of structured data. In my mind there is no doubt that this integration will happen to an ever larger degree over the next years (and has been over the past years).
  • It is unclear what kind of integration this will be. Whether it will be a closed, centralized approach (as exemplified by Google Base),  centralized but open (like Freebase) or decentralized and open (the semantic web).
  • Assuming that the semantic web way is the right way, I'm not sure whether RDF is the right data model to base this on (yes, we could try to do it only with XML) - but it sure looks like it's worth trying.
  • For the semantic web to have any chance to take off, we need a semantic web programming service provider.

Programming Service Provider (PSP for short): the logical extension of "Application Service Provider" - instead of delivering applications, it offers the infrastructure to build, run and deploy applications. Ning and Yahoo Pipes are two existing programming service providers. This model of PSPs is very important for the semantic web because its decentralized nature imposes a burden on anyone who wants to build a semantic web application - they have to worry about network latency, crawling, keeping an index up to date etc. PSPs can take care of these problems.

So, what does a Programming Service Provider for the Semantic Web contain?

  • First of all: a local and (reasonably) up to date copy of the entire Semantic Web. This local copy needs to be ranked and as spam-free as possible.
  • An API to access this data (in particular this includes a way to discover URIs based on lexical resources and a way to discover subgraphs that contain information about a particular URI).
  • An environment to create applications that use this API (although access should also be possible remotely) - similar for example to the Yahoo Pipes editor.

The building blocks for this vision are starting to fall into place - PingTheSemanticWeb as a way to keep an index up to date, the Sindice lookup index presented at the ESWC or the recent DERI work about joins in very large RDF stores and the SWSE search engine ... But only if it all comes together will the semantic web have a chance to compete against Google Base/ Freebase, because only then will it become simple to write applications that use the semantic web.

Sadly I'm currently not in a position to really contribute much work towards this vision - but I'll try.

But this post wouldn't be complete without a short discussion about what has no place in this vision.

  • There is no place for heavyweight ontologies (or rules, for that matter). Sure, these technologies have their merits, an important role to play, and imho will become important parts of database technology. However - there are no inference technologies available or even on the horizon that can deal with web scale data; that can deal with the size, rate of change and semantic heterogeneity to be expected on the web. This is true for rules just as much as for ontologies. It is an interesting research challenge to try to develop new kinds of inference mechanisms that some day could - but for now we don't even have an agreed upon model of what should be inferred, much less can we compute it in reasonable time. And even worse - there really is no compelling use case (at least none that is not AI-complete).
  • And Semantic Web Services and NLP .... I'll leave that for another time - now I need to figure out what happens BETWEEN 8:00 AM  AND 9:00 AM :)

Labels:

FOAF

Better late than never: got myself a small FOAF file. In the process I also got myself a URI (for now I'm represented by http://vzach.de/foaf.rdf#vpz).
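
For readers who haven't seen one: the core of such a file looks roughly like this (a hypothetical minimal sketch in Turtle - the real file is RDF/XML and its property values differ):

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <http://vzach.de/foaf.rdf#vpz>
        a foaf:Person ;                      # the URI that represents me
        foaf:homepage <http://vzach.de/> ;   # hypothetical value, for illustration
        foaf:knows [ a foaf:Person ;
                     foaf:name "Some Colleague" ] .   # hypothetical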

My FOAF file is quite minimal in information content and will stay this way - since as much as I believe in the publication of machine understandable data on the web, I'm also a fan of privacy; of trying to preserve some private information in the digital age (although I'm afraid this may be a battle that is all but lost).

Update: Alright - had another look at the FOAF specification ... I wasn't aware that there are quite a few more work-related / non-privacy-relevant attributes and relations that I haven't filled in - so I'll grow my FOAF file in due time :)

Labels: ,

June 8, 2007

The Perils of Tagging

If you try to find pictures of the European Semantic Web Conference on Flickr using the tag eswc2007, all you currently find are hundreds of pictures from the Electronic Sports World Cup 2007 - an event sharing the same acronym and hence the same tag. Oh, the irony...

Labels: , ,

May 23, 2007

Semantic Web Bibliography

While writing a position paper for this year's ISWC we collected a nice selection of what I would consider to be the major high level Semantic Web papers. Only almost-philosophical papers, few technical details. I thought that maybe (a part of) this list could also be helpful to others - as subjective and incomplete as it is. Most links don't go directly to a PDF but should help everyone (even without access to electronic journals) to get the papers with at most 4 clicks :)

Underlying Ideas:
  Allen Newell: The Knowledge Level (1980) - THE paper about defining knowledge. Ever wondered why people say that "knowledge" cannot be stored? Here you find the answer. 
  Thomas R. Gruber: Towards Principles for the Design of Ontologies used for Knowledge Sharing (1995) - An ontology is an explicit specification of a conceptualization; sound familiar? This is the paper it's from. However, if you're really interested in understanding ontologies, you should at least read Nicola Guarino: Formal Ontology and Information Systems (1998) as well.

The Idea:
  Berners-Lee, Hendler and Lassila - The Semantic Web (2001). And obviously Shadbolt, Hall and Berners-Lee - The Semantic Web Revisited (2006). I also recommend reading Frank van Harmelen: How the Semantic Web will change KR: challenges and opportunities for a new research agenda (2002) for a description of what sets the Semantic Web apart from previous KRR research. And Antoniou, van Harmelen: Web Ontology Language: OWL (2004). And, on the current state, the (imho a bit too optimistic) van Harmelen: Semantic Web Research anno 2006: main streams, popular fallacies, current status and future challenges (2006).

Ok, here could now come papers on all the topics from matching and learning to Semantic Web Services ... but maybe some other time; I'll only include some with relevance to the Semantic Web idea as a whole.

On the social dimension of the Semantic Web: Peter Mika: Ontologies are Us: A Unified Model Of Social Networks and Semantics (2005).
Real bottom up Semantic Web: Karl Aberer et al.: Emergent Semantics Principles and Issues (2004)
On Ontologies and Change: Natalya Noy and Michel Klein: "Ontology Evolution, Not the Same as Schema Evolution" (2004) and (mostly on change) Martin Hepp: Possible Ontologies, How Reality Constrains the Development of Relevant Ontologies (2007)

Fundamental (and justified) criticism of the current state of Semantic Web research can be found in Fensel, van Harmelen: Unifying Reasoning and Search to Web Scale (2007) and in Kalfoglou et al.: On the Emergent Semantic Web and Overlooked Issues (2006).

On the issue of Logic Programs for the Semantic Web I recommend Bry, Marchiori: Ten Theses on Logic Languages for the Semantic Web (2005) and Kifer et al.: A Realistic Architecture for the Semantic Web (2005) on the pro-LP side. And Horrocks et al.: Semantic Web Architecture: Stack or Two Towers? and Motik et al.: Can OWL and Logic Programming Live Together Happily Ever After? (2006) on the contra side.

And as the last topic for this already very long post: there are very interesting ideas surrounding the issue of massive semantic heterogeneity (millions of partly overlapping schemas/ontologies), not addressed by mainstream Semantic Web research, in these two papers: Madhavan et al.: Web-scale Data Integration: You can only afford to Pay As You Go (2007) and Lopez et al.: PowerMap: Mapping the Real Semantic Web on the Fly (2006).

Labels:

April 14, 2007

Ontologies And Cost

Furthermore the authors are not aware of any proof that completely representing a domain is a cost efficient solution to any business problem.

Just a sentence I wrote in a publication I'm working on. Re-reading it, I realized that this is actually a shocking statement - does it mean that all old fashioned attempts to build ontologies are wasteful? Or does this proof exist and I just don't know it?

Actually I think this only applies to really old fashioned attempts to build an ontology - those that actually somehow strive for completeness in representing a domain and lose track of the task the ontology is supposed to be used for. In general this kind of thought is just another reminder that the "an ontology is a formal conceptualization of a domain" definition is incomplete - any actual ontology is an artifact created for some purpose. Forget that and you'll never finish modeling, ending up with something that you can neither verify nor validate.

Labels:

April 13, 2007

Search Is Irrelevant

There was an annoyingly imprecise piece over at ReadWriteWeb about Google as "The Ultimate Money Making Machine" ... but thinking about it brought me to two conclusions:

1) If you look at what really matters - money - then Google is first and foremost an advertisement brokering company. By "Our goal is to organize the world's information" Google actually means: "Our goal is to place ads next to the world's information" ;)

2) Hence any challenger to Google will most likely not be a better search engine but a better ad broker. And if I may speculate a bit more: this challenger will not succeed by challenging Google on "traditional" AdSense-like ads, but by brokering ads in games, virtual worlds, to mobile phones (based on location), in internet video ... or by better integrating old ad channels like print, TV ads, product placement ...

Of course, Google knows that - that's why they bought YouTube and a company that specializes in in-game advertisement; that's why they experiment with TV ads and ads in print. But unlike with "traditional web ads" they don't dominate the market in these areas (yet), and hence there's a much better chance for competitors.

But well, all this brings us back to the question of Semantic Web advertisements ;)

Labels: ,

April 12, 2007

Sitemaps and the Semantic Web

Ask.com, Google, Microsoft and Yahoo! have announced support for a new feature of the Sitemap standard - you can now link these XML files (which describe a site's structure) from the robots.txt file and don't need to manually alert every search engine to where your file is. This reminded me of the "Why the Semantic Web will fail" debate a few weeks back - remember the blog post that claimed that, I quote: "The Semantic Web will never work because it depends on businesses working together, on them cooperating". This new cooperation on the Sitemap standard is just another example of how competing businesses are cooperating, even creating metadata standards.
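
Concretely, the autodiscovery works by adding a single line to your robots.txt (example URL made up):

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/sitemap.xml

Every crawler that reads the robots.txt then finds the sitemap on its own - no per-search-engine registration needed.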

Revisiting this blog, however, I found that its author, Stephen Downes, has actually addressed my criticism (among others) in a second blog post. I had said:

RSS, ATOM and iCal are examples for data standards jointly supported by different companies - there's just no reason to assume that this list cannot grow.

And he replies:

Neither RSS not Atom are RDF (except for RSS 1.0, which has a usage of about 3 percent). I also posted figures on my website just this week showing that iCal usage is something like 7 percent. iCal isn't RDF either - hence the need for a converter http://torrez.us/ics2rdf/ and the resulting proliferation of RDF versions of iCal, none of them official. Meanwhile, neither Google nor Outlook are based in iCal.

Which I don't accept as a rebuttal of my argument. He said: "can't work, businesses don't cooperate, don't come up with joint standards"; I said: "they do, look at these standards"; he says: "that's not RDF" - but that's not the issue. I personally don't see the vision of the Semantic Web as restricted to "only RDF"; I'm fine with Semantic Web applications built with XML/RSS/ATOM. For me RSS is conceptually a pure-bred Semantic Web application - whether it's built on RDF or not. And even if I were to grant the point that these companies have yet to agree on a "real" Semantic Web standard, he then has to argue that "businesses do cooperate but they would never do it on Semantic Web standards" ... for which I don't see any arguments right now.

And about iCal and Outlook/Google: Well, the Outlook I use also displays the data from my Google Calendars - and the integration is done with iCal. Neither of these applications may be "based in iCal" - but they surely support it.

Sadly he didn't post a link to the analysis that led to the "iCal usage is something like 7%" statement and I couldn't find it - so I'll probably never know the answer to the "7% of what?" question (of all Internet traffic? of applications aggregating calendar data?) ;)

Labels:

Ban the Semantic Web Layer Cake!

The good old Semantic Web layer cake ... it has served us quite well by giving some illustration to the un-illustratable. But surprisingly there seem to be people who actually take it literally, and thereby it is starting to cause more harm than good - for this reason it should be retired, never to be shown again.

Here it is, in all its glory, in the most current version I could find (from Jim Hendler's "Dark Side" slides).

[Image: the Semantic Web layer cake]

So, how can these innocent looking boxes hurt? Let me enumerate a couple of ways:

  • The layer cake gave us unreadable serializations. The ugly RDF/XML was only the start, to be followed by the even worse serialization of OWL in RDF in XML - which even hints at an RDF-OWL compatibility that isn't there. We have to stop this before someone comes up with a serialization of Prolog in RIF in RDF in XML!
  • The idea that "Trust" is the final and last stage* to be added on top led to the SW community's ignorance of trust issues - even though this is one of the most important questions for the future web.
  • The idea that we first have to build this entire "protocol stack" before real Semantic Web applications can be built was one of the reasons the Semantic Web community became so academic and self-centered.
  • The layer cake makes it hard to bring the lowercase semantic web developments into the SW mainstream: this would require the SW community to accept that you can in fact have meaningful and helpful semantic web applications just on top of the two lowest layers.
  • Finally, the layer cake facilitated the hijacking of Semantic Web research by the old fashioned logic and knowledge representation communities/ideas (which in turn led to formalisms that, for all we know, are too slow and too brittle to work at web scale).


*: This "UI" layer is relatively recent, trust used to be the uppermost layer for most of the time.

Labels:

April 5, 2007

Open Pipes

Google Video has a talk about Yahoo Pipes. In general it's a nice and gentle introduction to Pipes; I found four tidbits of information, mainly about the future plans for Pipes, very interesting:

  • In the near future they plan to allow you to add your own webservices as modules.
  • They are looking into ways to allow you to safely build Pipes on your private, password-protected data (such as emails, calendars etc.), although it sounded like this is still quite a bit off.
  • Yahoo Pipes is internally built on top of XML; it's agnostic as to whether that's RSS or RDF/XML. In the beginning they put the focus on RSS to make the tools easier to understand. I'm not sure whether this is good news - processing RDF as XML really is neither easy nor powerful (compared to processing it as RDF).
  • In general they struggle with the Power vs. Simplicity tradeoff; for example that led them to postpone the release of a database like "join" module for XML files.

Sadly they did not speak about the business case behind Yahoo Pipes, how Yahoo plans to earn money with this service.

Labels: , ,

March 21, 2007

Semantic Web Advertisements?

This post makes a pretty weak argument for why the Semantic Web will fail. One of its main arguments is that it relies on cooperation between businesses that just isn't going to happen. Others have already pointed out that this statement is clearly false (here and here), and I just want to point to RSS, ATOM, iCal and Sitemap as examples of data standards jointly supported by different companies - there's just no reason to assume that this list cannot grow.

However, I do agree that the Semantic Web community too often just naively assumes that everyone "wants to share"; that it ignores the business cases. For example there is no major work on the question of how I can monetize an investment in ontology building - even though everyone agrees that a formal ontology is difficult and expensive to build. Wikipedia-like approaches will only get us so far - most metadata will only get created if the creator sees a monetary advantage in doing so. Finding that advantage is more difficult on the Semantic Web because the data will most probably be used by a computer agent - so I can't fund the data's creation by placing ads alongside it.

So - what is the equivalent funding mechanism to ads that works for the Semantic Web? Or - alternatively - how can we place ads on the Semantic Web?

Readers interested in this question may also want to look at my Semantic Announcement Sharing paper from 2004. There we take a holistic look at the factors that made RSS a success - including the motivation of people to contribute the data. We then use these factors to identify a different domain and to create a metadata standard for it (the sharing of information about events). Back then I lacked the time to follow through and actually promote this standard - but it's still the best standard for metadata about events, and events are still a great domain for Semantic Web technologies :)

Labels:

March 8, 2007

On The Parallel Future Of Programming

I wrote about it before, but it deserves to be repeated a couple of times:

  1. Processors are not getting faster at executing single threaded programs anymore. In the past you could be sure that the next CPU generation would execute any program faster - this is not true anymore.
  2. CPU development now centers on building more and more processing cores - hence all compute-intensive applications that want to be fast need to be multithreaded.
  3. Current programming languages and tools are mostly not well suited for concurrent programs. In the next years we will see a lot of development to address this shortcoming.

At FZI we just bought our first QuadCore machines - but obviously 4 is not going to be the limit - Intel has already demoed an 80 core chip.
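
To illustrate what 'multithreaded' means in practice, here is a minimal Java 5 sketch (a hypothetical example, not from any of the linked talks) that spreads a simple computation over all available cores using java.util.concurrent:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelSum {
        public static void main(String[] args) throws Exception {
            final long[] data = new long[8000000];
            for (int i = 0; i < data.length; i++) data[i] = i % 7;

            // one worker thread per available core
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            // each task sums one slice of the array
            List<Future<Long>> parts = new ArrayList<Future<Long>>();
            int chunk = data.length / cores;
            for (int c = 0; c < cores; c++) {
                final int from = c * chunk;
                final int to = (c == cores - 1) ? data.length : from + chunk;
                parts.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long sum = 0;
                        for (int i = from; i < to; i++) sum += data[i];
                        return sum;
                    }
                }));
            }

            // combine the partial results
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            pool.shutdown();
            System.out.println("sum = " + total);
        }
    }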

To learn more about this you can read the posts at O'Reilly Radar here and here.

Google Video also has TechTalks about a proposal to add better control abstractions to Java (which could be a simple step to improve concurrent programming with Java) and about MapReduce - a control abstraction Google uses to more easily take advantage of multiple processors.

There's also an enjoyable video about how a modern computer game takes advantage of multiple cores (about Alan Wake, the new game from the makers of Max Payne).  

Labels: ,

February 21, 2007

Semantic Search and Synonyms

Synonyms (and homonyms) are really the boring basis of semantic search, and I'd probably be one of the first to say that someone building semantic search shouldn't spend too much time on them because they're just not exciting enough ... but if there were a search engine that handled this well, it would have just saved me half a day. It could have told me that what the knowledge engineering community calls knowledge formulation by "Domain Experts" and "Subject Matter Experts" is called End User Programming in the Software Engineering community. And similarly that provenance (or traceability) is Lineage in the DB community. And don't get me started on Algorithmic Debugging aka Declarative Debugging, Declarative Diagnosis, Guided Debugging, Rational Debugging aka Deductive Debugging.

But then - at this level these labels are often not synonyms but similar concepts - and then it might be interesting again. The query "Similarity based semantic search" still returns nothing ;-)

Labels:

February 20, 2007

The Strange Content Label Incubator Group

Today the W3C Content Label Incubator Group published its final report. I must say I'm mystified by this group. I was when I first learnt about them and I still am.

See, here's what they want to do:

In essence what's required is a way of making any number of assertions about a resource or group of resources. In order to be trustworthy, the label containing those assertions should be testable in some way through automated means.

So now you might guess that they are part of the Semantic Web community - but you would be wrong. It actually seems they are actively trying to avoid both the SW label and the use of RDF.

Let's have a look at an example of what they say about their relation to RDF:

It is anticipated that the primary encoding will be in RDF but that alternatives will be considered: for example, extensions for RSS and ATOM to allow a default cLabel to be declared at the channel/feed level with overriding cLabels at item/entry level.

I think what they are trying to say is that they want to mostly encode their model as RDF/XML. Which immediately raises the question: are the data models compatible, the same even? Well, mostly ... more about that in a sec. And what they should be saying in the second part of the example is that they still need ways to embed RDF in ATOM and RSS (other than RSS 1.0, obviously).

So, what is their problem with RDF? It's the groups. RDF makes statements about resources, but they want to make statements about groups of resources. Now you may point out that OWL at least allows making statements about classes, and that classes can be described in pretty sophisticated ways ... I'm not sure whether they have thought about that, but in any case: the groups of resources they are envisioning aren't easily or naturally captured in OWL. Consider their examples:

  • As a matter of policy, all content created after 1 January 2005 meets WAI AA standard.
  • Content created after 1 January 2006 meets the Mobile Web Initiative's mobileOK standard.
  • There is no sex or violence in any content but resources whose URLs contain the word "-pg" may portray bare breasts, bare buttocks, alcohol or gambling.
  • The content is organized in such a way that the genre of a resource (pop, film, fashion etc.) can be inferred from its host, such as http://fashion.example.com
  • All material is copyright Exemplary Multimedia Company
  • Some metadata is unique to a given resource, such as title and author. This can be accessed using a URI associated with the resource. This might be a URL, an internal ID number or the resource's ISAN number.

But then - you will have a hard time defining any formalism (other than a full-fledged imperative programming language) that can. And you'll still need some metadata attached to each element - how else will I know when it was created? And in any case: why? Why do I need such complex groups? It is here that their argument fully collapses; here's what they say:

Rather than spend considerable time and effort to create a complete set of metadata for each resource, the Exemplary Multimedia Company wishes to group resources together for descriptive purposes.

In other words: because the content provider can't be bothered to apply these rules himself - which he could easily do with a script, where he would have a full-fledged imperative language at his disposal (see the sketch below). I think that's a pretty weak excuse to throw away a standard.
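To make that concrete: here is a toy version of such a script in Python. The URLs, dates and label vocabulary are made up for illustration - the point is only that a dozen lines of ordinary imperative code cover all of the grouping rules from the example above:

```python
from datetime import date

def labels_for(url, created):
    """Apply the Exemplary Multimedia Company's policies to a single
    resource. URLs, dates and label names are invented for illustration."""
    labels = {"copyright": "Exemplary Multimedia Company"}
    if created > date(2005, 1, 1):
        labels["accessibility"] = "WAI AA"
    if created > date(2006, 1, 1):
        labels["mobile"] = "mobileOK"
    # Resources with "-pg" in the URL may show mild adult content.
    labels["adult-content"] = "mild" if "-pg" in url else "none"
    # The genre is inferred from the host, e.g. fashion.example.com.
    host = url.split("/")[2]
    labels["genre"] = host.split(".")[0]
    return labels

print(labels_for("http://fashion.example.com/story-pg.html", date(2006, 5, 1)))
```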

At some other location in the document they give a different reason for their dislike of RDF - that it does not allow defaults. That's just false: OWL does not, but the logic programming approaches under discussion for the rule stack do ... RDF itself is agnostic to these things. In any case, that's not the point. I'd guess the creators of filter software that are part of this community have created their filter files with these kinds of groups - and they want to keep it this way. And maybe, for this kind of application, it even makes sense - as a way to save bandwidth ... but if this is the real reason, then it should be argued this way (although I'd still say it's wrong).

There are more problems with their current document, but this post is already ridiculously long. Just one more example: they specify trust as a core problem in their mission statement and then barely touch it (that reminds me of a different community that does the same ... ahh, that'll be the Semantic Web community).

Labels:

February 11, 2007

Apple Knowledge Navigator

Apple's 1987 vision of a computer interface of the future - not that different from descriptions of how people should interact with Semantic Web agents (video below, or watch it at Google Video).

Labels: ,

February 8, 2007

Yahoo's Pipes

From O'Reilly Radar:

Yahoo! Pipes was released today with the goal of allowing people to easily mix, match, filter, sort and merge data sources into RSS feeds. These resulting RSS feeds are called Pipes and they allow you to do things like find all of the parks in your city or convert the news to Flickr photos. The product allows you to browse pipes, search for pipes, share pipes, or clone somebody else's pipe.

More:

Yahoo!'s new Pipes service is a milestone in the history of the Internet. It's a service that generalizes the idea of the mashup, providing a drag and drop editor that allows you to connect Internet data sources, process them, and redirect the output. Yahoo! describes it as "an interactive feed aggregator and manipulator" that allows you to "create feeds that are more powerful, useful and relevant." While it's still a bit rough around the edges, it has enormous promise in turning the web into a programmable environment for everyone.

Very cool stuff, but "generalizing the idea of the mashup" - wasn't this the job of the Semantic Web? (Yes, a pipes-like service on RDF would be much, much cooler - see the sketch below.)
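Just to sketch what I mean (hedged: this uses rdflib, and the feed URLs are made up): on RDF, every pipe stage is simply a graph-to-graph transformation, so merging sources and filtering the result compose naturally:

```python
from rdflib import Graph

# A 'pipe' over RDF: merge several sources into one graph, then transform
# the result with a SPARQL CONSTRUCT query. The feed URLs are invented.
pipe = Graph()
for source in ["http://example.org/feed1.rdf", "http://example.org/feed2.rdf"]:
    pipe.parse(source)  # fetch and merge each RDF source

# Filter stage: keep only items that carry a title. The result is itself
# an RDF graph, so further stages can be chained onto it.
filtered = pipe.query("""
    PREFIX rss: <http://purl.org/rss/1.0/>
    CONSTRUCT { ?item rss:title ?title }
    WHERE     { ?item rss:title ?title }
""").graph

print(filtered.serialize(format="turtle"))
```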

Even more here and  here.

Labels: ,

December 24, 2006

The Real Difference Between Semantic Web And Web 2.0

From the Swoogle homepage:

Q: Do you have any plans to commercialize Swoogle?

No. Swoogle is a research project. We have no interest in commercializing the ideas or technology.

Labels:

RDF Views

There has been some work on views over RDF / ontology data - but most of it is not very useful and is complicated in a strange way. I'm mystified why nobody has ever trivially transferred the ideas from relational databases (or maybe I just haven't found the work):

  1. A view on an RDF graph is an RDF graph.
  2. A view is defined by a SPARQL CONSTRUCT query.

Then we can either materialize these views or compute just the parts necessary to answer a query on the view (the mechanisms needed to decide which parts of a view must be computed should be a pretty straightforward extension of ideas from deductive databases). And there is even a nice subset of SPARQL queries that could be used to define updateable views (queries for which every variable in the WHERE part also appears in the CONSTRUCT part, and that contain no unions and the like).

Ah well, actually implementing this would take a while ... but I would love to have it :)
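That said, a toy version of the materialized variant fits on one screen with rdflib (the source graph and the view definition below are made up; everything hard - deciding which parts of a view to recompute, updateability - is exactly what this sketch leaves out):

```python
from rdflib import Graph

base = Graph()
base.parse("http://example.org/people.rdf")  # made-up source graph

# A view is defined by a SPARQL CONSTRUCT query; its result is again an
# RDF graph - the direct analogue of a relational view.
VIEW_DEF = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    CONSTRUCT { ?p foaf:name ?n }
    WHERE     { ?p a foaf:Person ; foaf:name ?n }
"""

def materialize(graph, view_def):
    """Materialize a view by running its defining query over the base graph."""
    return graph.query(view_def).graph

view = materialize(base, VIEW_DEF)
# Queries against the view are just queries against the materialized graph;
# the lazy variant would rewrite them against VIEW_DEF instead.
for row in view.query(
        "SELECT ?n WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?n }"):
    print(row.n)
```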

Labels:

December 18, 2006

German Quaero Now Theseus?!

It seems the German-French cooperation to build an Internet search giant has been terminated. Apparently the French wanted to focus more on traditional search while the Germans wanted to focus on "Semantic Technologies". The German project (which - unlike the French part - hasn't officially started yet) looks set to go ahead anyway - but now named Theseus.

Strange. Even though the FZI is planning to participate in this project I hadn't heard anything about this before ...

Labels:

December 7, 2006

Ask City And The Semantic Web

Ask City is the new local search portal released by Ask, and no - it's not a Semantic Web application. But it should be.

For me, one of the main new ideas I took home from this year's International Semantic Web Conference was that for many Semantic Web technologies there is only a limited window of opportunity to move into the mainstream. If the SW technologies don't make it in time, other technologies will have been used to solve most of the problems they were conceived for. The other technologies may not solve the problem as completely or as elegantly - but their existence makes SW technologies a harder sell.

Take Ask City as an example. In a way it's a traditional mashup - it integrates data from (at least) CitySearch, Yelp, Judysbook, Ticketweb and Urban Mapping. Exactly the kind of data integration challenge that the Semantic Web wanted to solve. However, it was probably not created with RDF or OWL, because other technologies were more mature, more tools existed, people understood them better ...

And there is the "window of opportunity" closing a little bit - SW technologies could solve this problem in a more elegant and flexible manner, but it just got a little bit harder to convince people of that. It's gotten a little bit harder to show a visible(!) added benefit when people already see large scale web information integration happening without RDF.

Labels:

The BAsAS Architecture For Semantic Web Annotations

A poster I presented at the 1st Semantic Web Authoring and Annotations Workshop at the ISWC 2006. 

We describe a generic architecture for the (semi-automatic) creation, storage and querying of annotations of web resources. Our BAsAS architecture uses recent advances from the Semantic Web and Web 2.0 communities to make Semantic Web annotations a reality. The BAsAS architecture makes it easy for users to start annotating and easy for developers to use the annotations that get created.

Besides describing the general architecture we also detail an implementation of this architecture built for a Semantic Web community portal.

Think of it as Annotea, but better. The presented system addresses some of the most important shortcomings of Annotea: that there are only plugins for the Firefox browser (shutting out the majority of web users) and that there is no query language for annotations.

Actually I'm still quite annoyed that it only got accepted as a poster. It was not "innovative enough", the changes to Annotea not big enough. Ah well, I put it down to my bad writing. In a way I even agree that we don't need another Semantic Annotation paper - we need applications that come with a nice user interface and are usable "out of the box" (in particular without the need for the user to worry about finding a server - something you have to do with current Annotea tools).

The long version of the paper is here.  

Labels: ,

A Topic Hierarchy On The Web

We present the architecture and interface of a metadata registry for a large e-learning site. The metadata registry is very simple to integrate by both content and application providers. It takes its inspiration from currently successful metadata architectures and aims to be an evolutionary change to the web – using long established standards where possible.

Poster at the ISWC 2005, authors are Valentin Zacharias and Stephan Grimm.

The entire paper is here.

Labels: ,

Semantic Announcement Sharing

This paper stems from the idea that maybe the painstakingly slow adoption of the Semantic Web into the mainstream WWW can be accelerated by taking clues from the tiny Semantic Weblets already present today.
We have identified RSS as one particularly successful Semantic Weblet, formed an opinion on why it was successful and have then tried to include all its success factors into a new Semantic Web application.
This paper argues that in order to build a successful Semantic Web application, considering only technical aspects is not enough; economics, the motivation of the actors, necessary changes and available know-how are also important.

Authors are: Valentin Zacharias and Mike Sibler

Published in the Proceedings of the Fachgruppentreffen Wissensmanagement 2004.

The entire paper is here.

Labels: ,

KAON - Towards a large scale Semantic Web

The Semantic Web will bring structure to the content of Web pages, being an extension of the current Web, in which information is given a well-defined meaning. Especially within e-commerce applications, Semantic Web technologies in the form of ontologies and metadata are becoming increasingly prevalent and important. This paper introduces KAON - the Karlsruhe Ontology and Semantic Web Tool Suite. KAON is developed jointly within several EU-funded projects and specifically designed to provide the ontology and metadata infrastructure needed for building, using and accessing semantics-driven applications on the Web and on your desktop.

In Kurt Bauknecht and A. Min Tjoa and Gerald Quirchmayr, E-Commerce and Web Technologies, Third International Conference, EC-Web 2002, Aix-en-Provence, France, September 2-6, 2002, Proceedings, volume 2455 of Lecture Notes in Computer Science, pp. 304-313. Springer, 2002.
ISBN: 3-540-44137-9

Very long list of authors (there is actually a funny story behind one name in the author list ... but not something to write on a website - ask me).

The entire paper is here.

Labels: ,

On Knowledgeable Unsupervised Text Mining

Text Mining is about discovering novel, interesting and useful patterns from textual data. In this paper we discuss several means of introducing background knowledge into unsupervised text mining in order to improve the novelty, the interestingness or the usefulness of the detected patterns. Germane to the different proposals is that they strive for higher abstractions that carry more explanatory power and more possibilities for exploring the input texts than is achievable by unknowledgeable means.

Andreas Hotho, Alexander Mädche, Steffen Staab and Valentin Zacharias: Text Mining Workshop Proceedings, Springer, 2002

The entire paper is here.

Labels: ,

Clustering Ontology-based Metadata in the Semantic Web

The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Recently, different applications based on this vision have been designed, e.g. in the fields of knowledge management, community web portals, e-learning, multimedia retrieval, etc. It is obvious that the complex metadata descriptions generated on the basis of pre-defined ontologies serve as perfect input data for machine learning techniques. In this paper we propose an approach for clustering ontology-based metadata. Main contributions of this paper are the definition of a set of similarity measures for comparing ontology-based metadata and an application study using these measures within a hierarchical clustering algorithm.

A. Mädche and Valentin Zacharias, Proceedings of the Joint Conferences 13th European Conference on Machine Learning (ECML'02) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02).

The entire paper is here.

(older paper, just posted it for completeness)

Labels: ,