June 29, 2007

Rudi Studer on the Semantic Web

Another Google TechTalk about the Semantic Web, this time by Rudi Studer. In this talk he introduces Semantic MediaWiki, talks about pattern-based information extraction and presents the ORAKEL system, which allows users to ask complex queries in natural language.

It's a good talk, and (unless you are very familiar with these systems already) you won't regret watching it.


June 28, 2007

Defining Folksonomy

Recently I skimmed the (interesting) proceedings of the "Bridging the Gap between Semantic Web and Web 2.0" workshop. What surprised me was that, for all the papers talking about Folksonomies, none offered a convincing definition of the term. So I thought I'd share one with you:

A Folksonomy is the computer-stored record of the use of labels by many people.

This might be a bit surprising at first, but I think it will become clearer when I discuss two often-used candidate definitions:

First, Wikipedia: "A Folksonomy is a user generated taxonomy used to categorize and retrieve web content such as Web pages, photographs and Web links, using open-ended labels called tags". The problem with this definition (and with all that argue for a similarity to taxonomies) is that the most salient feature of a taxonomy is the explicit representation of a hierarchical structure - and that's exactly what a Folksonomy lacks. So in this sense a Folksonomy is more like a controlled vocabulary - except that it isn't controlled... so that leads nowhere.

Second, based on Peter Mika's groundbreaking "Ontologies are us" paper, some people say that a Folksonomy is a tripartite graph of persons, concepts and documents. There's nothing wrong with that, but to me it is still an incomplete definition, because it does not try to capture what is represented by this graph; it only describes the very basic structure.
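To make the tripartite structure concrete, here is a minimal sketch (all data made up) of a Folksonomy as a set of (person, tag, document) assignments - exactly the graph the Mika paper describes:

    from collections import defaultdict

    # A folksonomy as a record of tagging acts: who used which label on what.
    assignments = {
        ("anna", "semweb", "http://example.org/paper1"),
        ("ben",  "semweb", "http://example.org/paper1"),
        ("ben",  "rdf",    "http://example.org/paper2"),
    }

    # One slice of the tripartite graph: which documents carry a given tag.
    docs_per_tag = defaultdict(set)
    for person, tag, doc in assignments:
        docs_per_tag[tag].add(doc)

    print(docs_per_tag["semweb"])  # {'http://example.org/paper1'}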

In the end a Folksonomy really is only a computer-accessible sample of the use of language to name things. But then - naming things is at the core of language, conceptualizations and ontologies, and having a simple way to observe it (as imperfect as it may be) is no small thing!


June 25, 2007

The Limits Of SPARQL

SPARQL is an important step forward, a valuable tool for the handling of RDF stores - I'll not dispute that. However, SPARQL has also been hailed as the query language for the Semantic Web, the solution to the problems of accessing Semantic Web data - and that it isn't. I'll tell you why:

(1) SPARQL Interfaces and Computational Cost - Lots of websites today offer some kind of interface to access their data; almost none*, however, offer SQL interfaces. The most important reason for this is probably that SQL query evaluation can impose a serious and hard-to-control computational burden on the servers of the company supplying the data. SPARQL doesn't change this - in fact, SPARQL even encourages more complex queries (assuming the queries are evaluated against a relational database). So it is hard to see why companies that aren't offering an SQL interface would start doing so with SPARQL.
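To illustrate the point, here is a sketch (the endpoint URL is hypothetical) of the kind of query that is trivial to write but can force the server into evaluating enormous joins:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Three joins over completely unconstrained variables - a few lines to
    # write, but potentially cubic in the size of the store to evaluate.
    EXPENSIVE_QUERY = """
    SELECT ?a ?b ?c WHERE {
        ?a ?p1 ?b .
        ?b ?p2 ?c .
        ?c ?p3 ?a .
    }
    """

    endpoint = SPARQLWrapper("http://example.org/sparql")  # hypothetical
    endpoint.setQuery(EXPENSIVE_QUERY)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()  # the server pays the price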

(2) The Problems of Large Scale Federated Search - A Semantic Web search engine receives a query, does a bit of query processing, sends queries to the SPARQL endpoints it knows, aggregates and reasons over the answers it gets back, and finally returns a result to the initial query. That's federated search in a nutshell - and it isn't going to work; not at web scale and not simply. The problems with this approach are response time and query routing. Response time, because such a Semantic Web search engine is going to be SLOW - its speed is limited by the slowest SPARQL endpoint it has to access (plus the fact that it has to do a lot of network access). Query routing is a big challenge because the Semantic Web search engine would need to be very specific about which SPARQL endpoints it asks for an answer to a particular query - if it isn't, it is going to overwhelm the endpoints with traffic very quickly. Or what would you say if your site's SPARQL interface suddenly got requests for 1% of all Google searches?** - possibly without reimbursement.
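A naive sketch of that federated loop (endpoint URLs made up) makes the response-time problem visible - even if the requests were parallelized, no answer can come back before the slowest endpoint does:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical list of known endpoints; at web scale this would be
    # thousands of sites, and picking the right subset is the routing problem.
    ENDPOINTS = [
        "http://a.example/sparql",
        "http://b.example/sparql",
    ]

    def federated_query(query):
        answers = []
        for url in ENDPOINTS:
            ep = SPARQLWrapper(url)
            ep.setQuery(query)
            ep.setReturnFormat(JSON)
            answers.append(ep.query().convert())  # blocks on each endpoint
        return answers  # total latency >= the slowest endpoint's latency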

(3) Not all triples are equal - SPARQL knows two kinds of triples: those that exist and those that don't. Answering queries over diverse RDF data created in an uncontrolled, distributed way, however, also requires weights on the triples, based on how often they have been stated and by whom. Assume there are 5000 sources stating (USA is_adjacent Canada) and 4 stating (USA is_adjacent Uzbekistan) - do you really want to treat these two triples equally?
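Nothing in SPARQL expresses this, but the weighting itself is simple; a minimal sketch (data made up to match the example above):

    from collections import Counter

    # (source, triple) pairs as a crawler might collect them.
    observations = (
        [(f"http://site{i}.example/", ("USA", "is_adjacent", "Canada"))
         for i in range(5000)] +
        [(f"http://spam{i}.example/", ("USA", "is_adjacent", "Uzbekistan"))
         for i in range(4)]
    )

    # One vote per source: how many independent sources assert each triple?
    support = Counter()
    for source, triple in set(observations):
        support[triple] += 1

    print(support[("USA", "is_adjacent", "Canada")])      # 5000
    print(support[("USA", "is_adjacent", "Uzbekistan")])  # 4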

(4) Pedigree matters*** - As I understand it, SPARQL assumes one global graph of all RDF statements. This is problematic because it allows even a single malicious file to "infect" everything. In traditional retrieval, when you have one malicious file you get one bullshit result and n-1 normal ones. With one global graph, a single file can make all results unreliable.
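One way to keep the pedigree around is to store quads instead of triples, so that each statement stays attached to the file it came from and a bad file can be quarantined; a toy sketch (URLs made up):

    # Store (graph, subject, predicate, object) quads instead of merging
    # everything into one global graph.
    quad_store = {
        "http://good.example/data.rdf": {("USA", "is_adjacent", "Canada")},
        "http://evil.example/spam.rdf": {("USA", "is_adjacent", "Mordor")},
    }

    def triples_from(trusted_sources):
        """Answer queries only over files we trust: one malicious file
        taints its own graph, not every result."""
        merged = set()
        for source in trusted_sources:
            merged |= quad_store.get(source, set())
        return merged

    print(triples_from(["http://good.example/data.rdf"]))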

Conclusion - So in the end there is reason to doubt that many websites will offer SPARQL interfaces (1), and even if they do it will be difficult to use them to answer queries (2). Assuming these problems could be overcome, SPARQL still has a model that is purely boolean (3) and that assumes one global graph (4) - both notions inappropriate for web-scale query answering.

And yes - none of this is entirely new and none of it is a 100% certain showstopper; it's possible that all these problems could be overcome. But all this should serve as a reminder that SPARQL isn't the Semantic Web query language, at least not yet.

 

*: Facebook is the one notable exception; it offers an interface using a powerful query language (although it isn't SQL).

**: In fact, publicly available data tells us that this would only be roughly 24 queries/second (one percent of about 200 million searches a day is some two million queries a day, i.e. about 24 per second), but that number is almost surely much too low.

***: Yes, I know, a more appropriate title might be provenance or lineage, but I wanted to emphasize a slight difference in the concepts - I'm not interested in where each statement came from so much as in which statements stood together.


June 21, 2007

Crowdsourcing Management

From last week's Economist (June 16th-22nd, p. 67): A group in Britain is soliciting small pledges from a large number of football (=soccer) fans to take over a football club. Once a team is acquired, "every decision - from picking players for the squad to choosing tactics and identifying candidates for transfers - will be made by the syndicate's members". A coach will draw up proposals for what he thinks is right for the team, and the community can then vote on them - that's real dedication to Web 2.0 ideas, and I'm really interested to see whether the "wisdom of crowds" ideas work here (in fact I could imagine they do).

A prosaic reader might notice that this isn't so different from normal stock ownership - after all, stockholders also get to vote on company decisions. There are, however, two important distinctions: (1) the votes seem to be more concrete, more frequent and more immediate, and (2) no return on investment is promised - there are no dividends, and I don't think the "shares" can be sold.


June 19, 2007

Google Tech Talk on The Semantic Web

Stefan Decker, Eyal Oren and Sebastian Kruk give a Google Tech Talk on the Semantic Web (or actually on something like an "open structured data web" or the lowercase semantic web - they ignore reasoning and logics completely*). Stefan talks about general stuff and SIOC, Eyal about ActiveRDF and faceted browsing, and Sebastian Kruk about digital libraries. Quite interesting, but if you've been to Semantic Web conferences recently you'll already know most of what is presented. Sadly, a bit of the most interesting part - Stefan Decker's talk at the beginning - is missing :(

*: intentionally so, as Stefan Decker explains during the questions.


The Agile Development of Rule Bases

Another publication; it reflects on the adoption of agile methodologies (in particular XP) for the development of rule bases. I developed the ideas while writing an offer for a very interesting contract to develop a rule-based system; it's still not decided whether we get the contract, but at least I got a paper out of it :)

Both reviewers gave it the highest possible score for readability - so if this topic interests you at all, you can read the paper here.

Recently, with the large-scale practical use of business rule systems and the interest of the Semantic Web community in rule languages, there is an increasing need for methods and tools supporting the development of rule-based systems. Existing methodologies fail to address the challenges posed by modern development processes in these areas: namely the increasing number of end-user programmers and the increasing interest in iterative methods.

To address these challenges we propose and discuss the adoption of agile methods for the development of rule-based systems. The main contributions of this paper are three development principles for, and changes to, the XP development process to make it suitable for the development of rule-based systems.

I'm the sole author and will be presenting it at the 16th International Conference on Information Systems Development (ISD2007) in Galway. The submitted version (still anonymous - the conference has a double-blind review process) is here.


June 11, 2007

The Semantic Web Programming Service Provider

(some thoughts while doing a mental retrospective of the European Semantic Web Conference)

  • It seems obvious that there is an increasing trend towards the global integration of structured data. In my mind there is no doubt that this integration will happen to an ever larger degree over the next years (as it has over the past years).
  • It is unclear what kind of integration this will be: a closed, centralized approach (as exemplified by Google Base), centralized but open (like Freebase), or decentralized and open (the semantic web).
  • Assuming that the semantic web way is the right way, I'm not sure whether RDF is the right data model to base this on (yes, we could try to do it only with XML) - but it sure looks like it's worth trying.
  • For the semantic web to have any chance to take off, we need a semantic web programming service provider.

Programming Service Provider (PSP for short): the logical extension of "Application Service Provider" - instead of delivering applications, it offers the infrastructure to build, run and deploy applications. Ning and Yahoo Pipes are two existing programming service providers. This model of PSPs is very important for the semantic web because its decentralized nature imposes a burden on anyone who wants to build a semantic web application - she has to worry about network latency, crawling, keeping an index up to date etc. PSPs can take care of these problems.

So, what does a Programming Service Provider for the Semantic Web contain?

  • First of all: a local and (reasonably) up-to-date copy of the entire Semantic Web. This local copy needs to be ranked and as spam-free as possible.
  • An API to access this data (in particular, a way to discover URIs based on lexical resources and a way to discover subgraphs that contain information about a particular URI) - see the sketch after this list.
  • An environment to create applications that use this API (although access should also be possible remotely) - similar, for example, to the Yahoo Pipes editor.
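No such service exists yet, so the following is a purely hypothetical sketch of what the API from the second bullet might look like - all names are made up:

    # Hypothetical interface of a Semantic Web Programming Service Provider.
    class SemanticWebPSP:
        def find_uris(self, label):
            """Discover candidate URIs for a lexical label, e.g. 'Galway'."""
            raise NotImplementedError

        def describe(self, uri):
            """Return the ranked, spam-filtered subgraph that contains
            information about the given URI."""
            raise NotImplementedError

    # An application built on the PSP never crawls, indexes or ranks itself:
    # psp = SemanticWebPSP()
    # for uri in psp.find_uris("Galway"):
    #     graph = psp.describe(uri)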

The building blocks for this vision are starting to fall into place - PingTheSemanticWeb as a way to keep an index up to date, the Sindice lookup index presented at the ESWC, the recent DERI work on joins in very large RDF stores, and the SWSE search engine... But only if it all comes together will the semantic web have a chance to compete against Google Base/Freebase, because only then will it become simple to write applications that use the semantic web.

Sadly I'm currently not in a position to really contribute much work towards this vision - but I'll try.

But this post wouldn't be complete without a short discussion of what has no place in this vision.

  • There is no place for heavyweight ontologies (or rules, for that matter). Sure, these technologies have their merits and an important role to play, and imho they will become important parts of database technology. However, there are no inference technologies available, or even on the horizon, that can deal with web-scale data; that can deal with the size, rate of change and semantic heterogeneity to be expected on the web. This is true for rules just as much as for ontologies. It is an interesting research challenge to try to develop new kinds of inference mechanisms that some day could - but for now we don't even have an agreed-upon model of what should be inferred, much less can we compute it in reasonable time. And worse still - there really is no compelling use case (at least none that is not AI-complete).
  • And Semantic Web Services and NLP... I'll leave that for another time - now I need to figure out what happens BETWEEN 8:00 AM AND 9:00 AM :)


FOAF

Better late than never: I got myself a small FOAF file. In the process I also got myself a URI (for now I'm represented by http://vzach.de/foaf.rdf#vpz).
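For anyone curious what building such a file involves, here is a minimal sketch using rdflib - the name literal is a placeholder, not the content of my actual file:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    me = URIRef("http://vzach.de/foaf.rdf#vpz")  # the URI from this post
    g.add((me, RDF.type, FOAF.Person))
    g.add((me, FOAF.name, Literal("Your Name Here")))  # placeholder
    g.add((me, FOAF.homepage, URIRef("http://vzach.de/")))

    print(g.serialize(format="xml"))  # RDF/XML, the classic FOAF format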

My FOAF file is quite minimal in information content and will stay that way - as much as I believe in the publication of machine-understandable data on the web, I'm also a fan of privacy; of trying to preserve some private information in the digital age (although I'm afraid this may be a battle that is all but lost).

Update: Alright - I had another look at the FOAF specification... I wasn't aware that there are quite a few more work-related / non-privacy-relevant attributes and relations that I haven't filled in - so I'll grow my FOAF file in due time :)


June 8, 2007

The Perils of Tagging

If you try to find pictures of the European Semantic Web Conference on Flickr using the tag eswc2007, all you currently find are hundreds of pictures from the Electronic Sports World Cup 2007 - an event sharing the same acronym and hence the same tag. Oh, the irony...


June 1, 2007

Explorative Debugging For Rapid Rule Base Development

We present Explorative Debugging as a novel debugging paradigm for rule-based languages. Explorative Debugging allows truly declarative debugging of rules and is well suited to support rapid, trial-and-error development of rules. We also present the Inference Explorer, an open source explorative debugger for horn rules on top of RDF.

By myself and Andreas Abecker. I'll be presenting it at the Scripting for the Semantic Web Workshop at the European Semantic Web Conference next week.

You can read the entire paper here; I think it's actually quite readable and worth your time (if you have any interest in how to debug rules, that is).
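To give a flavour of what "declarative debugging of rules" means, here is a toy sketch (not the Inference Explorer itself - the facts and the rule are made up): instead of stepping through an inference engine procedurally, you inspect which bindings make a rule fire:

    # Toy knowledge base of RDF-style triples (all data made up).
    facts = {
        ("ex:Anna", "ex:parentOf", "ex:Ben"),
        ("ex:Ben",  "ex:parentOf", "ex:Cara"),
    }

    def grandparent_rule(kb):
        """Horn rule: grandparentOf(X, Z) <- parentOf(X, Y), parentOf(Y, Z)."""
        for (x, p1, y) in kb:
            for (y2, p2, z) in kb:
                if p1 == p2 == "ex:parentOf" and y == y2:
                    yield {"X": x, "Y": y, "Z": z}

    # Explorative debugging in miniature: look at the rule's matches
    # directly instead of tracing the engine's execution.
    for binding in grandparent_rule(facts):
        print(binding)  # {'X': 'ex:Anna', 'Y': 'ex:Ben', 'Z': 'ex:Cara'}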

I haven't really decided how much further to develop this line of research and the implementation of this debugger - hence I'd love to hear any feedback on these ideas. Think it's worthwhile? A waste of time? Do better debuggers exist already? Would you use such a tool?
