Gravatar done right

Warp rightfully pointed out the privacy issues of Gravatar and demonstrated how easy it is to exploit them. So is it all wrong? Certainly not. Gravatar serves a purpose − warp would certainly agree, after all he uses it himself. The problem is that sites using it don't give you a choice whether to use it or not: when commenting on blogs you often only see afterwards what happened, and since they require you to enter your email address there is no way out.
But the problem can be tackled from another side as well. Gravatar could have done much better. Let's first look at the problem they're actually trying to solve.

It's the data portability / online identity problem. Many people only want to enter their information once, not for every website. Avatars are one part of that. You upload your avatar to Gravatar once and after that you don't have to do anything but use the same email address everywhere, the rest happens automagically. OpenID tries to solve a similar problem with its Attribute Exchange specification: when you log in with your OpenID the relying party can request information about you from your identity provider. And FOAF is all about that as well. Using the FOAF+SSL protocol you can log in to a Web site with your WebID and the Web site can look up all sorts of information about you from your FOAF file. Such a file can also contain a link to an avatar.
But the problem with this is that avatars are a design element for Web sites, they can't just display every random image users point them to − they could be of arbitrary dimension and file size. Sure they could shrink them with HTML but they still would have no control over the pageload they impose on their visitors. Downloading the image and cropping it would mean every blog would have to store pictures of every commenter. And the commenter would have no control over how the image gets cropped. Thus the need for a service like Gravatar where you know what you get.

Then there's the problem of choice. Other people don't want to have Gravatars at all and they don't want any identifying information about them published. This is a big problem with OpenID that people don't realise, by the way. You don't publish your commenters' email addresses for two reasons: first so they don't get spammed and second so they can't be identified. If you let people comment on your blog with their OpenID and without their email address then the spam problem is gone, since OpenIDs can't get spammed. Yet blogs still happily publish their commenters' OpenIDs and thus their identity.

So what would Gravatar done right look like? Every Web site that wants to display avatars would have to sign up with the imaginary Gravatar alternative service. They would enter their domain name and the avatar service would generate a key for them, a shared secret between the site and the avatar service. Now when the site generates the URIs for the avatar images it not only hashes the user's email address but also encrypts the hash with the key it got (this encryption has to have a certain guarantee of uniqueness and has to be reversible). A browser loading the site would look up this image at the avatar service. This service then has to relate the string in the URI back to the user. A naive approach would be to try every combination of user email address they have and key of a site that registered with them. The number of those combinations is huge though: with 1,000,000 users on the avatar service and 100,000 registered sites you already have 100,000,000,000 combinations to try out (in the worst case). Therefore the avatar service needs to know which Web site the request came from so it can use the key it has for that Web site and decrypt the string − after that it's a simple lookup of the hash like Gravatar does it. The information about the origin could either be taken from the Referer HTTP header (but some users turn sending it off in their browsers and it would also not allow hotlinking of the avatar images) or it could be a parameter in the image URI.
What did we gain now? Simple: the URIs for the images are now different for every Web site and thus can't be used for identity smushing anymore (as long as the Web site keeps their key secret), yet you still get the same image for every user that signed up with the avatar service on every Web site. The uploaded images themselves are still relatively unique and can be used for comparing identities but that is the choice of the user who uploaded it (the alternative service would have to make the risks clear to their users). On the other hand if you don't upload an image to the avatar service it would return a generic image which isn't unique.
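Here is a minimal sketch of that scheme in Python. It assumes the email hash is an MD5 digest of the trimmed, lowercased address, like Gravatar uses. For the "reversible encryption" it stands in a small Feistel construction over HMAC-SHA256, only because that needs nothing beyond the standard library; a real service would pick a vetted cipher. All function names, the `site` URI parameter and the host `avatars.example` are made up for illustration.

```python
import hashlib
import hmac

ROUNDS = 4  # assumption: fine for an illustration, not a security claim


def email_digest(email: str) -> bytes:
    """Hash the normalised email address (MD5, 16 bytes)."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).digest()


def _round(key: bytes, index: int, half: bytes) -> bytes:
    """Keyed round function: HMAC-SHA256 truncated to the half-block size."""
    return hmac.new(key, bytes([index]) + half, hashlib.sha256).digest()[: len(half)]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_digest(site_key: bytes, digest: bytes) -> str:
    """Web site side: turn the email hash into a site-specific token."""
    left, right = digest[:8], digest[8:]
    for i in range(ROUNDS):
        left, right = right, _xor(left, _round(site_key, i, right))
    return (left + right).hex()


def decrypt_token(site_key: bytes, token: str) -> bytes:
    """Avatar service side: recover the email hash from the token."""
    data = bytes.fromhex(token)
    left, right = data[:8], data[8:]
    for i in reversed(range(ROUNDS)):
        left, right = _xor(right, _round(site_key, i, left)), left
    return left + right


def avatar_uri(site_id: str, site_key: bytes, email: str) -> str:
    """Build the image URI; the hypothetical `site` parameter names the origin."""
    token = encrypt_digest(site_key, email_digest(email))
    return f"https://avatars.example/avatar/{token}?site={site_id}"
```

Two sites with different keys produce different tokens for the same email address, and the service, knowing which key to use from the `site` parameter, gets the same hash back from both.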
Of course the images Gravatar returns when it doesn't find an avatar are not always generic. Web sites can ask it to generate an image for the hash then, an identicon or a monsterid. Those are unique then. The alternative service could do the same but it would have to generate the image based on the original encrypted string, not for the email hash that was decrypted from that. The generated image would then be unique for every Web site but you would have a different image on different Web sites. I think that's more interesting anyway. ;-)
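To illustrate that last point, here is a rough Python sketch of a fallback image derived from the site-specific string in the URI rather than from the decrypted email hash. The 5×5 mirrored grid is just my assumption of what an identicon-like pattern could be, not how Gravatar generates its images, and `identicon_grid` is a made-up name.

```python
import hashlib
from typing import List


def identicon_grid(token: str, size: int = 5) -> List[List[bool]]:
    """Derive a left-right symmetric on/off grid from the site-specific token."""
    seed = hashlib.sha256(token.encode("utf-8")).digest()
    half = (size + 1) // 2
    grid = []
    for row in range(size):
        # One bit per cell for the left half (including the middle column).
        left = [bool(seed[(row * half + col) % len(seed)] & 1) for col in range(half)]
        # Mirror it; for odd sizes the middle column is not repeated.
        mirrored = left[-2::-1] if size % 2 else left[::-1]
        grid.append(left + mirrored)
    return grid
```

Because the token differs per Web site, the same user gets a different pattern on each site, and the pattern can't be used to link the two.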

There's another problem: distribution. Gravatar is a centralised service and Web sites using it rely on it being the only one so that they get as many hits for their "avatar requests" as possible because everyone registers there instead of somewhere else. If there were to be an alternative service, why would Web sites use it over another and still be sure to get many hits? The solution is for the users to tell the Web site which service they prefer. Discovering this user preference would have to be part of an authentication protocol like in OpenID Attribute Exchange or FOAF+SSL − it can't be done based on email addresses anymore.
Apart from that the Web site would have to be registered with that service and know how it works. This could be solved with a common protocol where Web sites can register with an avatar service on the fly (or the service just generates the key for a Web site based on the Web site's domain name so there is no need for registering). The Web site could also ask for the image URIs via this protocol, then there is no need for giving the Web site a key at all (and it would be RESTful) or the protocol specification would tell the Web site how to construct image URIs for any avatar service using it.
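The "no registration" variant could look roughly like this in Python: the service derives each site's key from a secret only it knows plus the site's domain name, so it never has to store keys. This is a sketch under my own assumptions (HMAC-SHA256 as the derivation, `site_key` as a made-up name); how the key would reach the Web site safely is left open here.

```python
import hashlib
import hmac


def site_key(master_secret: bytes, domain: str) -> bytes:
    """Derive a per-site key from the service's secret and the site's domain name."""
    # Normalise so "Example.org" and "example.org." yield the same key.
    normalised = domain.strip().lower().rstrip(".")
    return hmac.new(master_secret, normalised.encode("utf-8"), hashlib.sha256).digest()
```

The service can recompute the key on every request from the domain alone, which is what makes registering on the fly cheap.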
This is a complicated matter. Will Web sites make requests to avatar services just to get image URIs? If not then you rely on constructing URIs, and once you have defined and deployed a URI structure you have to live with it indefinitely because there is no communication between the Web site and the avatar service and thus no protocol you could version. But this is no different for Gravatar at the moment.

In summary avatar services are a convenience feature and you can improve on Gravatar's concept quite easily. If you want to do it really right then it can get quite complicated and the costs would exceed the benefits. But I still think it's worth a try − and the idea is to move away from using email for authentication anyway. OpenID and FOAF+SSL open up interesting possibilities which can be exploited for the purpose of looking up avatars as well.

Entity tag diff selection

Problem:

Offering large and regularly changing resources like dataset dumps for download can get very inefficient: clients that want the data and want to keep it up-to-date will have to download the whole resource over and over again. Conditional GET can help to check if there have been updates at all but if the resource really changes often then this won't solve the problem.
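For reference, a conditional GET with an entity tag could look like this from the client side, sketched in Python with the standard library. The function names are mine; the only real pieces are the `If-None-Match` request header, the `ETag` response header and the 304 status code.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def conditional_request(url: str, etag: Optional[str]) -> urllib.request.Request:
    """Build a GET request that only asks for the body if the entity tag changed."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url: str, etag: Optional[str]) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new_etag), or (None, etag) when the server answers 304."""
    try:
        with urllib.request.urlopen(conditional_request(url, etag)) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: the local copy is still current
            return None, etag
        raise
```

This saves the download when nothing changed, but as soon as the dump differs at all the client still gets the whole thing again.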
One solution is to provide a feed that captures changes to the resource. There can be a link with an appropriate link relationship to allow the client to discover this feed. However this solution is not very customised with regard to the client. How does the client know where the changes to the version it downloaded start in the feed? Can all changes ever made be kept in that feed to allow clients with very old copies to stay up-to-date as well or will this feed (even if paginated) just become unmaintainable?
Another solution is to provide resources that represent diffs between different versions. The client can select the diff it needs. If the main resource is generated from a database then the diff can be generated as well. Otherwise, if it's a static file (e.g. a zip file) then this approach means you have to keep files for all possible diff combinations or at least for diffs from previous versions to the current version − which can become unmaintainable as well.
For RDF datasets it is easy: the large dump can be generated automatically on request (maybe cached) and zipped for transfer. But even if you don't do that on every request but only generate a dump every so often then the diffs can still be generated; they don't have to be real file diffs but could well be just RDF diffs.
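A minimal sketch of what I mean by an RDF diff, in Python: two versions of a graph are treated as sets of triples, and the diff is simply what was added and what was removed. It assumes ground triples only; blank nodes would need extra care because they have no stable names across versions. The names `RdfDiff`, `rdf_diff` and `apply_diff` are made up.

```python
from typing import NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


class RdfDiff(NamedTuple):
    added: Set[Triple]
    removed: Set[Triple]


def rdf_diff(old: Set[Triple], new: Set[Triple]) -> RdfDiff:
    """Triples to add and to remove to get from the old version to the new one."""
    return RdfDiff(added=new - old, removed=old - new)


def apply_diff(graph: Set[Triple], diff: RdfDiff) -> Set[Triple]:
    """Bring a client's copy up to date without downloading the full dump."""
    return (graph - diff.removed) | diff.added
```

The diff is usually much smaller than the dump, and it doesn't depend on how either version happened to be serialised.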
So, if you go for diffs, then how do you let the client discover those diffs without previous knowledge about the URL structure (i.e. hypertext-driven)? And how do you make sure the client discovers exactly the diff it needs?

[…]

XForms + RDFa

There have been various attempts in the past to map between web forms and RDF data or − more RESTfully − define accepted inputs of a web service with "RDF Forms".

So, the other day I thought: we have RDFa to mark up content in XHTML (or SVG, …) documents as RDF data − but does RDFa always have to be used to tell us about existing data? What if I'd mark up a web form with RDFa to declare the meaning certain form fields would have when you use the form to enter data? Of course this has to be used carefully because an RDFa parser would correctly try to get existing data from the form field elements rather than understand that it describes the meaning of potential new data. (And apparently it will ignore the contents of anything entered into form fields.)

But apart from that problem I can see two scenarios where this could be used:
First, a web service client could parse the form and understand the accepted inputs of the web service. Maybe it could even transform this into an RDF Form.
Second, with some scripting you could even use the web form in a browser to POST RDF data to the server. You would have a JavaScript that reacts to the submit event, parses out the RDFa attributes and the current values from the form (maybe using a customised version of the rdfQuery plugins for jQuery) and creates output in an RDF format that will get submitted. The problem: regular HTML 4 forms don't let you submit forms in any RDF format. Here you would have to cancel the form submit and manually send your data over XMLHttpRequest. HTML 5 comes with Web Forms 2.0 and has some new ways to submit form data, like an XML format. But since that is a fixed format it doesn't really help us. Discover XForms! From what I've read so far it should be possible to construct arbitrary XML for submission (normally it will do this automatically from the form data but I'm sure a script could customise this process). So with XForms you could just serialise the RDF graph as RDF/XML and you could even use the PUT method to RESTfully update the representation of your resource on the server!
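To make the idea a bit more concrete, here is a sketch of the first scenario (a client reading the form) in Python rather than browser JavaScript. It is deliberately simplified: it assumes each form field carries a `property` attribute holding a full IRI, whereas real RDFa would normally use CURIEs with prefix declarations, and it emits every value as a plain literal in N-Triples. `FormPropertyParser` and `form_to_ntriples` are made-up names.

```python
from html.parser import HTMLParser
from typing import Dict, List


class FormPropertyParser(HTMLParser):
    """Collect a mapping from form field name to the IRI in its `property` attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "textarea", "select"):
            attributes = dict(attrs)
            if attributes.get("name") and attributes.get("property"):
                self.fields[attributes["name"]] = attributes["property"]


def form_to_ntriples(form_html: str, subject: str, values: Dict[str, str]) -> List[str]:
    """Turn entered form values into N-Triples lines about `subject`."""
    parser = FormPropertyParser()
    parser.feed(form_html)
    lines = []
    for name, prop in parser.fields.items():
        if name in values:
            # Escape backslashes, quotes and newlines for an N-Triples literal.
            literal = (
                values[name]
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            lines.append(f'<{subject}> <{prop}> "{literal}" .')
    return lines
```

Fields without a `property` attribute are simply skipped, so ordinary form plumbing doesn't end up in the RDF.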

Semantic data submission gleaned from forms! Hooray! Anyone got time to code it? :-)

Nick D'Virgilio drum clinic

Nick D'Virgilio is an awesome drummer. I can tell because I saw him live yesterday. ;-)

[…]

My requirements for purchasing music online

I've been a CD buyer so far for various reasons. First, I like albums. I mostly listen to whole albums and to me they are conceptually coherent works which should be enjoyed in one piece. Artwork, lyrics, even the choice of the CD case are part of a design decision, part of the whole work (probably not the case most of the time anyway). Apart from that I'm a collector. When I buy something I like to have something physical in my hands (does that make me a capitalist?). And I like to put this thing on my shelf, having a visible collection.

But all that aside, there are also reasons why I don't like buying music online that have to do with how music is sold online. Let's first step back and look at what you can actually buy: you buy the right to download a specific track or set of tracks from a specific platform, often only a certain number of times and often with DRM (although that is fortunately going away). I don't know about you but I never liked those conditions. Maybe my CDs going bad is more likely than hard drive crashes exhausting my number of allowed downloads. But think about what else might happen: you might lose your account data for the shop's website (ok, they can send you that), your account might get hacked, the website might go out of business. Or you happen to be somewhere else in the world and don't have your music collection with you so you just want to quickly download some tunes again.

Doesn't this last point seem quite realistic in today's mobile world? We don't always run around with all of our data these days. Rather the trend seems to be to put it online. For music there are various websites which help you with that. Some let you upload your collection of files and allow you to access it anytime, or even let your friends access it in places where that is legal. Others don't even require you to upload your collection but just scan the files on your drive and provide you with access to the files they have on their drives already.

Let's carry this idea a bit further. What if ownership of music was more decentralised? What if what you bought was a certificate saying that you own a certain piece of music and you could go to any music shopping / streaming / download website (the difference wouldn't matter anymore) to request access to it? We would need services confirming those certificates (and not only the original shop you bought it from, in case that goes out of business). And we would need an identifier infrastructure to be clear which piece of music the certificate actually talks about. MusicBrainz could probably provide that. Maybe they need to work a bit more on the level of detail of their data but maybe that doesn't matter anymore with online music because most people don't seem to care about editions and remasters anymore. So which level of abstraction in FRBR or similar models is concerned is something to figure out. But the music industry also hasn't really embraced MusicBrainz so far, they prefer re-inventing the wheel and building up their own identifier system. At least that was the plan some time ago, did they give up on it again?

So, how likely is it that this happens? Not very likely I think because people would need to agree here and the music industry would probably be scared of the idea of losing control.
Also it seems like they waited too long to change things anyway and now people prefer to just download songs and not pay for them. But to me it seems like something that suits people's needs while still being on a totally legal basis. I could still own stuff (which is not the case with flat-rate models) and have convenient access to it wherever and whenever I feel like it.

SWIG UK

So today I went to SWIG UK. It had some great talks, some slightly dry talks and lots of interesting people.
The talks which impressed me most were the one by Orri Erling (supported by Yrjänä Rankka) and the one by Leigh Dodds. Not so much content-wise, because basically both were program managers presenting their company's flagship product.
The reason Orri's talk impressed me is that even though he is blind he managed to give his talk in the self-confident manner of a businessman, and it became clear that this self-confidence is backed by decades of experience in the computer sector. So not only did he not seem too strongly influenced by his disability but, on the contrary, he seemed like the most charismatic person in the room.
Leigh's talk I really liked because of the style. It was a high-level, abstract introduction to the Talis platform but nonetheless he kept it quite entertaining by using clear language, metaphors and comparisons. Apart from that his slides only contained headlines and very few other words, the background being filled with supporting metaphorical images. This is a style I see more and more on Slideshare. It's not very good for understanding what's going on if you didn't attend the talk but people still don't seem to realise that you don't have much time to read all the text on a slide while they talk. You can only concentrate on one thing at a time.

The talks were video recorded and I think the videos as well as the slides will appear on the page of the event on the site of the CREW project − which is also where they were supposed to get annotated live.