Dilettante's Ball: Gussying up OpenSearch

Wednesday, June 15, 2005

So, last week, before I left for SMUG, Mike Rylander (of Evergreen-ILS), Joshua Ferraro (of LibLime/Koha) and I began talking about OpenSearch interfaces for our respective catalogs. The only reason I was really able to contribute to the conversation was that I had my little Python CGI, but I hadn't thought much about it since I wrote it.

I had largely given up on targeted searching within OpenSearch because targeted searching would be wasted on A9. A9 basically only does keyword and phrase searching, so more sophisticated queries would really only produce results in the catalog column. You would lose any advantage of A9's cross searching, plus you'd lose any advantage of the catalog's native interface. Since I hadn't seen anyone else implementing OpenSearch, there was no point in pursuing this.

Well, until last week. Mike was talking about implementing OpenSearch in Evergreen and was interested in including results from Koha catalogs. Since I had already created my OpenSearch widget, it also seemed like a natural target. More natural, in fact, since the advantages of including the Georgia academic libraries in a search of Georgia public libraries (and vice versa) seemed so obvious.

This meant that targeted searching needed to work, though. I made the suggestion that queries should assume keyword and respond to CQL if it is supplied. CQL is just so intuitive that it seems silly not to use it, plus there's the added bonus of not having to manipulate it in any way before I send requests to yazproxy. Always looking for ways to get out of doing work, you see. So with that statement, I packed up and headed to SMUG and its lack of internet access for the rest of the week.

When I got back, not only had Mike created a proof-of-concept search interface, he had also created an extension to OpenSearch to account for relevancy and merged sets. He proposes adding the namespace xmlns:openIll="http://open-ils.org/xml/openIll/1.0" and using an <openIll:relevance> tag to carry each result's relevance. With this, result sets can be merged and sorted by relevance, and any search targets that don't include the tag will appear as separate columns like normal.
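To make the merging idea concrete, here is a rough Python 3 sketch of what a client could do with that tag. It is not Mike's implementation. It assumes each target returns RSS with one <openIll:relevance> per <item> holding a plain number, and the function names (parse_items, merge_targets) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the proposal; the prefix is only a local alias.
NS = {"openIll": "http://open-ils.org/xml/openIll/1.0"}


def parse_items(rss_text):
    """Return (title, relevance-or-None) pairs for each <item> in a feed.

    Assumption: the relevance tag sits directly inside <item> and
    contains a number. Items without it get None.
    """
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        rel = item.findtext("openIll:relevance", default=None, namespaces=NS)
        items.append((title, float(rel) if rel else None))
    return items


def merge_targets(feeds):
    """Split targets into one merged, relevance-sorted list and plain columns.

    feeds maps a target name to its RSS text. A target is merged only if
    every one of its items carries a relevance score; otherwise it is
    kept as its own column, in the order the target returned it.
    """
    merged, columns = [], {}
    for name, rss_text in feeds.items():
        items = parse_items(rss_text)
        if items and all(score is not None for _, score in items):
            merged.extend((score, name, title) for title, score in items)
        else:
            columns[name] = [title for title, _ in items]
    # Highest relevance first; ties keep their original order.
    merged.sort(key=lambda row: row[0], reverse=True)
    return merged, columns
```

Calling merge_targets with a dictionary of target names and feed text gives back the merged list plus a dictionary of the targets that have to stay in their own columns. Note that this only works if the scores from different targets are actually comparable, which is a big "if".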

There are some issues with this method, the most obvious being "what is relevance?" For example, I have no way to sort or get relevancy rankings from Voyager's Z39.50 server. In order to make Mike's proof-of-concept work, I had to "fake" a relevancy ranking based on result order (which is always reverse chronological by creation date). I take 100 minus ((the quotient of 100 divided by the number of results) multiplied by (the result number minus 1)) (wow, refreshing my arithmetic vocabulary), so the first result always scores 100 and each later result scores a little less. My point here is, that's a crappy algorithm. For queries that produce thousands of results, you'll wind up with upwards of 50 hits with greater than 98% relevancy, despite the fact that the second result may not be relevant at all. Of course, I'd never put this in a production system, but it still goes to prove that relevancy is relative. The other problem is that in a merged search of two or more OpenSearch targets, the results on page three of a given target may be considerably more relevant than, say, the third result of the other target, yet you will still get all of the less relevant results in the pages in between. While this is possibly a valid objection, it sort of misses the point of OpenSearch, which I view merely as a resource exposure tool rather than a robust search protocol (at this stage, anyway).
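For the curious, the fake ranking described above boils down to a one-liner. This is a Python 3 sketch of the arithmetic only, not the actual CGI code; the function name and the range check are my own additions for illustration.

```python
def fake_relevance(position, total):
    """Position-based stand-in for relevance.

    Computes 100 - (100 / total) * (position - 1), where position is the
    1-based rank of the result and total is the size of the result set.
    It knows nothing about the query, only about result order.
    """
    if total < 1 or not 1 <= position <= total:
        raise ValueError("position must be between 1 and total")
    return 100 - (100 / total) * (position - 1)


if __name__ == "__main__":
    # Count how many hits in a 2,500-result set score above 98.
    total = 2500
    high = sum(1 for i in range(1, total + 1) if fake_relevance(i, total) > 98)
    print(high, "of", total, "results score above 98")
```

With a 2,500-result set, the first 50 hits all come out above 98, which is exactly the problem: the score drops by only 0.04 per result, no matter what the records actually contain.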

The good news is that when Art and I have our OPAC mirror set up, we will actually be able to do relevance properly. The even better news is that we don't even have to have a concept of how the public interface is going to work to get this functionality. We'll have the data, we'll have the indexer, we just need to point Z39.50 queries to it and then direct them to the real OPAC (for now).

One thing I want to point out, though, is how cool the Evergreen and Koha results are. Not only are they deduping (we'll be able to do that later), but they display a sort of brief holdings (x copies available) and provide links on author and subject. I wasn't even thinking about providing subjects! I will have to see how easy it is to add this sort of functionality to our search results.

I hope to be able to point our OpenSearcher at our DSpace repository soon, too. We're currently trying to install the OCLC SRW web app and this could go a long way in providing exposure to DSpace from other resources like WAG the Dog or our catalog.

2 Comments:

At 1:36 PM, June 15, 2005, Ross said...: Er, all those words and I forgot to mention that targeted searching using CQL works. Also, using - works as a boolean NOT and quoted text performs a phrase search.
At 5:15 PM, June 16, 2005, Ross said...: Richard, these are good points. Our OpenSearch widget defaults to keyword anywhere if no CQL is passed. This would always make it "A9-aware", but if libraries (or whoever) were to implement more sophisticated clients, it's ready for that.

I like the idea of including a "targeted search syntax" in the description document. I would like even more for the library community to decide on a single syntax so the client doesn't have to do a lot of translating between targets. Of course, I've already cast my vote for CQL here.

I think I left out a piece about relevance that might clear some things up.

Mike suggests an <openIll:relevanceScale/> tag to make some sense of the actual relevance tags that are being sent.

Again, I really only see relevance being used with known and trusted targets. The occasions where merged sets are useful would only occur if the client has already identified the targets are similar in scope or audience. I am not sure I'd ever want to see A9, for example, using it.
