I drove to work today. Normally, I take the train, but for several stupid reasons, I decided to drive.
Atlanta drivers have a disdain for "rules" and "traffic laws" that makes me want to scream. They will pass you in an exit-only lane only to hold up that lane when they try to merge back into traffic 3 cars ahead. Four or five cars will plow through the intersection after the light has changed. The turn signal is apparently a sign of weakness.
So, because some fathead wants to get 6 car lengths ahead, we all sit and suffer in some of the worst traffic in the country. By actually following the rules, you:
- Are forced to sit in the traffic caused by those who break the rules
- Are probably more of a liability on the road, because when everyone else is doing the "wrong" thing, you become the unpredictable one by being different.
Which brings me to metasearch (and a jarring segue). I am currently not sitting in a meeting in Macon to decide which metasearch product the state is going to go with. And, really, it doesn't matter that much.
Although I won't name any names, we were down to two candidates:
- A "traditional" metasearch that uses standards like Z39.50, SRW/U, etc. to search
- A metasearch that is based on screen scraping
While the research libraries in the state were leaning towards #1, there was something gnawing at us that was hard to deny.
#2 really mopped the floor with #1 as a federated search engine. Not only is it dramatically faster, it can search over 95% of our databases (as opposed to #1, which is in the 30-40% range... and searches even those slowly).
Still, there is other functionality in #1 that makes it desirable (mostly revolving around workflow, integration into an academic environment, and integration with our link resolvers). As a cross-database searcher, however, #2 is clearly the winner.
What this brings me to is... how did we get to this point? Why is it so much easier, and why do we get better results, when we "break the rules"? We have invested a lot of time, thought, and energy into creating our standards... how can it possibly be easier to screen scrape results pages than to use the tools we have created?
I blame libraries first. Securing access via Z39.50, XML gateway, API, etc. has never been a particularly high priority. Metasearch is not only a "systems" issue; it also needs to be looked upon as a collection development issue. If two vendors offer "Compendex" and only one of them makes it available through means outside the native web interface, that vendor really should be considered the more desirable option, unless its native web interface is "the suck" (technical term). Along with a whole host of other factors, of course. Still, I think non-native access is a very low priority in collection development decisions.
I blame the vendors next. First of all, many of them don't offer any sort of alternative access. Secondly, when they do, it's an afterthought.
I have been toying with robcaSSon's federated search project, unhelpfully supplying suggestions when he asks #code4lib for help on particular problems. What Rob has written so far is very cool (though unfinished and therefore not publicly available), but it struck me how slowly it searched Academic Search Premier and Business Source Premier (that's Rob's "canned" query -- those two dbs with the keyword search "hamid karzai and heroin").
In the native interface, searching across those two dbs is nearly instantaneous... it's basically just waiting for the browser to render the tables that takes any time. In Rob's interface, the same search takes 5+ seconds (and this is with no load on Rob's end, since it's not in production... so real-world performance would probably be worse). Now, as we learned from metasearch product #1, this is sadly respectable in the metasearch arena. It's still bad, though, and I wanted to figure out why it took so much longer.
Using Index Data's handy yaz-client, I fired up a Z39.50 session to EBSCO's Z39.50 server to investigate. Searching "hamid karzai and heroin" took a little over 4 seconds. Hmm. 4 seconds?! So I did a search for "female genital mutilation". 0.2 seconds. Hmm. I did the original search again. 0.05 seconds. Wow. I exited out of yaz-client, reopened the connection, and did it all again. Basically the same pattern.
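If you want to poke at this yourself, the session looked roughly like the sketch below. The host, port, and database name are placeholders (EBSCO gives you the real connection details with your subscription), and the PQF form of the query is just my guess at how the keyword search maps; yaz-client's `open` and `find` are all you need to see the timing for yourself.

```
$ yaz-client
Z> open tcp:z3950.example.com:210/AcademicSearchPremier
Z> find @and "hamid karzai" heroin
Z> find "female genital mutilation"
Z> find @and "hamid karzai" heroin
Z> quit
```

The first find after the open is the slow one; every find after that on the same connection comes back almost immediately.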
So, apparently, it's the first search in a session that's the problem. And that sucks, because inherently every search in a metasearch is the first search in the session. Certainly some connections can be cached, but that raises the complexity of the application and, no matter what, not everything can be cached all the time.
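To make the "caching connections" point concrete, here's a minimal sketch in Python of what a metasearch app would have to bolt on. The connect function is deliberately left as a parameter because it stands in for whatever Z39.50 client library you happen to use (this isn't anyone's real API); the point is just the extra machinery of holding, reusing, and expiring a session per target.

```python
import time

MAX_IDLE = 300  # assumption: drop a cached session after 5 idle minutes

class ConnectionCache:
    """Keep one open Z39.50 session per target so only the very first
    search against that target pays the session-setup penalty."""

    def __init__(self, connect):
        # `connect(host, port, db)` is a hypothetical stand-in for your
        # Z39.50 library's session-opening call.
        self._connect = connect
        self._conns = {}  # (host, port, db) -> (connection, last_used)

    def get(self, host, port, db):
        key = (host, port, db)
        conn, last_used = self._conns.get(key, (None, 0.0))
        if conn is None or time.time() - last_used > MAX_IDLE:
            # Missing or stale: pay the expensive first-search cost now.
            conn = self._connect(host, port, db)
        self._conns[key] = (conn, time.time())
        return conn
```

And even this ignores targets that drop idle sessions on their end, per-user authentication, and connections that die mid-search -- which is exactly the complexity I'm complaining about.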
Now, yazproxy would be perfect for dealing with this. It could maintain the session information and at the same time transform the output to XML. Everybody wins! Well, except I can't get it to work. I guess that's a bit of a hindrance...
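For the record, what I'm trying to do is roughly this: run yazproxy in front of the target and point the metasearch clients at the proxy, which keeps the backend Z39.50 sessions alive between requests (and can massage records into XML along the way). The target address below is a placeholder, and obviously I haven't verified this end to end, since I can't get the thing running.

```
# Listen on port 9000 and forward to the real Z39.50 target (placeholder address).
$ yazproxy -t tcp:z3950.example.com:210/SomeDatabase @:9000
```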
So, again, by trying to do right and follow the standards our community has set, we are left behind by the sloppy, inexact searching of the screen scrapers. Ultimately, though, we all lose, because screen scraping can only go so far. The richness of services we can layer on top of a screen scraper has far less depth than what we can build on a structured search.
And laying on the horn doesn't really help...