Dilettante's Ball: The view from the moral high ground reveals that I've fallen behind

Wednesday, June 01, 2005

I drove to work today. Normally, I take the train, but for several stupid reasons, I decided to drive.

Atlanta drivers have a disdain for "rules" and "traffic laws" that makes me want to scream. They will pass you in an exit-only lane only to hold up that lane when they try to merge back into traffic 3 cars ahead. Four or five cars will plow through the intersection after the light has changed. The turn signal is apparently a sign of weakness.

So, because some fathead wants to get 6 car lengths ahead, we all sit and suffer in some of the worst traffic in the country. By actually following the rules, you:

Are forced to sit in the traffic being caused by those that break the rules
Are probably more of a liability on the road because if everyone is doing "wrong" things, you are the unpredictable one by being different.

Which brings me to metasearch (and a jarring segue). I am currently not sitting in a meeting in Macon to decide which metasearch product the state is going to go with. And, really, it doesn't matter that much.

Although I won't name any names, the candidates were down to two choices:

1. A "traditional" metasearch that uses standards like Z39.50, SRW/U, etc. to search
2. A metasearch that is based on screen scraping

While the research libraries in the state were leaning towards #1, there was something gnawing at us that was hard to deny.

#2 really mopped the floor with #1 as a federated search engine. Not only is it far faster, but it is capable of searching over 95% of our databases (as opposed to #1, which is in the 30-40% range... and does that slowly).

Still, there is other functionality in #1 that makes it desirable (mostly revolving around workflow, integration into an academic environment, and integration with our link resolvers). As a cross database searcher, however, #2 is clearly the winner.

What this brings me to is... How did we get to this point? Why is it actually so much easier, and why does it bring better results, when we "break the rules"? We have invested a lot of time, thought and energy into creating our standards... how can it possibly be easier to screen scrape results pages rather than use the tools we have created?

I blame libraries first. Securing access via Z39.50, XML gateway, API, etc. has never been a particularly high priority. Metasearch is not only a "systems" issue. It also needs to be looked upon as a collection development issue. If two vendors have "Compendex" and only one of them makes it available through means outside the native web interface, unless that vendor's native web interface is "the suck" (technical term), they really should be considered the more desirable option. Along with a whole host of other factors, of course. Still, I think non-native access is a very low priority among collection development decisions.

I blame the vendors next. First of all, so many of them don't even offer some sort of alternative access. Secondly, if they do, it's an afterthought.

I have been toying with robcaSSon's federated search project, unhelpfully supplying suggestions when he asks #code4lib for help on particular problems. What Rob has written so far is very cool (but unfinished and therefore not publicly available). Still, it struck me how slowly it searched Academic Search Premier and Business Source Premier (that's Rob's "canned" query -- those two dbs with the keyword search "hamid karzai and heroin").

In the native interface, searching across those two dbs is nearly instantaneous... it's basically just waiting for the browser to render the tables that takes any time. In Rob's interface, it takes about 5+ seconds to do the same search (and this is with no load on Rob's end, since it's not in production... so real world performance would probably be lower). Now, as we learned from Metasearch product #1, this is sadly respectable in the metasearch arena. It's still bad, though, and I wanted to figure out why it took so much longer.

Using indexdata's handy yaz-client, I fired up a Z39.50 session to EBSCO's Z39.50 server to investigate. Searching "hamid karzai and heroin" took a little over 4 seconds. Hmm. 4 seconds?! So I did a search for "female genital mutilation". 0.2 seconds. Hmm. I did the original search again. 0.05 seconds. Wow. I exited out of yaz-client and then reopened the connection and did it all again. Basically the same thing.
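If you want to repeat that kind of measurement without typing into yaz-client by hand, something like the following minimal Python sketch would do. To be clear about what is assumed: `FakeSession` is a hypothetical stand-in I made up so the example runs on its own, not a real Z39.50 client, and its delays are invented rather than EBSCO's actual numbers. The only real point is `time_searches`, which times each search in order on a single session so a slow first search stands out.

```python
import time


class FakeSession:
    """Hypothetical stand-in for a Z39.50 session; not a real client library.

    It only imitates the behaviour described above: the first search in a
    session pays a one-time cost, later searches do not.
    """

    def __init__(self, first_search_cost=0.05, search_cost=0.001):
        self.first_search_cost = first_search_cost
        self.search_cost = search_cost
        self.searches_run = 0

    def search(self, query):
        # Simulate the one-time penalty on the first search only.
        if self.searches_run == 0:
            delay = self.first_search_cost
        else:
            delay = self.search_cost
        time.sleep(delay)
        self.searches_run += 1
        return []  # a real session would return a result set here


def time_searches(session, queries):
    """Run each query in order on one session; return (query, seconds) pairs."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        session.search(query)
        timings.append((query, time.perf_counter() - start))
    return timings


if __name__ == "__main__":
    queries = [
        "hamid karzai and heroin",
        "female genital mutilation",
        "hamid karzai and heroin",
    ]
    for query, seconds in time_searches(FakeSession(), queries):
        print(f"{seconds:.3f}s  {query}")
```

Swapping `FakeSession` for a wrapper around a real client should be enough to see whether a given target shows the same first-search pattern.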

So, apparently it's the first search in a session that's a problem. And that sucks. Inherently, every search in a metasearch is the first search in the session. Certainly some connections can be cached, but this definitely raises the complexity of the application and, no matter what, not everything can be cached all the time.
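The caching idea can be sketched roughly like this: keep a few sessions per target open and "warmed" with a throwaway search, so the slow first search happens before any user is waiting. This is a sketch under assumptions, not how any particular product does it. `FakeSession`, `WarmSessionPool`, the warm-up query and the target names are all hypothetical, and the fake session just sleeps to imitate the first-search penalty.

```python
import queue
import time


class FakeSession:
    """Hypothetical stand-in for a Z39.50 session (not a real client API)."""

    def __init__(self, target, first_search_cost=0.05):
        self.target = target
        self.first_search_cost = first_search_cost
        self.warmed = False

    def search(self, query):
        if not self.warmed:
            time.sleep(self.first_search_cost)  # simulated first-search penalty
            self.warmed = True
        return []  # a real session would return a result set here


class WarmSessionPool:
    """Keeps already-warmed sessions per target so user searches skip the penalty."""

    def __init__(self, targets, size=2, warmup_query="test", factory=FakeSession):
        self._factory = factory
        self._warmup_query = warmup_query
        self._idle = {target: queue.Queue() for target in targets}
        for target in targets:
            for _ in range(size):
                self._idle[target].put(self._new_warm_session(target))

    def _new_warm_session(self, target):
        session = self._factory(target)
        session.search(self._warmup_query)  # throwaway search absorbs the slow start
        return session

    def search(self, target, query):
        try:
            session = self._idle[target].get_nowait()
        except queue.Empty:
            # Pool exhausted: open and warm a fresh session, paying the cost now.
            session = self._new_warm_session(target)
        try:
            return session.search(query)
        finally:
            self._idle[target].put(session)  # hand the session back for reuse
```

This also shows where the added complexity comes from: the sketch does nothing about sessions the server has timed out, and a pool like this only helps for targets you decided to warm in advance.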

Now, yazproxy would be perfect for dealing with this. It could maintain the session information and at the same time transform the output to xml. Everybody wins! Well, except I can't get it to work. I guess that's a bit of a hindrance...

So, again, by trying to do right and follow the standards our community has set, we are left behind the sloppy, inexact searching of a screen scraping method. Ultimately, we all lose, though, because screen scraping can only go so far. The richness of services we can layer upon a screen scraper has far less depth than that of a structured search.

And laying on the horn doesn't really help...

8 Comments:

At 2:35 PM, June 01, 2005, Anonymous said...: Good old SiteSearch did a great job of pooling Z39.50 sessions. I think any metasearch application is going to have to do that; it's the nature of Z39.50. The Z39.50 toolkits make this fairly easy, I think.
At 5:38 PM, June 01, 2005, Ross said...: Ah, good old, complicated, poorly-documented SiteSearch, yes, I knew you well.

I mean as well as I could, given how poorly documented it was.

So, I just tried this in OpenSiteSearch (reopening old wounds in the process) and there exists the same problem. Again, only with the first search in a session. I did the same searches. "hamid karzai and heroin" took 10 seconds for the first search in OpenSiteSearch. "Female genital mutilation" took somewhere between 2 and 3 seconds after that.

The connection pooling only seems to exist after the session has been started. But, on reflection, you're right... the Yaz toolkit will do this... just after the user is already turned off by the speed.
At 10:51 PM, June 01, 2005, noizy said...: argh, all this geeky library talk! I'm coming over all giddy...
At 2:21 PM, June 02, 2005, Anonymous said...: You could configure SiteSearch to open a connection to certain targets (or was it just one?) when you start a session, so the Z39.50 session start-up overhead would be happening while you were entering your first search. It's in the manual somewhere.
At 2:22 PM, June 02, 2005, Anonymous said...: Sorry about the geekery, by the way. Jeeves? The smelling salts for Natalie, please.
At 2:57 PM, June 02, 2005, Ross said...: "It's in the manual somewhere."
You're kidding, right?

Besides, the problem doesn't seem to revolve around the connection. It definitely seems to be the first search. Maybe through some AJAX-ery, you could do some sort of scan request when the problem targets are selected.

And never apologize about library geekery. Without it we'd be living under a bridge somewhere subsisting on dog food.
At 6:41 PM, June 02, 2005, noizy said...: indeed, don't be sorry about the geekery aspect. I love it. I was coming over all giddy in a good way. ;)
At 4:33 PM, July 27, 2005, Anonymous said...: bored at the reference desk, reading ross's old posts...

i've got the yazproxy solution working....adds one more layer that i'm loath to add, but as a POC (proof-of-concept, not piece-of-crap) it works...

and as for releasing my federated searcher, i'm close...just need to get thru a couple more projects, and get back to it......
