I’ll show you mine…

PeeZed got quoted in a Nature article on the new Google Base system, which is way cool for all of us who get to happily bask in his fame and glory. I don’t think Nature would want to interview me as a “bioinformatician”, though. I’m at best a dilettante in that area, and would recommend people like James Foster and Wolfgang Banzhaf as much more on top of it than I am.

I didn’t know about this new foray of Google’s, but it does have some interesting possibilities. Some 15 years ago my dissertation research was in the field of Automated Reasoning (computer theorem proving). One of the things being discussed way back then was the problem of replicating results. Many papers took the form of “We used our nifty system to prove theorem X” (where presumably this was interesting because X hadn’t been proven automatically before, or the approach had some other nifty properties). The problem was that the programs that carried out these proofs were often very big (mine was a combination of two large systems by other researchers and a bunch of code of my own pasting them together), and machine-generated proofs tend to be long and detailed (and less than riveting). Given the typical page limits, the code and the proofs themselves rarely if ever made it into the publications, which meant there was no way to replicate the results: you couldn’t re-run the proofs, nor could you check the details of the proof being reported.

There’s a similar issue in my current area of research (Evolutionary Computation - EC). To (exactly) reproduce EC results you’d need the code (which is again big), and to understand the analysis you’d need the details of the populations and individuals over time (which is huge).

One possibility that has been discussed in both communities is making it a requirement for publication that authors make their code and the details of their runs available to the reviewers and the research community. A sticking point in both cases was intellectual property concerns, especially for researchers who worked in corporate environments.

An equally important problem, though, is how to make that data available to the community in a broadly accessible, archival way. Sure, I can hang my code and data off my web site, or the conference organizers can post them on theirs, but we all know that these web sites often aren’t reliable in the long term, and the files may not be indexed in a useful way, especially if they’ve been compressed or processed in non-standard ways.

This Google Base idea could look like a really nice solution to those problems if you’re willing to believe that Google represents an archival storage mechanism. And that’s a huge “if”. Google’s a corporation with all the risks that go with that and the attendant obligations to their investors and stockholders. As much as I depend on Google (I must use their service about a zillion times a week), they’re not a public service or a community resource. Thus it would make me nervous to rely on them for this sort of service, even though I suspect that they’d do a really good job of it.

That said, I still think it’s crucial for researchers to share their data if we’re going to call it science. Replicability and analysis are key, and for that we often need a lot more information than fits in an eight-page conference paper.

I’ll show you mine if you show me yours…
