
Does Big Search change science?

Matthew Ingram at the Globe and Mail takes Wired to task over a recent article that implies Big Search will save the world and change the way we solve problems. I think Matthew’s right, and that the way we can now mine vast amounts of data isn’t a substitute for science — merely an accelerant.

I think this for two reasons. First, Google can’t solve the problems with machines; and second, correlation is not causality.

I was at the Googleplex last week and saw signs on the walls urging employees to report spammy pages. They had some sort of a spam bounty, in which Google staffers could point the finger at sites that were simply reposting other information or relying on spam.

You’d think that computers could find these things. But apparently they can’t: there aren’t yet algorithms that work well enough to make humans unnecessary.

They’ll get smarter, of course. Google-scale data hooked up to some of the latest advancements in neuroscience may yet pass a Turing Test. And if it doesn’t, but it’s still smarter and better, it won’t care: It’ll just redefine the rules of the test and make itself some friends. So maybe Google will fix this one and make machines smart enough.

But the real flaw in the Wired argument is that correlation does not mean causality.

When solving problems, big data will tell us where to look. This can be a huge convenience, eliminating — for example — vast numbers of potential drugs to zero in on a cure. But the cure still has to be proven through causality.

To quote from a recent rant in e-mail (proving once and for all that my e-mails are boring and tedious):

Mistaking coincidence for correlation, and then invoking higher powers to find causality, is bad thinking.

We have a mental test for this one, thanks to Tom Stoppard’s Rosencrantz and Guildenstern Are Dead: Ask any person how likely it is that a coin will land heads ten times in a row. “Vastly unlikely!” they’ll say. “It must be a miracle!”

Of course, it’s exactly as likely as any other specific sequence of ten flips. The problem is the human brain wants to find patterns in there, and ten-heads-in-a-row is an easy pattern to remember. So it seems improbable, when in fact it’s just as probable as harder-to-remember sequences. This is pure observer bias. We do it to other things, particularly noticeable ones (stubbing our toes, winning at a raffle).
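To put numbers on the coin example, here is a quick sketch in Python. It assumes a fair coin and independent flips; the function names are mine, made up for illustration. It compares the odds of one exact sequence with the odds of "five heads in any order", which is probably what our brains are really comparing ten heads against.

```python
from fractions import Fraction
from math import comb


def sequence_probability(sequence: str) -> Fraction:
    """Probability of one exact sequence of fair coin flips, e.g. 'HHTH'."""
    # Each flip is independent with probability 1/2, so only the length matters.
    return Fraction(1, 2) ** len(sequence)


def heads_count_probability(flips: int, heads: int) -> Fraction:
    """Probability of getting `heads` heads in `flips` fair flips, in any order."""
    return Fraction(comb(flips, heads), 2 ** flips)


all_heads = sequence_probability("H" * 10)       # ten heads in a row
mixed = sequence_probability("HTTHHTHTTH")       # a forgettable-looking sequence
five_heads = heads_count_probability(10, 5)      # any ordering with five heads

print(all_heads, mixed, five_heads)
```

Both exact sequences come out at 1 in 1,024. What makes ten heads feel miraculous is that only one sequence looks like that, while 252 different sequences look like "about half heads".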

The classic example a great stats teacher of mine used once was that you’re more likely to drown when you eat ice cream. As a claim about cause, this is nonsense, of course: There’s a third factor, summertime, that increases both drowning and ice cream consumption. Google can point us at the correlation, but it’s up to us to ferret out the causality.
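The ice cream example is easy to fake in a few lines of Python. This is a toy simulation, not real data: every number in it (the seasonal averages, the noise, the 50/50 split between summer and winter days) is invented for illustration, and ice cream has no effect on drowning anywhere in the code.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(days=10_000, seed=0):
    """Made-up daily figures where the season drives both quantities."""
    rng = random.Random(seed)
    rows = []
    for _ in range(days):
        summer = rng.random() < 0.5
        # Hypothetical averages: both go up in summer, independently of each other.
        ice_cream = (10 if summer else 2) + rng.gauss(0, 1)
        drownings = (5 if summer else 1) + rng.gauss(0, 1)
        rows.append((summer, ice_cream, drownings))
    return rows


rows = simulate()
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

summer_rows = [r for r in rows if r[0]]
within_summer = pearson([r[1] for r in summer_rows], [r[2] for r in summer_rows])

print(f"all days:    {overall:.2f}")
print(f"summer only: {within_summer:.2f}")
```

Across all days the two series are strongly correlated. Look only at summer days, so the season is held fixed, and the correlation should all but vanish. The data can show us the first number; it takes a person asking about the season to go looking for the second.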

The risk here comes from how good science works: it is about disproving things. You have a hypothesis, and you set out to prove it wrong. Big Data can give us hypotheses faster, and those hypotheses may be more likely to be true. But science has to be more vigilant than ever with those hypotheses in order not to inadvertently accept coincidences as fact.
