Big Data is now TOO BIG - and we're drowning in toxic information
Just why are we hoarding every last binary bit?
By Matt Asay • In Cloud • At 10:00 GMT 4th June 2012
Open ... and Shut
Unless you have found a clever way of avoiding the internet completely, you no doubt have been warned that THERE IS A BIG DATA EXPLOSION! By many accounts, we are currently drowning in information - from log files to stock charts to customer profiles - and face a host of new products cropping up to help us manage the onslaught. Unfortunately, our fixation on hoarding and storing data may actually be making the problem worse, not better.
This is the message of Nassim Taleb's forthcoming book, Antifragile. Taleb made his name with the influential book The Black Swan, and his theory is bound to ruffle some feathers. As he explains in an excerpt from the book:
In business and economic decision-making, data causes severe side effects - data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well-discussed property of data: it is toxic in large quantities - even in moderate quantities.
How can this be? We're told at every turn that more data equals better decisions. Yes, we need to parse all that binary to derive "actionable insights", which is the buzzphrase currently making the rounds of every Big Data startup's VC pitch deck. But once you do, ka-BOOM! Your business will immediately have superhero powers.
Except, of course, that it won't.
According to a Gartner survey [PDF], while the volume of corporate data is growing by upwards of 60 per cent each year, the vast majority of respondents (73 per cent) feel their competitors make better use of data than they do. And a mere 17 per cent reveal that they use more than 75 per cent of their data, which suggests most companies collect lots of data and have no clue what to do with them all.
But imagine what will happen when everyone uses data efficiently and to maximum potency: by definition, any competitive advantage will dissipate as all companies (and competitors) become Big Data maestros together. Of course, this will happen at different speeds for different companies, making the race to make sense of corporate data worthwhile.
But it still doesn't tackle Taleb's larger point: the more data we analyse, the more likely our insights from the data will be wrong. Quoting Taleb at length to ensure his point is not lost:
The more frequently you look at data, the more noise you are disproportionally likely to get (rather than the valuable part called the signal); hence the higher the noise to signal ratio. And there is a confusion, that is not psychological at all, but inherent in the data itself.
Say you look at information on a yearly basis, for stock prices or the fertilizer sales of your father-in-law's factory, or inflation numbers in Vladivostok. Assume further that for what you are observing, at the yearly frequency the ratio of signal to noise is about one to one (say half noise, half signal) - it means that about half of changes are real improvements or degradations, the other half comes from randomness. This ratio is what you get from yearly observations.
But if you look at the very same data on a daily basis, the composition would change to 95 per cent noise, 5 per cent signal. And if you observe data on an hourly basis, as people immersed in the news and market price variations do, the split becomes 99.5 per cent noise to 0.5 per cent signal. That is two hundred times more noise than signal - which is why anyone who listens to news (except when very, very significant events take place) is one step below sucker. ...
Now let’s add the psychological to this: we are not made to understand the point, so we overreact emotionally to noise. The best solution is to only look at very large changes in data or conditions, never small ones.
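To see where numbers like these can come from, here is a minimal sketch in Python. It is not Taleb's own calculation: it assumes a simple model in which the signal (a steady trend) accumulates in proportion to time, while the noise (random fluctuation) grows only with the square root of time. The function name `signal_share` and the choice of 365 days of 24 hours each are illustrative assumptions.

```python
import math


def signal_share(periods_per_year, yearly_signal=1.0, yearly_noise=1.0):
    """Fraction of an observed change that is signal at a given sampling rate.

    Assumed model (not from the article): over one year the signal and the
    noise are the same size (1:1). Splitting the year into n periods shrinks
    the per-period signal by n, but the per-period noise only by sqrt(n).
    """
    signal = yearly_signal / periods_per_year
    noise = yearly_noise / math.sqrt(periods_per_year)
    return signal / (signal + noise)


if __name__ == "__main__":
    # Sampling frequencies discussed in the quoted passage.
    frequencies = [("yearly", 1), ("daily", 365), ("hourly", 365 * 24)]
    for label, periods in frequencies:
        share = signal_share(periods)
        print(f"{label:>6}: {share:.1%} signal, {1 - share:.1%} noise")
```

Under these assumptions the signal share is 50 per cent at yearly frequency, about 5 per cent daily and roughly 1 per cent hourly. The daily figure matches Taleb's; the hourly one comes out somewhat higher than his 0.5 per cent, so he is presumably working from different assumptions, but the direction is the same: the more often you look, the more of what you see is noise.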
None of which is to suggest that there's no value in Big Data. One of the brand-name companies in Big Data, Cloudera, showcases a range of customer stories that describe ways real companies have derived real value from their data. (Disclosure: Cloudera's CEO is on the board of directors of my company, Nodeable.)
But let's not miss the trees for the forest. As Nick Carr writes, commenting on Taleb's findings: "Because we humans seem to be natural-born signal hunters, we're terrible at regulating our intake of information. We'll consume a ton of noise if we sense we may discover an added ounce of signal. So our instinct is at war with our capacity for making sense."
In other words, the problem isn't the data: it's our ability to know when we have enough data.
We're in the midst of a gold rush, when there's such a fever to collect data that we may be overextending ourselves. One former senior IT executive with one of Silicon Valley's largest web companies acknowledged that his company stores every log file - and does absolutely nothing with them. It never has, and likely never will. Some people suggest the answer is to start deleting this data to keep it manageable and to avoid security breaches. Maybe.
But perhaps a better solution would be to carefully consider which data are likely to be of use, and focus on these data. Yes, this runs the risk of overlooking data that could be useful but may not be immediately recognised as such. Splunk, after all, went public on the premise that log files and other machine data are a gold mine for insight into one's business and IT operations, a gold mine that many had previously overlooked.
But we're not currently struggling to collect data. The industry's big need right now is to parse data, and a big part of that surely must be paring down the amount of data we collect in the first place. ®
Matt Asay is senior vice president of business development at Nodeable, offering systems management for managing and analysing cloud-based data. He was formerly SVP of biz dev at HTML5 start-up Strobe and chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open ... and Shut, appears three times a week on The Register.