There’s an old saying in the data management world: garbage in, garbage out, or GIGO. It means that the results of any data analysis project are only as good as the quality of the data being analyzed. Data quality is of critical importance when data sets are relatively small and structured. If you only have a small sample of data on which to perform your analysis, it better be good data. Otherwise, the resulting insights aren’t insights at all.
But does GIGO apply in Big Data scenarios? From a purely practical standpoint, is it realistic to cleanse and scrub data sets that reach the hundreds of terabytes to petabytes level to the same degree as smaller data sets? And since most Big Data lacks traditional structure, just what does data quality look like?
Consider that many data quality issues in smaller, structured data sets are man-made. A call center representative inputs the wrong digits. A customer selects the wrong option from a drop-down menu. These errors can be fixed fairly easily, if they’re caught. But most data in Big Data scenarios is machine-generated, such as log files, GPS data or click-through data. If a piece of industrial equipment starts stamping streaming log files with incorrect dates and times, for example, the problem will quickly multiply. And retroactively applying the correct date and time to each log file, little more than strings of digits and dashes, will be a daunting, if not futile, task.
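To make that concrete, here is a minimal sketch, in Python, of the kind of timestamp sanity check that could flag such drift as log lines arrive. The log format, field layout and skew tolerance are assumptions for illustration only, not a prescription for any particular platform; the point is that catching a misbehaving clock at ingest is far cheaper than repairing millions of records after the fact.

```python
from datetime import datetime, timedelta

# Hypothetical log line format: "2013-07-30T14:05:22Z,device-42,OK"
MAX_SKEW = timedelta(minutes=5)  # assumed tolerance for clock drift

def flag_bad_timestamps(lines, now=None):
    """Yield (line_number, line) for lines whose timestamp is missing,
    unparseable, or implausibly far from the reference time."""
    now = now or datetime.utcnow()
    for i, line in enumerate(lines, start=1):
        stamp = line.split(",", 1)[0]
        try:
            ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            yield i, line          # malformed or missing timestamp
            continue
        if abs(now - ts) > MAX_SKEW:
            yield i, line          # clock drift or a reset clock

if __name__ == "__main__":
    sample = [
        "2013-07-30T14:05:22Z,device-42,OK",
        "1970-01-01T00:00:01Z,device-42,OK",   # equipment clock reset
        "not-a-timestamp,device-42,FAIL",
    ]
    for lineno, bad in flag_bad_timestamps(sample, now=datetime(2013, 7, 30, 14, 6)):
        print("suspect line %d: %s" % (lineno, bad))
```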
Further, Big Data evangelists maintain that the sheer volume of data in Big Data scenarios mitigates the effects of occasional poor data quality. If you’re exploring petabytes of data to identify historical trends, a few data input errors will barely register as a blip on a dashboard or report. Is it even worth the time and effort, then, to apply data quality measures in such a scenario? Probably not.
But that doesn’t mean data quality isn’t important to Big Data. This is particularly true in real-time transactional scenarios. Big Data applications that recommend medicines and doses for critically ill patients, for example, had better rely on good data. The same goes for Big Data operational applications that support commercial aviation, the power grid and other Industrial Internet use cases.
There are no easy answers to these questions, but clearly it’s important for practitioners to understand the source and structure of the data in question, as well as the data quality requirements of a given Big Data use case, in order to determine the level and type of data quality tools and measures to apply. There’s also the human element to consider: someone needs to “own” data quality for Big Data projects, or it can easily be overlooked.
We will be exploring these and other topics, including the role of the Chief Data Officer in the enterprise, all day tomorrow at the 2013 MIT CDOIQ Symposium in Cambridge, Mass. Tune in at Wikibon.org/blog or SiliconANGLE.com starting at 10:30 am EST, and join the conversation on Twitter, hashtags #theCUBE and #MITIQ.