Friday, October 25, 2013

October 2013 Early Indications: How do companies "do" Big Data?

1) Some of you may have seen a piece I wrote in the October 21 Wall Street Journal, on the risks of "big data" for companies trying to adopt these technologies.

2) Writing for a different audience, the same topic gets a different take:

"A riddle wrapped in a mystery inside an enigma."

That was Winston Churchill speaking of Russian politics in 1939, but it can also apply with uncanny accuracy to what the IT industry refers to as "Big Data." On one hand, it's intuitively obvious that we have more and faster computers, more sensors, and more data storage (including that "cloud" business) than ever before. At the same time, few of us can grasp what astrophysicists, or Facebook software engineers, or biostatisticians actually do, so applying Big Data to commerce can be a bit daunting.

To begin with, big data has a nomenclature problem. Like so many other technologies -- smartphones, robots, or information security -- the popular name doesn't really convey the essence of the situation. Yes, "big data" can involve very large volumes in some cases. But more generally, the phrase refers to new kinds of data, generated, managed, and parsed in new ways, not merely bigger ones.

While "Big Data" is a vague phrase, there is some agreement that it involves changes in scale along three dimensions:
* Volume: whether it's your own hard disk space, the world's online video feeds, or a wealth of digital sensors measuring many aspects of the world, signs are abundant that data volumes are increasing steadily and substantially.
* Variety: Big data is not only a matter of bigger relational databases. As opposed to the familiar numbers related to customer ID, SKU, or price and quantity, we are living in an age of massive amounts of unstructured data: e-mails, Facebook "likes," Tweets, machine traffic, and video.
* Velocity: Overnight batch processes are getting to be less and less tenable as the world becomes an "always-on" information environment. When FedEx can tell me where my package is, or Fidelity can tell me my net worth, or Google Analytics can tell me my website performance right now, the pressure is on more and more other systems to do likewise.

Assuming a business can get past the vocabulary, Big Data presents challenges in many ways:

Here's a quiz: ask someone in the IT shop how many of his or her colleagues are qualified to work in Hive, Pig, Cassandra, MongoDB, or Hadoop.  These are some of the tools emerging from the front-runners in Big Data, web-scale companies including Google (index the entire Internet), Facebook (manage a billion users), Amazon (construct and run the world's biggest online merchant), and Yahoo (figure out what social media is conveying at the macro scale).  Outside this small industry, Big Data skills are rare.

Complicating the matter, most of these tools are open-source projects, which means a) that the code is free, b) that the pace of innovation is rapid, to the point where staying current is an issue, and c) that corporate training and support aren't as robust as they are in the traditional data world. Big data tools are evolving rapidly, aren't being taught in the universities, and require levels of flexibility from their users that more mature tools do not. One telling fact: when non-web companies post a "Big Data" job specification, oftentimes nobody can state what the key skill sets are or how they map to the existing team. Furthermore, knowing Hadoop doesn't mean the skilled IT candidate knows insurance, or pharmaceuticals, or energy: ground truth matters in data analysis, so getting skills plus domain knowledge is a substantial challenge in many cases.

Control over information is frequently thought to bring power within an organization. Big data, however, is heterogeneous, multi-faceted, and can bring performance metrics where they had not previously operated. If a large retailer, hypothetically speaking, traced its customers' purchase behavior first to social media expressions and then to advertising channel, how would the various budget-holders respond? Uncertainty as to ad spend efficacy is as old as advertising, but tracing ad channels to purchase activity might bring light where perhaps it is not wanted. Information sharing across organizational boundaries ("how are you going to use this data?") can also be unpopular.

Another political danger lies in the realization that "what gets measured gets gamed," as we noted in a recent newsletter. If a senior executive requests a dashboard including high-volume, high-velocity metrics such as web clicks or Twitter mentions, there can be a temptation to abandon revenue-generating activities that might operate in a 3- or 6-month sales cycle, for the instant reward of a new metric that ultimately does nothing for either top or bottom line. Budgeting, meanwhile, is complicated by the open-source nature of these tools: the software may be free, but hardware, especially at this large scale and even when procured from cloud vendors such as IBM or Amazon, behaves oddly compared with what traditional IT shops are used to. Add in the scarce skills, and the evolving skills mix, and Big Data can cost more than may initially be projected.

Risk can also fall under the political heading: who is willing to stick his neck out to support adoption of technologies that are both immature and broad-ranging? As more data is gathered, it can leak or be stolen. Supposedly anonymous records can often be de-anonymized; in a famous paper, a former governor of Massachusetts was connected to his own records in a public-health database. Insufficient privacy is in some ways a math issue of large sparse data sets; there are also engineering risks. Here's one: implemented well, the security of large traditional databases can be very sturdy, but information security in the Big Data toolset has far to evolve before it can be called robust.
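The de-anonymization risk is easy to sketch. In the toy example below, every record is invented, and the choice of ZIP code, birth date, and sex as the fields shared by both data sets is an assumption made for illustration; this is a minimal join, not a reconstruction of the method in the paper mentioned above.

```python
# Toy linkage attack: all records are invented for illustration.
# "Anonymized" health records: names removed, quasi-identifiers kept.
health_records = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth_date": "1962-07-04", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02138", "birth_date": "1971-03-22", "sex": "F", "diagnosis": "condition C"},
]

# A public list (a voter roll, say) that does include names.
public_list = [
    {"name": "Pat Example", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Sam Sample", "zip": "02139", "birth_date": "1980-11-30", "sex": "M"},
]

# Hypothetical choice of fields that appear in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(anonymous, named, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where the shared fields match exactly one named person."""
    index = {}
    for person in named:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for record in anonymous:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record))
    return matches


reidentified = link_records(health_records, public_list)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

The point of the sketch is the sparsity: the more columns a data set keeps, the more likely that some combination of them is unique to one person.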

Given that relational databases have been around for about 35 years, a substantial body of theory and practice makes these environments predictable. Big Data, by contrast, is just being invented, but already there are some important differences between the two:

Most enterprise data is generated by or about humans and organizations: SKUs are bought by people, bills are paid by people, health care is provided to people, and so on. At some level, many human activities can be understood at human scale. Big data, particularly social media, can come from people too, but in more and more cases, it comes from machines: server logs, POS scanner data, security sensors, GPS traces. Given that these new types of data don't readily fit into relational structures and can get massively large in terms of storage, it's nontrivial to figure out what questions to ask of these data types.

When data is loaded into relational systems, it must fit predefined categories that ensure that what gets put into a system makes sense when it is pulled out. This process implies that the system is defined at the outset for what the designers expect to be queried: the questions are known, more or less, before the data is entered in a highly structured manner. In Big Data practice, meanwhile, data is stored in as complete a form as possible, close to its original state. As little as possible is thrown out so queries can evolve and not be constrained by the preconceptions of the system. Thus these systems can look highly random to traditional database experts.
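A small sketch may make the contrast concrete. It uses Python's built-in sqlite3 module to stand in for the relational side and a list of raw JSON strings to stand in for the "keep everything" side; the event fields (customer_id, sku, referrer, device) are hypothetical, and real Big Data stores work at a very different scale.

```python
import json
import sqlite3

# Raw events as they might arrive; the field names are hypothetical.
raw_events = [
    '{"customer_id": 1, "sku": "A-100", "qty": 2, "price": 9.99}',
    '{"customer_id": 2, "sku": "B-200", "qty": 1, "price": 24.50, "referrer": "social"}',
    '{"customer_id": 1, "sku": "A-100", "qty": 1, "price": 9.99, "device": "phone"}',
]

# Relational style: the schema is fixed before any data is loaded,
# so fields the designers did not anticipate are dropped on the way in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, sku TEXT, qty INTEGER, price REAL)")
for line in raw_events:
    e = json.loads(line)
    conn.execute(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        (e["customer_id"], e["sku"], e["qty"], e["price"]),
    )
units_by_sku = dict(conn.execute("SELECT sku, SUM(qty) FROM sales GROUP BY sku"))


# "Store it raw" style: keep every event whole and impose structure at query time,
# so a question nobody planned for (events by referrer) can still be asked.
def count_by(field, lines):
    counts = {}
    for line in lines:
        value = json.loads(line).get(field, "unknown")
        counts[value] = counts.get(value, 0) + 1
    return counts


events_by_referrer = count_by("referrer", raw_events)
print(units_by_sku)
print(events_by_referrer)
```

The table answers the question it was designed for and nothing else; the raw events can answer the referrer question only because nothing was thrown away at load time.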

It's important to stress that Big Data will not replace relational databases in most scenarios; it's a matter of now having more tools to choose from for a given task.

Traditional databases are designed for a concrete scenario, then populated with examples (customers, products, facilities, or whatever), usually one per row: the questions and answers one can ask are to some degree predetermined. Big data can be harvested in its original form and format, and then analyzed as the questions emerge. This open-ended flexibility can of course be both a blessing and a curse.

Traditional databases measured the world in numbers and letters that had to be predicted: zip codes were 5 or 9 digits, SKU formats were company-specific, and mortgage payments were of predictable amounts. Big Data can accommodate Facebook "likes," instances of the "check engine" light illuminating, cellphone location mapping, and many other types of information.

Traditional databases are limited by the computing horsepower available: to ask harder questions often means buying more hardware. Big Data tools can scale up much more gracefully and cost-effectively, so decision-makers must become accustomed to asking questions they could not contemplate previously. To judge advertising effectiveness, one cable operator analyzed every channel-surfing click of every remote across every household in its territory, for example: not long ago, such an investigation would have been completely impractical.

Both the increasing prevalence of machine data and the storage of data in near-native form generate major differences in information technique. Furthermore, the scientific method can play a more central role given that experimental findings are collected into less constrained systems for analysis. Web businesses have done A/B testing for years: 100 random site visitors get a red banner, a 10% off coupon, or a personalized element while a control group gets a blue banner, a 15% coupon, or a generic site greeting. (In Google's case, 41 shades of blue were run through A/B testing in 2009 before the same color was standardized on both Gmail and the home page.) Superior performance can be assessed, tweaked, and retested.
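For readers who want to see how "superior performance can be assessed," here is a minimal sketch of one common way to compare two variants: a two-proportion z-test on conversion rates. The visitor and conversion counts are invented, the sample is far larger than the 100 visitors mentioned above so the arithmetic is stable, and nothing here is meant to describe how any of the companies named actually run their tests.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical results: control (blue banner) vs. variant (red banner).
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the difference is unlikely to be chance alone; it does not say the difference is large enough to matter to the business, which is a separate judgment.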

Harrah's Casinos use this method to test marketing ideas, as does Capital One. Scientific experimentation does not require Big Data, but Big Data begs in many cases for sophisticated statistical methods. The R programming language, for example, marks a substantial step beyond what even pivot-table experts can do in Excel. While it's obviously an issue that the skills are in short supply, the deeper question is one of mindset within businesses that may use history, intuition, or other methods to ground decisions. Where is the will, the budget, the insight to employ Big Data approaches?

What does it mean to think at scale? How do we learn to ask questions of the transmission of every car on the road in a metropolitan area, of the smartphone of every customer of a large retail chain, or of every overnight parcel in a massive distribution center? How can more and more businesspeople learn to think probabilistically rather than anecdotally?

The mantra that "correlation doesn't imply causation" is widely chanted yet frequently ignored; it takes logical reasoning beyond statistical relationships to test what's really going on. Unless the data team can grasp the basic relationships of how a given business works, the potential for complex numerical processing to generate false conclusions is ever-present. Numbers do not speak for themselves; it takes a human to tell stories, but as Daniel Kahneman and others have shown, our stories often embed mental traps.
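A toy example shows how a strong correlation can arise with no causal link at all. The data below is invented and deliberately constructed: a hypothetical "seasonal demand" level drives both ad spend and sales, while within any one demand level the two are unrelated. The variable names are illustrative assumptions, not drawn from any real business.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented data: seasonal demand drives BOTH ad spend and sales.
# Within a demand level, the small variations in each are unrelated by construction.
rows = [
    (demand, demand + ad_noise, demand + sales_noise)
    for demand in (10, 20, 30)
    for ad_noise in (-1, 1)
    for sales_noise in (-1, 1)
]

# Correlation across all rows, ignoring the demand level.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Correlation within each demand level, i.e. holding the confounder fixed.
within = {
    d: pearson([r[1] for r in rows if r[0] == d], [r[2] for r in rows if r[0] == d])
    for d in (10, 20, 30)
}
print(f"overall correlation: {overall:.3f}")
print("within each demand level:", within)
```

Someone who sees only the overall number might conclude that ad spend drives sales; someone who knows the business knows to ask about the season first.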

Spreadsheets remain ubiquitous in the modern enterprise, but numbers at the scale of Google, Facebook, or Amazon must be conveyed in other ways. Sonification -- turning numbers into a range of audible tones -- and visualization show a lot of promise as alternative pathways to the brain, bypassing mere, non-intuitive numerals. In the meantime, the pioneers are both seeing the trail ahead and taking some arrows in the back for their troubles. But the faster managers begin to break the stereotype that "Big Data is what we've always done, just with more records or fields," the faster the breakthrough questions, insights, and solutions will redefine business practice.

Summing Up
Not surprisingly, the principles of good management extend to the domain of Big Data even though the hype and rapid pace of change can create confusion. Before businesses can profit from Big Data, managers must refuse to get lost in the noise that can obscure the basic forces represented by customers, value, and execution. The scale, speed, and diversity of Big Data can make it feel foreign, but by both refusing to be dazzled by numerical tsunamis and insisting on the basics of sound analytical practice (particularly in regard to causal relationships), any executive can contribute to the next generation of performance-enhancing information practice.