1) Some of you may have seen a piece I wrote in the October 21 Wall Street Journal, on the risks of "big
data" for companies trying to adopt these technologies.
2) Writing for a different audience, the same topic gets a different take:
"A riddle wrapped in a mystery inside an enigma."
That was Winston Churchill speaking
of Russian politics in 1939, but it can also apply with uncanny accuracy
to what the IT industry refers to as "Big Data." On one hand, it's
intuitively obvious that we have more and faster
computers, more sensors, and more data storage (including that "cloud"
business) than ever before. At the same time, few of us can grasp what
astrophysicists, or Facebook software engineers, or biostatisticians
actually
do, so applying Big Data to commerce can be a bit daunting.
To begin with, big data has a nomenclature problem.
Like so many other technologies -- smartphones, robots, or information
security -- the popular name doesn't really convey the essence of the
situation. Yes, "big data" can involve very
large volumes in some cases. But more generally, the phrase refers not merely to bigger data sets but to new kinds of data, generated, managed, and parsed in new ways.
While "Big Data" is a vague phrase, there is some agreement that it involves changes in scale along three dimensions:
*Volume: Whether it's your own hard disk space, the world's online video feeds, or a wealth of digital sensors measuring many aspects of the world, signs are abundant that data volumes are increasing steadily and substantially.
*Variety: Big data is not only a matter of bigger relational databases. As opposed to the familiar numbers related to customer ID, SKU, or price and quantity, we are living in an age of massive amounts of unstructured data: e-mails, Facebook "likes," tweets, machine traffic, and video.
*Velocity: Overnight batch processes are becoming less and less tenable as the world becomes an "always-on" information environment. When FedEx can tell me where my package is, or Fidelity can tell me my net worth, or Google Analytics can tell me my website performance right now, the pressure is on more and more other systems to do likewise.
Assuming a business can get past the vocabulary, Big Data presents challenges on several fronts:
Skills
Here's a quiz: ask someone in the IT shop how many of his or her colleagues are qualified to work in Hive, Pig, Cassandra, MongoDB, or Hadoop. These are some of the tools emerging from the front-runners in Big Data, web-scale companies including Google (indexing the entire Internet), Facebook (managing a billion users), Amazon (building and running the world's biggest online merchant), and Yahoo (figuring out what social media is conveying at the macro scale). Outside this small industry, Big Data skills are rare.
Complicating the matter, most of these tools are
open-source projects, which means a) that the code is free, b) that the
pace of innovation is rapid, to the point where staying current is an
issue, and c) that corporate training and support
aren't as robust as they are in the traditional data world. Big data
tools are evolving rapidly, aren't being taught in the universities, and
require levels of flexibility from their users that more mature tools
do not. One telling fact: when non-web companies
post a "Big Data" job specification, oftentimes nobody can state what
the key skill sets are or how they map to the existing team.
Furthermore, knowing Hadoop doesn't mean the skilled IT candidate knows
insurance, or pharmaceuticals, or energy: ground truth
matters in data analysis, so getting skills plus domain knowledge is a
substantial challenge in many cases.
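For readers who have never touched the tools named above, a minimal sketch may make the gap concrete. The snippet below imitates, in plain Python and entirely in-process, the map/shuffle/reduce pattern that Hadoop popularized; the log lines and field positions are invented, and a real Hadoop Streaming job would run the mapper and reducer as separate scripts over files in a distributed filesystem.

    # A local, toy imitation of the map/shuffle/reduce pattern behind Hadoop.
    # Real Hadoop Streaming jobs ship the mapper and reducer as separate
    # scripts that read stdin; the sample log lines here are fabricated.
    from itertools import groupby

    log_lines = [
        "2013-04-02 10:15:01 GET /product/123",
        "2013-04-02 10:15:07 GET /product/456",
        "2013-04-02 10:16:12 GET /product/123",
    ]

    def mapper(line):
        # Emit one (key, value) pair per page hit; the key is the URL path.
        fields = line.split()
        yield fields[3], 1

    def reducer(key, values):
        # Sum the counts collected for a single key.
        yield key, sum(values)

    # The "shuffle" step: sort and group mapper output by key, as the
    # framework would between the map and reduce phases.
    mapped = sorted(pair for line in log_lines for pair in mapper(line))
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        for path, total in reducer(key, (count for _, count in group)):
            print(path, total)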
Politics
Control over information is frequently thought to
bring power within an organization. Big data, however, is heterogeneous,
multi-faceted, and can bring performance metrics where they had not
previously operated. If a large retailer, hypothetically
speaking, traced its customers' purchase behavior first to social media expressions and then to advertising channels, how would the various budget-holders respond? Uncertainty as to ad spend efficacy is as old as advertising, but tracing ad channels to purchase
activity might bring light where perhaps it is not wanted. Information
sharing across organizational boundaries ("how are you going to use this
data?") can also be unpopular.
Another political danger lies in the realization
that "what gets measured gets gamed," as we notyed in a recent
newsletter. If a senior executive requests a dashboard including
high-volume, high-velocity metrics such as web clicks or Twitter
mentions, there can be a temptation to abandon revenue-generating
activities that might operate in a 3- or 6-month sales cycle, for the
instant reward of a new metric that ultimately does nothing for either
top or bottom line. Budgeting, meanwhile, is complicated
by the open-source nature of these tools: the software may be free, but
hardware costs, especially at this scale and even when capacity is procured from cloud vendors such as IBM or Amazon, behave quite differently from those in traditional IT shops. Add in the scarce skills,
and the evolving skills mix, and Big Data can cost more than may
initially be projected.
Risk can also fall under the political heading: who
is willing to stick his neck out to support adoption of technologies
that are both immature and broad-ranging? As more data is gathered, it
can leak or be stolen. Supposedly anonymous
records can often be de-anonymized; in a famous paper, a former
governor of Massachusetts was connected to his own records in a
public-health database. Insufficient privacy is in some ways a mathematical property of large, sparse data sets; there are also engineering risks. Here's one: implemented well, the security of large traditional
databases can be very sturdy, but information security in the Big Data
toolset has far to evolve before it can be called robust.
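The linkage behind that well-known re-identification is mechanically simple, which is part of what makes sparse data risky. The sketch below uses entirely fabricated records to show how a handful of quasi-identifiers (ZIP code, birth date, sex) can join a "de-identified" health file back to a public list that carries names.

    # Illustrative only: re-linking "anonymized" records through quasi-identifiers.
    # Every record below is fabricated.
    hospital_records = [   # names removed, so presumed anonymous
        {"zip": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "(withheld)"},
        {"zip": "02144", "birth_date": "1962-01-15", "sex": "F", "diagnosis": "(withheld)"},
    ]
    voter_roll = [         # a public record that does carry names
        {"name": "A. Smith", "zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
        {"name": "B. Jones", "zip": "02144", "birth_date": "1980-03-02", "sex": "F"},
    ]

    def quasi_id(record):
        # The combination of these three fields is often unique in practice.
        return (record["zip"], record["birth_date"], record["sex"])

    names_by_id = {quasi_id(v): v["name"] for v in voter_roll}

    for record in hospital_records:
        name = names_by_id.get(quasi_id(record))
        if name:
            print("Re-identified:", name, "->", record["diagnosis"])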
Technique
Given that relational databases have been around
for about 35 years, a substantial body of theory and practice makes these
environments predictable. Big Data, by contrast, is just being
invented, but already there are some important differences
between the two:
Most enterprise data is generated by or about
humans and organizations: SKUs are bought by people, bills are paid by
people, health care is provided to people, and so on. At some level,
many human activities can be understood at human scale.
Big data, particularly social media, can come from people too, but in
more and more cases, it comes from machines: server logs, POS scanner
data, security sensors, GPS traces. Given that these new types of data
don't readily fit into relational structures
and can get massively large in terms of storage, it's nontrivial to
figure out what questions to ask of these data types.
When data is loaded into relational systems, it
must fit predefined categories that ensure that what gets put into a
system makes sense when it is pulled out. This process implies that the
system is defined at the outset for what the designers
expect to be queried: the questions are known, more or less, before the
data is entered in a highly structured manner. In Big Data practice,
meanwhile, data is stored in as complete a form as possible, close to
its original state. As little as possible is
thrown out so queries can evolve and not be constrained by the
preconceptions of the system. Thus these systems can look highly random
to traditional database experts.
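A small sketch may help make the contrast concrete. The table, columns, and events below are invented; the point is only the difference between fitting data to a schema before loading it and keeping it raw until a question arrives.

    import json
    import sqlite3

    # Schema-on-write: the relational approach. The questions are anticipated
    # by the schema; whatever doesn't fit the columns is rejected or discarded.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (customer_id INTEGER, sku TEXT, qty INTEGER, price REAL)")
    db.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", (42, "SKU-991", 2, 19.99))

    # Schema-on-read: the Big Data habit. Events are kept close to their
    # original form, and structure is imposed later, query by query.
    raw_events = [
        json.dumps({"type": "like", "user": 42, "page": "acme-widgets"}),
        json.dumps({"type": "click", "user": 42, "url": "/checkout", "ms": 1380}),
        json.dumps({"type": "tweet", "user": 17, "text": "love my new widget"}),
    ]

    # A question nobody anticipated when the events were captured:
    # which users mention widgets, in any kind of event?
    for line in raw_events:
        event = json.loads(line)
        if "widget" in line.lower():
            print("user", event["user"], "->", event["type"])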
It's important to stress that Big Data will not replace relational databases in most scenarios; it's a matter of now having more tools to choose from for a given task.
Traditional databases are designed for a concrete
scenario, then populated with examples (customers, products, facilities,
or whatever), usually one per row: the questions and answers one can
ask are to some degree predetermined. Big data
can be harvested in its original form and format, and then analyzed as
the questions emerge. This open-ended flexibility can of course be both a
blessing and a curse.
Traditional databases measure the world in numbers and letters that have to be predicted in advance: zip codes are 5 or 10 digits, SKU formats are company-specific, mortgage payments come in predictable amounts. Big Data can accommodate Facebook "likes," instances of the "check engine" light illuminating, cellphone location mapping, and many other types of information.
Traditional databases are limited by the computing
horsepower available: to ask harder questions often means buying more
hardware. Big Data tools can scale up much more gracefully and
cost-effectively, so decision-makers must become accustomed
to asking questions they could not contemplate previously. To judge
advertising effectiveness, one cable operator analyzed every
channel-surfing click of every remote across every household in its
territory, for example: not long ago, such an investigation
would have been completely impractical.
Both the increasing prevalence of machine data and
the storage of data in near-native form generate major differences in
information technique. Furthermore, the scientific method can play a
more central role given that experimental findings
are collected into less constrained systems for analysis. Web
businesses have done A/B testing for years: 100 random site visitors get
a red banner, a 10% off coupon, or a personalized element while a
control group gets a blue banner, a 15% coupon, or a generic
site greeting. (In Google's case, 41 shades of blue were run through
A/B testing in 2009 before the same color was standardized on both Gmail
and the home page.) Superior performance can be assessed, tweaked, and
retested.
Harrah's Casinos uses this method to test marketing ideas, as does Capital One. Scientific experimentation does not require Big Data, but Big Data begs in many cases for sophisticated statistical machinery. The R programming language, for example, marks a substantial step beyond even the pivot-table expertise of Excel power users. While it's obviously an issue that the skills are in
short supply, the deeper question is one of mindset within businesses
that may use history, intuition, or other methods
to ground decisions. Where is the will, the budget, the insight to
employ Big Data approaches?
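For the curious, here is a toy version in Python of the arithmetic behind such a test: random assignment to two variants, followed by a two-proportion z-test on conversion rates. The visitor counts and conversion figures are invented, and a production setup would add sample-size planning and other safeguards omitted here.

    import random
    from math import sqrt
    from statistics import NormalDist

    random.seed(1)

    # Randomly assign visitors to the control (blue banner) or the variant (red banner).
    groups = [random.choice(["control", "variant"]) for _ in range(10000)]
    totals = {"control": groups.count("control"), "variant": groups.count("variant")}

    # Pretend these conversions were observed downstream of the assignment.
    conversions = {"control": 305, "variant": 362}

    p1 = conversions["control"] / totals["control"]
    p2 = conversions["variant"] / totals["variant"]
    pooled = (conversions["control"] + conversions["variant"]) / len(groups)
    se = sqrt(pooled * (1 - pooled) * (1 / totals["control"] + 1 / totals["variant"]))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided

    print("control %.2f%%, variant %.2f%%, z = %.2f, p = %.4f"
          % (100 * p1, 100 * p2, z, p_value))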
Cognition
What does it mean to think at scale? How do we
learn to ask questions of the transmission of every car on the road in a
metropolitan area, of the smartphone of every customer of a large
retail chain, or of every overnight parcel in a massive
distribution center? How can more and more businesspeople learn to
think probabilistically rather than anecdotally?
The mantra that "correlation doesn't
imply causation" is widely chanted yet frequently ignored; it takes
logical reasoning beyond statistical relationships to test what's really
going on. Unless the data team can grasp the
basic relationships of how a given business works, the potential for
complex numerical processing to generate false conclusions is
ever-present.
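One way to feel the force of that warning is to watch correlation appear where no causal link exists. The simulation below (pure Python, fabricated data) pairs up independent random walks; because both series trend over time, many pairs show correlations that would look impressive on a dashboard yet mean nothing at all.

    import random
    from statistics import mean, pstdev

    random.seed(7)

    def random_walk(steps=500):
        # Cumulative sum of independent random shocks, like many trending metrics.
        value, path = 0.0, []
        for _ in range(steps):
            value += random.gauss(0, 1)
            path.append(value)
        return path

    def pearson(a, b):
        # Plain Pearson correlation coefficient.
        ma, mb = mean(a), mean(b)
        cov = mean((x - ma) * (y - mb) for x, y in zip(a, b))
        return cov / (pstdev(a) * pstdev(b))

    trials = 200
    strong = sum(1 for _ in range(trials)
                 if abs(pearson(random_walk(), random_walk())) > 0.5)
    print("%d of %d unrelated pairs had |correlation| > 0.5" % (strong, trials))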
Numbers do not speak for themselves; it takes a human to tell
stories, but as Daniel Kahneman and others have shown, our stories often
embed mental traps.
Spreadsheets remain ubiquitous in the modern
enterprise but numbers at the scale of Google, Facebook, or Amazon must
be conveyed in other ways. Sonification -- turning numbers into a range
of audible tones -- and visualization show a lot
of promise as alternative pathways to the brain, bypassing raw, non-intuitive numerals. In the meantime, the pioneers are both seeing
the trail ahead and taking some arrows in the back for their troubles.
But the faster managers begin to break the stereotype
that "Big Data is what we've always done, just with more records or
fields," the faster the breakthrough questions, insights, and solutions
will redefine business practice.
Summing Up
Not surprisingly, the principles of good management
extend to the domain of Big Data even though the hype and rapid pace of
change can create confusion. Before businesses can profit from Big
Data, managers must refuse to get lost in the
noise that can obscure the basic forces represented by customers,
value, and execution. The scale, speed, and diversity of Big Data can
make it feel foreign, but by both refusing to be dazzled by numerical
tsunamis and insisting on the basics of sound analytical
practice (particularly in regard to causal relationships), any
executive can contribute to the next generation of performance-enhancing
information practice.