Thursday, June 21, 2007

May 2007 Early Indications

The following is based on the opening talk presented at the Center for
Digital Transformation's spring 2007 research forum.

Roughly 20 years ago, Citibank CEO Walter Wriston said that
"information about money has become almost as important as money
itself." Since that time, complex secondary and tertiary risk markets
have grown into a massive global financial information-processing
mechanism. Stocks and bonds, traded on primary markets, are hedged by
futures, options, and derivatives, as well as a variety of arcane (to
the public) devices such as Enron's famous special purpose entities.
These instruments are nothing more than information about money, and
their growth helps prove the truth and wisdom of Wriston's comment.

Data, what Stan Davis once called "information exhaust" or the
byproduct of traditional business transactions, has become a means of
exchange and a store of value in its own right. Hundreds or even
thousands of business plans are circulating, each promising to
"monetize data." While Google is an obvious poster child for this
trend, there are many other, often less obvious, business models
premised on Wriston's core insight: that information about stuff is
often more valuable and/or profitable than the stuff itself.

Internet businesses are the first that come to mind. Both Linux and
eBay have captured reputational currency and developed communities
premised on members' skills, trustworthiness, and other attributes.
These attributes are, in the case of eBay, highly codified and make
the business much more than a glorified classified ad section.
Information about retail goods is used by 7-Eleven Japan to drive
new-product hypotheses in much the same way as analytical credit
card operations such as Capital One develop offers in silico. An
astounding 70% of SKUs in a 7-Eleven are new in a given year, and such
innovation in a seemingly constrained market is only possible because
of effective use of data.

Amazon's use of purchase and browsing data remains unsurpassed. I
recently compared a generic public page -- "welcome guest!" -- to my
home page, and at least eighteen different elements were customized
for me. The customizations fell into two categories: "more of the
same," continuing a trend begun with a previous author or
recording-artist purchase, and "we thought you might like,"
recommendations based on the behavior of other customers deemed
similar to me. Each of the eighteen elements had a valid reason for
inclusion and represented a plausible purchase.

Another less visible example of this trend is the Pantone system.
Information about color is almost certainly more profitable than paint
or ink. Pantone has a monopoly on the precise definitions for colors
used in commerce, whether in advertising or branding - Barbie pink and
Gap blue are omnipresent - or in production processes: every brownie
baked for use in Ben & Jerry's ice cream is compared to two Pantone
browns to ensure consistency. Pantone is also global: Gap blue is the
same in Japan as in New Jersey, and on shopping bags, neon signs, and
printed materials. The private company does not disclose revenues, but
it is now branching out into prediction businesses, selling briefings
that tell fashion, furniture, and other companies whether or not olive
green will be popular next year.
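The brownie comparison above amounts to a tolerance check against reference colors. Here is a toy sketch of that idea in Python; the reference browns and the tolerance are invented, and Euclidean RGB distance is a deliberate simplification, not Pantone's actual (proprietary) matching method.

```python
# Toy color-tolerance check: accept a sample if it lands close enough
# to either of two reference browns. All values are invented.
from math import dist

REF_LIGHT = (150, 100, 60)   # hypothetical "light brown" reference
REF_DARK = (100, 60, 30)     # hypothetical "dark brown" reference

def within_tolerance(sample, tol=40.0):
    """True if the sample is within `tol` RGB units of either reference."""
    return min(dist(sample, REF_LIGHT), dist(sample, REF_DARK)) <= tol

print(within_tolerance((140, 95, 55)))   # near the light reference: accepted
print(within_tolerance((20, 200, 20)))   # green: rejected
```

Real color matching is done in perceptual spaces such as CIELAB, but the principle, quantified distance from a canonical reference, is the same.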

A second trend crossing business, science, and other fields can
colloquially be called "big data." We are seeing the growth of truly
enormous data stores, which can facilitate both business decisions and
analytic insights for other purposes.

Some examples:

-The Netflix Prize invites members of the machine learning community
to improve the algorithms behind Netflix's "if you liked X you might
like Y" recommendations. While it is not clear that the performance
benchmark needed to win the $1 million top prize can be reached
incrementally, one major attractor for computer scientists is the size
and richness of Netflix's test data set, the likes of which are scarce
in the public domain: it consists of more than 100 million ratings
from over 480 thousand randomly-chosen, anonymous customers on nearly
18 thousand movie titles.
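The "if you liked X you might like Y" idea can be sketched with item-based collaborative filtering: score pairs of movies by how similarly users rated them. The titles and ratings below are invented, and real systems (Netflix's included) are far more sophisticated, but this is the core mechanic.

```python
# Toy item-based collaborative filtering. Ratings and titles are made up.
from math import sqrt

ratings = {
    "ann":  {"Heat": 5, "Ronin": 4, "Amelie": 1},
    "bob":  {"Heat": 4, "Ronin": 5, "Clue": 2},
    "cara": {"Heat": 1, "Amelie": 5, "Clue": 4},
}

def cosine(movie_a, movie_b):
    """Cosine similarity over the users who rated both movies."""
    common = [u for u in ratings
              if movie_a in ratings[u] and movie_b in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][movie_a] * ratings[u][movie_b] for u in common)
    na = sqrt(sum(ratings[u][movie_a] ** 2 for u in common))
    nb = sqrt(sum(ratings[u][movie_b] ** 2 for u in common))
    return dot / (na * nb)

def similar_to(movie):
    """Return the title most similar to `movie` by rating pattern."""
    movies = {m for user in ratings.values() for m in user}
    return max((cosine(movie, other), other)
               for other in movies if other != movie)[1]

print(similar_to("Heat"))  # -> Ronin: rated alike by the same users
```

The appeal of the Netflix data set is exactly that this kind of similarity computation becomes meaningful at 100 million ratings, where toy examples like this one fall apart.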

-Earlier this month a new effort, the Encyclopedia of Life, was
launched to provide an online catalog of every species on earth.
Meanwhile, over the past several years, geneticist Craig Venter sailed
around the world on a boat equipped with gene-sequencing gear. The
wealth of the results is staggering: at least six million new genes
were discovered.

-The data available on a Bloomberg terminal allows complex inquiries
across asset classes, financial markets, and time to be completed
instantaneously. Before this tool, imagine answering a simple
question using spreadsheets, paper records, multiple currencies, and
optimization: "What basket of six currencies - three short and three
long - delivered the best performance over the past two years?"
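The currency-basket question is, at heart, a small combinatorial search: enumerate every way to pick three currencies to go long and three to go short, and score each basket. A minimal sketch, with invented two-year returns standing in for real market data:

```python
# Brute-force search for the best 3-long / 3-short currency basket.
# The two-year returns below are invented, purely for illustration.
from itertools import combinations

returns = {"EUR": 0.12, "JPY": -0.04, "GBP": 0.09,
           "CHF": 0.03, "AUD": 0.15, "CAD": 0.06}

def best_basket():
    """Try every 3-currency long side; the rest are shorted."""
    best = None
    for longs in combinations(returns, 3):
        shorts = tuple(c for c in returns if c not in longs)
        perf = (sum(returns[c] for c in longs)
                - sum(returns[c] for c in shorts))
        if best is None or perf > best[0]:
            best = (perf, longs, shorts)
    return best

perf, longs, shorts = best_basket()
print(sorted(longs), sorted(shorts), round(perf, 2))
```

With six currencies there are only 20 baskets to check; the point of a Bloomberg terminal is that the same query runs instantly over thousands of instruments and years of history.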

-The Church of Jesus Christ of Latter-day Saints has gathered
genealogical records into an online repository. The International
Genealogical Index database contains approximately 600 million names
of deceased individuals, while the addendum to the index contains an
additional 125 million names. Access is free to the public.

In the presence of such significant data sets, various academic
disciplines are debating how their fields should progress.
Quantitative vs. qualitative methods continue to stir spirited
discussion in fields ranging from sociology to computer science. The
continuing relevance of such essays as C.P. Snow's The Two Cultures
and David Hollinger's "The Knower and the Artificer" testifies to the
divide between competing visions of inquiry and indeed of truth.

A fascinating question, courtesy of my colleague Steve Sawyer,
concerns the nature of errors in data-rich versus data-poor
disciplines. Some contend that data-rich disciplines tend to be wary
of type I errors (false positives) and thus miss many opportunities by
committing false negatives (type II) that are less visible. Data-poor
communities, meanwhile, may be unduly wedded to theories given that
evidence is sparse and relatively static: in contrast to Venter's
marine discoveries, historians are unlikely to get much new evidence
of either Roman politics or Thomas Jefferson's.

Given that data is clearly valuable, bad guys are finding ways to get
and use it. Privacy is becoming a concern that is both widely shared
and variously defined. Indeed, our commentator Lawrence Baxter, who
used to be a law professor at Duke, noted that defining what privacy
is has proven to be effectively impossible. What can be defined are
the violations, which leads to a problematic state of affairs for both
law and policy.

Data breaches are growing both in number and in size: in the past year
and a half, there have been roughly 50 episodes that involved loss of
more than 100,000 records. The mechanisms for loss range from lost
backup tapes (that were not encrypted) to human error (government
officials opening or publishing databases containing personally
identifiable information) to unauthorized network access. In the
latter category, retailer TJX lost over 45 million credit- and
debit-card numbers, with the thieves, thought to be connected to
Russian organized crime, gaining access through an improperly
configured wireless network at a Marshalls store in Minnesota. Bad
policies, architecture, and procedures compounded the network problem,
to the point where TJX cannot decrypt the files created by the hackers
inside the TJX headquarters transactional system.

Part of data's attractiveness is its scale. If an intruder wanted to
steal paper records of 26 million names, as were lost by the Veterans
Administration last year after a single laptop was stolen, he or she
would need time, energy, and a big truck: counting filing cabinets,
the records would weigh an estimated 11,000 pounds. A USB drive
holding 120 gigabytes of data, meanwhile, can be as small as a 3" x 5"
card and a half-inch thick.
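The back-of-envelope arithmetic behind that contrast is worth making explicit. The per-record size below is an assumption (roughly a kilobyte of text per person); the record count and drive capacity come from the examples above.

```python
# Back-of-envelope: do 26 million records fit on a 120 GB USB drive?
# Assumes ~1 KB of text per record (an assumption, not a measured figure).
records = 26_000_000
bytes_per_record = 1_000
total_gb = records * bytes_per_record / 1e9
drive_gb = 120

print(round(total_gb, 1), "GB")   # ~26 GB of raw text
print(total_gb < drive_gb)        # fits, with most of the drive to spare
```

Even if the per-record estimate is off by a factor of four, the entire 11,000-pound paper archive still fits in a shirt pocket, which is the asymmetry that makes data theft so attractive.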

Redefining risk management in a data economy is proving to be
difficult, in part because IT workers have been slow to lead the way
in both seeing the value of data and treating it accordingly. To take
one notable example, the Boston Globe printed green-bar records
containing personal data relating to 240,000 subscribers, then
recycled the office paper by using it to wrap Sunday Globes for
distribution. Not surprisingly, an arms race is emerging between bad
guys, with tools such as phishing generators and network sniffers, and
the good guys, who all too often secure the barn after the horse has
run away.

Who will be the winners? That is, what companies, agencies, or
associations will use data most effectively? Acxiom, Amazon, American
Express, and your college alumni office might come to mind, but it is
so early in the game that a lot can happen. Some criteria for a
potential winner, and there will of course be many, might include the
following:

-Who is trusted?
-Who has the best algorithms?
-Who has, or can create, the cleanest data?
-Who stands closest to real transactions?
-Who controls the chain of custody?
-Who can scale?
-Who has the clearest value proposition?
-Who understands the noise in a given system?
-Who can exploit network externalities?

Whoever emerges at the front of the pack, the next few years are sure
to be a wild ride.