In the beginning, there was data, enumerating how many, what kind,
where. Data was kept in proprietary formats and physically located:
if the library was missing the Statistical Abstract for 1940, or some
other grad student had sequestered it, you had little chance to
determine corn production in Nebraska before World War II. Such
statistics were the exception: most data remained unpublished, in lab
notebooks and elsewhere.
Once data escaped from print into bits, it became potentially
ubiquitous, and once formats became less proprietary, more people
could gain access to more forms of data. The early history of the web
was built in part on a footing of public access to data: online
collections of maps, congressional votes, stock prices, phone numbers,
product catalogs, and other data proliferated.
Data has always required metadata: that table of corn production had a
title and probably a methodological footnote. Such metadata was
typically contributed by an expert either in the technical field or in
the practice of categorizing. Official taxonomies have continued the
tradition of creators and curators having cognitive authority in the
process of organizing. In addition, as Clay Shirky has pointed out in
"Ontology is Overrated," the heritage of physicality led to the need
for one answer to be correct so that an asset could be found. A book
about Russian and American agricultural policy during the 1930s had to
live among books on Russian history, agricultural history, or U.S.
history. It was arguably about any or all of those things, but someone
(most likely at the Library of Congress) assigned it a catalog number
that finalized the discussion: the book in question was officially and
forever "about" this more than it was about that.
In the past decade, the so-called read-write web has allowed anyone to
become both a content creator and a metadata creator. Sometimes these
activities coincide, as when someone tags their own YouTube video.
More often, creations are submitted to a commons, and the
commoners (rather than a cognitive authority) determine what the
contribution "is" and what it is "about." Rather than editors or peer
reviewers judging an asset's quality before publication, in more and
more settings the default process is publication, then collaborative
filtering for definition, quality, and meaning.
Imagine a particular propane torch for sale on Amazon.com. So-called
social metadata has been nurtured and collected for years on the site.
If I appreciate the way the torch works for its intended use of
brazing copper pipe, I can submit a review with both a star rating and
prose. Amazon soon allowed for more social metadata: you, the
reader of my review, can now rate that review, thus creating metadata
about metadata.
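
The layering described above can be made concrete with a small
sketch. It is only an illustration: the Review class, its field
names, and the helpful/unhelpful vote are assumptions made for this
example, not a description of how Amazon actually stores reviews.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Review:
    """First-order metadata: one reader's opinion of a product."""
    product: str
    stars: int   # star rating, 1 to 5
    text: str    # prose portion of the review
    # Second-order metadata: other readers' votes on this review
    # (hypothetical field, True meaning "helpful").
    helpful_votes: List[bool] = field(default_factory=list)

    def rate(self, helpful: bool) -> None:
        """Record another reader's judgment of the review itself."""
        self.helpful_votes.append(helpful)

    def helpfulness(self) -> Optional[float]:
        """Share of voters who found the review helpful, or None if no votes."""
        if not self.helpful_votes:
            return None
        return sum(self.helpful_votes) / len(self.helpful_votes)


review = Review("propane torch", 5, "Works well for brazing copper pipe.")
review.rate(True)
review.rate(True)
review.rate(False)
print(review.helpfulness())  # two of three voters found it helpful
```

The star rating and prose describe the torch; the votes describe the
review, which is the "metadata about metadata" step.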
Here is where the discussion gets complicated and extremely
interesting. Suppose I say in my review that I use the Flamethrower
1000 for creme brulee even though the device is not rated (by whatever
safety or sanitation authority) for kitchen use. The comments about
my torch review can quickly become a foodie discussion thread: the
best creme brulee recipe, the best restaurants at which to order it,
regional variations in the naming or preparation of creme brulee, and
so forth. Amazon's moderators might truncate the discussion to the
extent it's not "about" the Flamethrower 1000 under review, but the
urge to digress has long been demonstrated elsewhere and will be again.
Enter Facebook. The platform is in essence a gigantic metadata
generation and distribution system. ("I liked the concert." "The
person who liked the concert did not know what she was talking about."
"My friend was at the concert and said it was uneven." And so on.)
Strip Facebook of attribute data and there is little left: it's
essentially a mass of descriptors (including "complicated"), created
by amateurs and never claimed as authoritative, linked by a
21st-century kinship network. Facebook's announcement on April 21st
of the Open Graph institutionalizes this collection of conversations
as one vast, logged, searchable metadata repository. If I "like"
something, my social network can be alerted, and the website object of
my affection will know as well.
Back in November, Bruce Schneier laid out five categories of social
networking data:
1. Service data. Service data is the data you need to give to a social
networking site in order to use it. It might include your legal name,
your age, and your credit card number.
2. Disclosed data. This is what you post on your own pages: blog
entries, photographs, messages, comments, and so on.
3. Entrusted data. This is what you post on other people's pages. It's
basically the same stuff as disclosed data, but the difference is that
you don't have control over the data -- someone else does.
4. Incidental data. Incidental data is data the other people post
about you. Again, it's basically the same stuff as disclosed data, but
the difference is that 1) you don't have control over it, and 2) you
didn't create it in the first place.
5. Behavioral data. This is data that the site collects about your
habits by recording what you do and who you do it with.
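
One way to read the five categories is as a small decision
procedure. The sketch below is a simplification under stated
assumptions: the DataKind enum, the classify function, and its four
yes/no flags are invented for illustration and are not part of
Schneier's taxonomy, which is prose only.

```python
from enum import Enum


class DataKind(Enum):
    SERVICE = "service"
    DISCLOSED = "disclosed"
    ENTRUSTED = "entrusted"
    INCIDENTAL = "incidental"
    BEHAVIORAL = "behavioral"


def classify(*, required_by_site: bool, logged_by_site: bool,
             created_by_subject: bool, on_subjects_page: bool) -> DataKind:
    """Assign one category to a piece of data about a user (the subject).

    All four flags are hypothetical simplifications of the prose list.
    """
    if required_by_site:
        return DataKind.SERVICE      # given to the site in order to use it
    if logged_by_site:
        return DataKind.BEHAVIORAL   # recorded by the site, not posted
    if not created_by_subject:
        return DataKind.INCIDENTAL   # posted about you by someone else
    # Posted by the subject: control depends on whose page it lives on.
    return DataKind.DISCLOSED if on_subjects_page else DataKind.ENTRUSTED


# A comment you leave on a friend's page:
print(classify(required_by_site=False, logged_by_site=False,
               created_by_subject=True, on_subjects_page=False))
```

The function forces each item into exactly one category, which the
prose list does not strictly require; real data may straddle several.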
What does that list look like today? A user's trail of "like" clicks
makes this list, or her Netflix reviews and star ratings (themselves
the subject of privacy concerns), seem like merely the tip of the
iceberg. As Dan Frankowski said in his Google Talk on data mining,
people have been defined by their preferences for millennia --
sometimes to the point of dying for them.
With anything so new and so massive in scale (50,000 sites adopted the
"like" software toolkit in the first week), the unexpected
consequences will take months, and more likely years, to accumulate.
What will it mean when every opinion we express online, from the
passionate to the petty, gets logged in the Great Preference
Repository in the Sky, never to be erased and forever available to be
correlated, associated, regressed, and otherwise algorithmically
parsed?
Several questions follow: who will have either direct or indirect
access to the metadata conversation? What are the opt-in, opt-out,
and monitoring/correction provisions? If I once mistakenly clicked a
Budweiser button but have since publicly declared myself a Molson man,
can I see my preference library as if it were a credit score and remedy
any errors or misrepresentations? What will be the rewards for brand
monogamy versus the penalties for promiscuous "liking" of every
product with a prize or a coupon attached?
While this technology appears to build barriers to competitive entry
for Facebook, what happens if I establish a preference profile when
I'm 14, then decide I no longer like zoos, American Idol, or Gatorade?
Will people seek a fresh start at some point in an undefined network,
with no prehistory? What is the mechanism for "unliking" something,
and how far retrospectively will it apply?
Precisely because Facebook is networked, we've come a very long way
from that Statistical Abstract on the library shelf. What
happens to my social metadata once it traverses my network? How much
or how little control do I have over what my network associates
("friends" in Facebook-speak) do with my behavioral and opinion data
that comes their way? As both the Burger King "Whopper Sacrifice"
(defriend ten people, get a hamburger coupon) and a more recent
Ikea-spoofing scam have revealed, Facebook users will sell out their
friends for rewards large and small, whether real or fraudulent.
Finally, to the extent that Facebook is both free to use and expensive
to operate, the Open Graph model opens a fascinating array of revenue
streams. If beggars can't be choosers, users of a free system have
limited say in how that system survives. At the same time, the global
reach of Facebook exposes it to a broad swath of regulators, not the
least formidable of whom come out of the European Union's strict
privacy rights milieu. As both the uses and inevitable abuses of the
infinite metadata repository unfold, the reaction is sure to be
newsworthy.