Five unrelated examples combine to suggest an alternative
mantra:
what gets measured, gets gamed.
That is, the scorecard gets attention _at the expense of_ the
nominal task that was being measured in the first place.
Example 1: A former student reported that forecasting tools
in a consumer products company were generating remarkably consistent
projections, regardless of seasonality, competitors’ new product launches, or
other visible alterations to the landscape. After some investigation, it was
determined that a specific forecast curve had become popular (whether with
procurement, finance, marketing, or plant managers was not made clear). To
generate the “acceptable” forecast month after month, analysts took to
essentially defeating the forecast algorithms by adjusting past actual
quantities: to get the future curve they wanted, employees rewrote history.
Sales forecasting is gamed by definition, given the way commissions,
market uncertainty, and expectation management affect the process. Numerous
attempts have been made to elicit “best-guess” estimates from salespeople,
but even those companies that deployed prediction markets reported mixed
results.
Example 2: For a time there was a breed of financial planner
who was paid not on the basis of his or her clients’ rate of return but on the
commissions generated by equity trades. Not surprisingly, clients did not get
advice based on the long-term growth of their portfolio, but on the hottest
stock of the moment. Moving clients in and out of different equities based on
magazine cover stories proved to be good business for the planners, and only
incidentally and accidentally profitable for the clients.
Example 3: A former colleague of mine recently analyzed the
marketing activities of a large technology company. Even though the company
sells B-to-B with a direct sales force, an executive dashboard someplace
measures website clicks. The word came down through marketing that each product
group had to “win the dashboard,” which in this case meant piling up web clicks through
heavy ad placement even though this behavior could in no way be tied to
revenue, customer satisfaction, or even lead generation.
Example 4 comes from closer to home. Course evaluations have
become the focus of many universities’ professional assessments of non-research
faculty, the aim being to ensure that students feel the instructor did his or her job.
At Penn State the forms are called not “course evaluations” but SRTEs: Student
Ratings of Teacher Effectiveness, though I doubt I am alone in believing the E
stands for Entertainment. In my last teaching job, now 20 years ago, I was
known to game the evaluation process, bringing cookies for the class on the
last day before passing out paper evaluation forms. In our modern age, however,
the assessment has gone online, so students are able to fill out the forms at
their convenience and administrators can get scores reported in days rather
than the months it took to code paper instruments.
At Penn State, the move toward paperless assessment has
coincided with a startling drop in the completion rate. Like some other
schools, we have an institute for the advancement of teaching skills. Upon
seeing the drop in SRTE completion, the center undertook a project to try to
improve compliance with the assessment. Note that these efforts do nothing to
improve pedagogy or to explain why compliance is dropping; the focus on the
course assessment process is completely unrelated to helping students learn. Once
again, the tail is wagging the dog.
Example 5: Information technology has become the backbone of
most modern organizations. Grading the performance of the IS group, however, is
extremely difficult. In many IS shops, system uptime is readily
quantifiable and usually scores in the high 90s. (For reference, 99.5% is a
great score on a test but in this context it means the system was down for
almost two full days a year.) What is much more difficult to measure, yet more
important to business performance, is whether the right applications were
running in the first place, how much inefficiency in the data center was
required to get the gaudy uptime number, or how good the data was that the
system delivered. Information quality is one of those metrics that is incredibly hard
(and sometimes embarrassing) to measure, hard to improve, and hard to justify
in terms of conventional ROI. Yet even though it is, more often than not, truly
critical to business performance, information quality was not, in years past, a component
of a CIO’s performance plan. I’m told the situation is changing, although measuring
application portfolio management – how well IS gets the right tools into
production and the old ones retired – remains a challenge.
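To make the uptime arithmetic above concrete, here is a minimal sketch in Python (the helper name and the sample percentages are mine, purely for illustration) that converts an uptime percentage into annual downtime; at 99.5%, that works out to roughly 1.8 days, the “almost two full days” noted above.

```python
# Back-of-the-envelope conversion from an uptime percentage to annual downtime.
# The helper name and sample values are illustrative, not from the original article.

def annual_downtime_days(uptime_pct: float) -> float:
    """Days of downtime per (non-leap) year implied by an uptime percentage."""
    return (1 - uptime_pct / 100) * 365

if __name__ == "__main__":
    for pct in (99.5, 99.9, 99.99):
        print(f"{pct}% uptime -> {annual_downtime_days(pct):4.1f} days down per year")
    # 99.5% uptime -> roughly 1.8 days of downtime a year
```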
The five examples, along with many others from your own
experience, suggest two important lessons. First, more data will by definition
– thank you Claude Shannon – contain more noise. As Nassim Taleb notes in his
critique of uncritical big-data love, more data simply means more
cherry-picking (and not, Nate Silver would add, better hypothesis generation).
Second, in the domain of human management, incentive
structures remain hard to get right, so there will be more and more temptations
to let “numbers speak for themselves.” Such attitudes can emphasize the most readily
measured phenomena, often measures of activity rather than outcomes – web clicks are
easier to count than conversions; sales calls are easier to generate than
revenues; incoming SAT scores are easier to average than student loan debt or
job placement rates of the graduating class.
One would hope that getting the measures right, even though the right measures
usually mean counting something that doesn’t look as good, would dictate
performance assessment. Given so much evidence to the contrary from the worlds of medicine,
commerce, sports, the military (remember Robert McNamara’s “kill ratios”?), and academia, however, it would
appear that these games will forever be with us.