Tuesday, June 25, 2013

Early Indications June 2013: What gets measured, gets . . .

“What gets measured, gets managed” has become a business truism, one the great Peter Drucker allegedly wrote. (There is no citation, however, and original credit may go to Lord Kelvin, who stated that “If you cannot measure it, you cannot improve it.”) In the “big data” era, it has become an article of faith that the more measurements we can gather and presumably analyze, the more we can optimize the behavior that drives medical outcomes, social welfare, and corporate profitability. While I believe that we will see some extremely positive validations of this hypothesis, there are enough cautionary tales to warrant some skepticism before accepting the promises of the big data evangelists at face value.

Five unrelated examples combine to suggest an alternative mantra:

what gets measured, gets gamed.

That is, the scorecard gets attention _at the expense of_ the nominal task that was being measured in the first place.

Example 1: A former student reported that forecasting tools in a consumer products company were generating remarkably consistent projections, regardless of seasonality, competitors’ new product launches, or other visible alterations to the landscape. After some investigation, it was determined that a specific forecast curve had become popular (whether with procurement, finance, marketing, or plant managers was not made clear). To generate the “acceptable” forecast month after month, analysts took to essentially defeating the forecast algorithms by adjusting past actual quantities: to get the future curve they wanted, employees rewrote history.

Sales forecasting is gamed by definition, given the way commissions, market uncertainty, and expectation management affect the process. Numerous attempts have been made to induce “best-guess” estimates from salespeople, but even those companies that deployed prediction markets reported mixed results.

Example 2: For a time there was a breed of financial planner who was paid not on the basis of his or her clients’ rate of return, but by commissions generated by equities trades. Not surprisingly, clients did not get advice based on the long-term growth of their portfolio, but on the hottest stock of the moment. Moving clients in and out of different equities based on magazine cover stories proved to be good business for the planners, and only incidentally and accidentally profitable for the clients.

Example 3: A former colleague of mine recently analyzed the marketing activities of a large technology company. Even though the company sells B-to-B with a direct sales force, an executive dashboard someplace measures website clicks. The word came down through marketing that each product group had to “win the dashboard,” in this case, piling up web clicks through heavy ad placement even though this behavior could in no way be tied to revenue, customer satisfaction, or even lead generation.

Example 4 comes from closer to home. Course evaluations have become the focus of many universities’ professional assessments of non-research faculty, in an attempt to ensure that students feel the instructor did his or her job. At Penn State the forms are called not “course evaluations” but SRTEs: Student Ratings of Teacher Effectiveness, though I doubt I am alone in believing the E stands for Entertainment. In my last teaching job, now 20 years ago, I was known to game the evaluation process, bringing cookies for the class on the last day before passing out paper evaluation forms. In our modern age, however, the assessment has gone online, so students are able to fill out the forms at their convenience and administrators can get scores reported in days rather than the months it took to code paper instruments.

At Penn State, the move toward paperless assessment has coincided with a startling drop in the completion rate. Like some other schools, we have an institute for the advancement of teaching skills. Upon seeing the drop in SRTE completion, our center undertook a project to try to improve compliance with the assessment. Note that these efforts do nothing to improve pedagogy or understand why compliance is dropping; the focus on the course assessment process is completely unrelated to helping students learn. Once again, the tail is wagging the dog.

Example 5: Information technology has become the backbone of most modern organizations. Grading the performance of the IS group, however, is extremely difficult. In many IS shops, system uptime is readily quantifiable and usually scores in the high 90s. (For reference, 99.5% is a great score on a test, but in this context it means the system was down for almost two full days a year.) What is much more difficult to measure, yet more important to business performance, is whether the right applications were running in the first place, how much inefficiency in the data center was required to get the gaudy uptime number, or how good the data was that the system delivered. Information quality is one of those metrics that is incredibly hard (and sometimes embarrassing) to measure, hard to improve, and hard to justify in terms of conventional ROI. Yet although it is, more often than not, truly critical for business performance, information quality was not, in years past, a component of a CIO’s performance plan. I’m told the situation is changing, although measuring application portfolio management – how well IS gets the right tools into production and the old ones retired – remains a challenge.
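
The uptime arithmetic is easy to check. The short Python sketch below converts an uptime percentage into hours of downtime per year; the function name `annual_downtime_hours` and the 365-day year are my own assumptions for illustration, not anything an IS shop necessarily uses.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def annual_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a system is unavailable at the given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


# Compare a few commonly quoted uptime figures.
for pct in (99.0, 99.5, 99.9, 99.99):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% uptime -> {hours:.1f} hours of downtime per year")
```

At 99.5% this works out to about 43.8 hours, or roughly 1.8 days, which is the "almost two full days" cited above. Note what the calculation leaves out: it says nothing about when the outages happened or what they cost.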

The five examples, along with many others from your own experience, suggest two important lessons. First, more data will by definition – thank you Claude Shannon – contain more noise. As Nassim Taleb notes in his critique of uncritical big-data love, more data simply means more cherry-picking (and not, Nate Silver would add, better hypothesis generation).
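
The cherry-picking point can be illustrated with a toy simulation. This is only a sketch under stated assumptions: every series is pure Gaussian noise, the sample size of 30 observations is arbitrary, and the helper names (`pearson`, `best_spurious_correlation`) are hypothetical. It is not Taleb's own analysis. The idea is to generate a noise "target," then keep the strongest correlation found among a growing pile of equally meaningless candidate series.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_correlation(n_candidates, n_obs=30, seed=0):
    """Strongest |correlation| between a noise target and n noise candidates."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# More candidate series to pick from; none has any real relationship
# to the target.
for k in (5, 50, 500):
    print(k, round(best_spurious_correlation(k), 3))
```

Because the same seed produces the same series in the same order, the larger pools contain the smaller ones, so the best correlation found can only stay the same or grow as the pool widens. Nothing in the data is real, yet the "best" finding looks better the more places there are to look.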

Second, in the domain of human management, incentive structures remain hard to get right, so there will be more and more temptations to let “numbers speak for themselves.” Such attitudes can emphasize the most readily measured phenomena, often of activity rather than outcomes – web clicks are easier to count than conversions; sales calls are easier to generate than revenues; incoming SAT scores are easier to average than student loan debt or job placement rates of the graduating class.

One would hope that performance assessment would rest on getting the measurements right, even though that usually means counting something that doesn’t look as good. Given so much evidence to the contrary from the worlds of medicine, commerce, sports, the military (remember Robert McNamara's "kill ratios"?), and academia, however, it would appear that these games will forever be with us.