Evaluating A Predictive Model: Good Smells and Bad Smells

Post 1: Sniffing for Bull***t.

As a people analytics professional, you are now expected to make decisions about whether to use various predictive models. These are surprisingly difficult decisions, with important consequences for your employees and job applicants. In fact, I started drafting a lovely little three-section blog post around this topic before realizing that there was zero chance I was going to be able to pack everything into a single post.

There are simply no hard and fast rules you can follow to know if a model is good enough to use “in the wild.” There are too many considerations. To take an initial example, what are the consequences of being wrong? Are you predicting whether someone will click on an ad, or whether someone has cancer? In fact, even talking about model accuracy is multifaceted. Are you worried about detecting everyone who does have cancer-- even at the risk of false positives? Or are you more concerned about avoiding false positives?

Side note: If you are a people analytics professional, you ought to become comfortable with the ideas of precision and recall. Many people have produced explanations of these terms, so we won’t go into them in depth here. Here is one from “Towards Data Science”.
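If you want to see the two metrics in concrete terms, here is a minimal sketch (using scikit-learn and made-up labels, purely for illustration) of how they are computed:

```python
# Minimal illustration of precision and recall with made-up labels.
from sklearn.metrics import precision_score, recall_score

# 1 = "positive" (has cancer, will leave, etc.), 0 = not -- hypothetical labels
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# Precision: of everyone the model flagged, how many really were positive?
print("precision:", precision_score(actual, predicted))  # 3 of 4 flagged -> 0.75

# Recall: of everyone who really was positive, how many did the model flag?
print("recall:", recall_score(actual, predicted))         # 3 of 4 positives -> 0.75
```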

So, all that said, instead of a single long post attempting to cover a respectable amount of this topic, we are going to put out a series of posts under the heading Evaluating a Predictive Model: Good Smells and Bad Smells.

And, since I’ve never met an analogy that I wasn’t willing to beat to death, we’ll use that smelly comparison to help you keep track of the level at which we are evaluating a model. For example, in this post we’re going to start way out at bull***t range.

Sniffing for Bull***t

As this comparison implies, you ought to be able to smell these sorts of problems from pretty far out. In fact, for these initial checks, you don’t even have to get close enough to sniff around at the details of the model. You’re simply going to ask the producers of the model (vendor or in-house team) a few questions about how they work to see if they are offering you potential bull***t.

At One Model, we're always interested in sharing our thoughts on predictive modeling. One of these great chats is available on the other side of this form.

Back to our scheduled programming.

Remember that predictions are not real. Because predictive models generate data points, it is tempting to treat them like facts. But they are not facts. They are educated guesses. If you are not committed to testing them and reviewing the methodology behind them, then you are contenting yourself with bull***t. Technically speaking, by bull***t, I mean a scenario in which you are not actually concerned with whether the predictions you are putting out are right or wrong.

For those of you looking for a more detailed theory of bull***t, I direct you to Harry G. Frankfurt.

At One Model we strive to avoid giving our customers bull***t (yay us!) by producing models with transparency and tractability in mind. By transparency we mean that we are committed to showing you exactly how a model was produced, what type of algorithm it is, how it performs, how features were selected, and other decisions that were made to prepare and clean the data. By tractability we mean that the data is traceable and easy to wrangle and analyze.

When you put these concepts together you end up with predictive models that you can trust with your career and the careers of your employees. If, for example, you produce an attrition model, transparency and tractability will mean that you are able to educate your data consumers on how accurate the model is. It will mean that you have a process set up to review the results of predictions over time and see if they are correct. It will mean that if you are challenged about why a certain employee was categorized as a high attrition risk, you will be able to explain what features were important in that prediction. And so on.
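To give a flavor of what that last point can look like in practice, here is a rough sketch of pulling feature importances out of a trained model. The file name and column names are hypothetical, and more rigorous approaches (permutation importance, SHAP values) exist, but even this level of visibility beats a shrug:

```python
# A sketch of surfacing feature importances from a trained model.
# File name and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

history = pd.read_csv("attrition_history.csv")   # past employees with outcomes
X = history.drop(columns=["left_company"])       # assumes features are already numeric/encoded
y = history["left_company"]                      # 1 if the employee left

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by how heavily the model relies on them.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances.head(10))
```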

To take a counter example, there’s an awful lot of machine learning going on in the talent acquisition space. Lots of products out there are promising to save your recruiters time by using machine learning to estimate whether candidates are a relatively good or a relatively bad match for a job. This way, you can make life easier for your recruiters by taking a big pile of candidates and automagically identifying the ones that are the best fit.

I suspect that many of these offerings are bull***t. And here are a few questions you can ask the vendors to see if you catch a whiff (or perhaps an overwhelming aroma) of bull***t. The same sorts of questions would apply for other scenarios, including models produced by an in-house team.

Hey, person offering me this model, do you test to see if these predictions are accurate?

Initially I thought about making this question “How do you” rather than “Do you”. I think “Do you” is more to the point. Any hesitation or awkwardness here is a really bad smell. In the talent acquisition example above, the vendor should at least be able to say, “Of course, we did an initial train-test split on the data and we monitor the results over time to see if people we say are good matches ultimately get hired.”
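To make that answer concrete, here is a bare-bones sketch of what “we did a train-test split and we check accuracy” can mean in practice. The dataset, file name, and columns are hypothetical, and a real pipeline would do quite a bit more:

```python
# A bare-bones sketch of "train on one slice of history, test on another."
# File name and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

candidates = pd.read_csv("candidate_history.csv")   # past candidates with known outcomes
X = candidates.drop(columns=["was_hired"])          # assumes features are already numeric/encoded
y = candidates["was_hired"]                         # 1 if they were ultimately hired

# Hold out 25% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score only the held-out data -- this is the honest number to ask a vendor for.
predictions = model.predict(X_test)
print("precision:", precision_score(y_test, predictions))
print("recall:   ", recall_score(y_test, predictions))
```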

Now later on, we might devote a post in this series to self-fulfilling prophecies. Meaning, in this case, that you should be on alert for the fact that by promoting a candidate to the top of the resume stack, you are almost certainly going to increase the odds that they are hired, and thus you and your model are shaping, rather than predicting, the future. But we’re still out at bull***t range, so let’s leave that aside.

And so, having established that the producer of the model does in fact test their model for accuracy, the next logical question to ask is:

So how good is this model?

Remember that we are still sniffing for bull***t. The purpose of this question is not so much to hear whether a given model has .75 or .83 precision or recall, but just to test if the producers of the model are willing to talk about model performance with you. Perhaps they assured you at a high level that the model is really great and they test it all the time-- but if they don’t have any method of explaining model performance ready for you… well… then their model might be bull***t.

What features are important in the model? / What type of algorithm is behind these predictions?

These follow-up questions are fun in the case of vendors. Oftentimes vendors want to talk up their machine learning capabilities with a sort of “secret sauce” argument. They don’t want to tell you how it works or the details behind it because it’s proprietary. And it’s proprietary because it’s AMAZING. But I would argue that this need not be the case and that their hesitation is another sign of bull***t.

For example, I have a general understanding of how the original PageRank algorithm behind Google Search works. Crawl the web and work out the number of pages that link to a given page as a sign of relevance. If those backlinks come from sites which themselves have large numbers of links, then they are worth more. In fact, Sergey Brin and Larry Page published a paper about it. This level of general explanation did not prevent Google from dominating the world of search.
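In case it helps to see how little “secret sauce” is actually required to convey the idea, here is a toy version of that core PageRank logic on a made-up four-page link graph. It is the textbook concept only, nothing like a production search engine:

```python
# Toy PageRank on a tiny made-up link graph, via power iteration.
damping = 0.85
links = {              # page -> pages it links to (hypothetical graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}   # start every page with equal rank

for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share   # each link passes on a share of its page's rank
    rank = new_rank

print(rank)   # "C" ends up highest: every other page links to it
```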

In other words, a lack of willingness to be transparent is a strong sign of bull***t.

How do you re-examine your models?

Having poked a bit at transparency, these last questions get into issues of tractability. You want to hear about the capabilities that the producers of the model have to re-examine the work they have done. Did they build a model a few years ago and now they just keep using it? Or do they make a habit of going back and testing other potential models? Do they save off all their work so that they could easily return to the exact dataset that was used to train a specific version of the model? Are they set up to iterate, or are they simply offering you a one-size-fits-all algorithm?

Good smells here will be discussions about model deployment, maintenance, and archiving. “Streets and sewers” type stuff, as one of my analytics mentors likes to say. Bad smells will be high-level, vague assurances or -- my favorite -- simple appeals to how amazingly bright the team working on it is. If they do vaguely assure you that they are tuning things up “all the time,” then you can hit them with this follow-up question:

Could you go back to a specific prediction you made a year ago and reproduce the exact data set and version of the algorithm behind it?

This is a challenging question, and even a team fully committed to transparency and tractability will probably hedge their answers a bit. That’s ok. The test here is not just about whether they can do it, but whether they are even thinking about this sort of thing. Ideally it opens up a discussion about how they will support you, as the analytics professional responsible for deploying their model, when you get challenged about a particular prediction. It’s the type of question you need to ask now because it will likely be asked of you in the future.
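To give a feel for what a “yes” could look like, here is a sketch of the kind of record-keeping that makes the question answerable. The paths and metadata fields are hypothetical, and dedicated tooling (MLflow, DVC, or similar) handles this far more thoroughly, but the principle is simple: save the model next to a fingerprint of the exact data it was trained on.

```python
# A sketch of archiving enough to reproduce a prediction later on.
# Paths and metadata fields are hypothetical.
import hashlib
import json
import os
from datetime import datetime, timezone

import joblib   # commonly used to serialize scikit-learn models


def archive_model_run(model, training_data_path, model_version):
    """Save the model alongside a fingerprint of its exact training data."""
    os.makedirs("archive", exist_ok=True)

    with open(training_data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    joblib.dump(model, f"archive/model_v{model_version}.joblib")

    manifest = {
        "model_version": model_version,
        "training_data_file": training_data_path,
        "training_data_sha256": data_hash,    # proves which data snapshot was used
        "archived_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(f"archive/manifest_v{model_version}.json", "w") as f:
        json.dump(manifest, f, indent=2)
```

If a year-old prediction gets challenged, a team working this way can pull up the manifest, re-load the archived model, and confirm they are looking at the same data and the same algorithm that produced it.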

As we move forward in this blog series, we’ll get into more nuanced situations. For example, reviewing the features used in the predictions to see if they are diverse and make logical sense. Or checking to see if the type of estimator (algorithm) chosen makes sense for the type of data you provided.

But if the model that you are evaluating fails the bull***t smell test outlined here, then it means that you’re not going to have the transparency and tractability necessary to pick up on those more nuanced smells. So do yourself a favor and do a test whiff from a ways away before you stick your nose any closer.

Written By

As One Model’s Solution Architect, Phil gets paid to be excited about People Analytics. This is a pretty good deal for a naturally excitable person with 10 years of experience in HR and analytics - especially one who drinks more coffee than anyone on the team, except David Wilson.
