In my last post I described phenomena that I used to predict speed dating participants’ decisions by estimating the participants’ general selectivity and perceived desirability. I was planning on following up with a discussion of the phenomena that I used to refine the model by taking into account differences between individuals. But since comments focused on methodology rather than the empirical phenomena, I decided to write about methodology first, so that readers wouldn’t have to suspend disbelief while reading my next post.
This post is more dense and technical than my last one. I wrote it for readers who want to check the details of the work, or who have a strong interest in statistics and/or machine learning. If you don’t fall into either category but are interested in the series, you can skip it without loss of continuity.
Here I’ll address three points:
- The situation that I attempted to simulate and how faithful one should expect the simulation to be.
- The exact definitions of rating averages that I referenced in my last post.
- My criteria for including a feature in the model.
Most of the code that I used is here.
The underlying question that I attempted to address is “Suppose that a speed dating company wanted to organize events with more matches. How could machine learning help?”
As Ryan Carey pointed out, the model that I developed uses data about other speed dates that participants had been on to predict decisions on a given speed date. It is possible to make nontrivial predictions exclusively using information that was available before the participants attended the events, but I haven’t systematically explored how well one could do. So the model that I developed is potentially useful only in the special case where participants had attended similar past events.
In fact, the participants in the dataset attended only a single speed dating event, not multiple events, so it’s not possible to directly check whether the model would in fact predict behavior at future events based on past events. I instead simulated a situation where participants had attended similar events in the past, by imagining that for a given date, all other dates that the pair of people had been on had occurred in a past event.
It’s very likely that the simulation overstates the predictive power that the model would have in practice, if for no other reason than regression to the mean. For example, the most popular participants at an event are more likely than the other participants to have been at their best that day. So the confidence one can have that someone who was chosen by most of their dates at an event will be chosen by partners at a different event is lower than the confidence one can have that the person will be chosen by other partners at the same event.
If one were to apply the model in a real world setting, one would collect data that allowed one to quantify the expected regression to the mean, and also to improve the model.
Conceptually, the foundation of the model is the idea that you can infer a participant’s traits from:
- Averages of the ratings that members of the opposite sex gave the participant (one average for each type of rating).
- Averages of the ratings that the participant gave members of the opposite sex.
For the sake of limiting unnecessary verbiage, it’s useful to think of the decision that a participant makes on a partner as a “rating,” where a ‘no’ decision corresponds to a rating of 0 and a ‘yes’ decision corresponds to a rating of 1.
The first point to make is that given a rater / ratee pair, we need to exclude from consideration both the ratings that the rater gave the ratee and the ratings that the ratee gave the rater. This is because we’re trying to predict whether two people who have never been on a speed date would be interested in seeing each other again if they were to go on a speed date.
Excluding these ratings wouldn’t be crucial if the speed dating events involved each person going on thousands of speed dates: in that case, the ratings that the two people had given each other would correspond to slight perturbations of the averages. But when an event involves only ~15 people, the impact of a single rating on somebody’s average can be large enough so that failing to exclude the individuals’ ratings of one another would substantially overstate the predictive power of the model while simultaneously obscuring what was going on.
Given a rating type R, and two participants A and B whose decisions we’re trying to predict, let R(A,B) be the rating that A gave B, and let R(B) be the sum of the ratings that were given to B. Let N be the number of people who rated B. One might think that the right features to look at are
[R(B) – R(A,B)]/(N – 1) (**)
But perhaps surprisingly, these features are contaminated with the decisions we’re trying to predict, to such a degree that if one didn’t notice this, one would end up with a model with far greater predictive power than one would have in practice. This has to do with the fact that R(B) is generically formed using ratings that B was given after the date that A and B went on, ratings that one would not have access to in practice. The problem of understanding what’s going on is closely related to the Monty Hall problem.
If one understands the Monty Hall problem, the idea is very simple. Suppose that the dataset contained nothing but the participants’ IDs and their decisions. Suppose further that we found that participants’ decisions were yes with probability 50% on average, and that there was no statistically significant difference between raters in how often their decisions were ‘yes’ and no statistically significant difference between ratees in how often raters’ decisions on them were ‘yes.’ It’s clear that in such a situation, the model that’s most likely to generalize is a model that assigns 50% probabilities to all decisions.
Yet we could use the feature (**) with rating type “decision” to create a model that performs better on the dataset as follows. Since we know that the decision frequencies should be 50% on average, we know that if (**) is less than 50%, this is either a consequence of noise in the data (which cuts equally in both directions) or because R(A,B) is 1. If we average over the entire dataset, the noise washes out, and we see that when the feature is less than 50%, R(A,B) is more likely to be 1 than not. So predicting a ‘yes’ decision whenever (**) is less than 50% gives a model that performs better on the dataset while corresponding to worse generalizability.
It’s only in hindsight that we know that (**) being less than 50% corresponds to a higher probability of a ‘yes’ decision: we wouldn’t know this ahead of time.
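To make the mechanism concrete, here is a minimal sketch in Python of the most extreme version of the hypothetical above. It assumes (my idealization, not something taken from the dataset) that every ratee receives exactly half ‘yes’ decisions, so that there is literally no difference between ratees. The function names and parameters are made up for illustration.

```python
import random

def leave_one_out_rate(received, rater):
    """Feature (**) for rating type 'decision': B's yes-rate excluding A's decision."""
    return (sum(received.values()) - received[rater]) / (len(received) - 1)

def in_sample_accuracy(n_ratees=200, n_raters=14, seed=0):
    """Accuracy of the rule 'predict yes whenever (**) is below 50%'."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n_ratees):
        # Idealized assumption: each ratee gets exactly n_raters / 2 'yes'
        # decisions, handed out to raters in random order.
        decisions = [1] * (n_raters // 2) + [0] * (n_raters - n_raters // 2)
        rng.shuffle(decisions)
        received = dict(enumerate(decisions))  # rater index -> 0/1 decision
        for rater, actual in received.items():
            predicted = 1 if leave_one_out_rate(received, rater) < 0.5 else 0
            hits += predicted == actual
            total += 1
    return hits / total

print(in_sample_accuracy())
```

Under this assumption the rule recovers every decision, because within a single ratee the only thing that moves (**) from one rater to the next is the decision that was left out. In real data the ratees’ totals vary, so the leak is only partial, but the direction of the effect is the same.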
So the features (**) are contaminated by the decisions that we’re trying to predict. This is not an abstract hypothetical concern: what led me to recognize the issue is that a random forest model has more predictive power when we use (**) rather than using the average including A. In fact, the performance of the random forest model when this feature is used eclipses the performance of the best generalizable model that I was able to construct. The random forest seems to have used decision rules corresponding to reasoning of the type “the frequency with which other people chose the person is lower than I would expect of somebody so attractive, fun and likable, so probably the person was chosen this time.”
Rather than using (**), we imagine that at the event, B had been on a date with someone other than A, who we call a “surrogate” of A. We model the surrogate of A using another participant A’ that B dated. Conceptually, A’ is a randomly selected participant amongst the participants who B dated, but literally picking one at random would break the symmetry of the data in a way that could dilute the statistical power of the data, so I instead made a uniform choice to replace A by the participant who B would have dated that round if the speed dating schedule had been slightly different.
The resulting feature is
[R(B) – R(A,B) + R(A’, B)]/N
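The surrogate-adjusted average can be sketched as follows. This is an illustration rather than the code I used: the dictionary layout is an assumption, and `next_in_schedule` is a hypothetical stand-in for the rule that picks A’, here taken to be whoever B met in the following round.

```python
def surrogate_average(ratings_received, rater, surrogate):
    """Compute [R(B) - R(A,B) + R(A',B)] / N for one ratee B.

    ratings_received: dict mapping each rater to the rating they gave B.
    rater: A, whose own rating must not enter the feature.
    surrogate: A', another participant who rated B.
    """
    if surrogate == rater or surrogate not in ratings_received:
        raise ValueError("surrogate must be a different rater of the same ratee")
    total = sum(ratings_received.values())   # R(B)
    n = len(ratings_received)                # N
    return (total - ratings_received[rater] + ratings_received[surrogate]) / n

def next_in_schedule(order, rater):
    """Hypothetical surrogate rule: the rater B met one round later (cyclic)."""
    i = order.index(rater)
    return order[(i + 1) % len(order)]

# Example with made-up ratings of one ratee B by three raters.
ratings = {"a1": 6, "a2": 3, "a3": 9}
order = ["a1", "a2", "a3"]
feature = surrogate_average(ratings, "a1", next_in_schedule(order, "a1"))
```

Because A’s rating is swapped for the surrogate’s rather than simply dropped, the sum still has N terms, which is why the denominator is N rather than N – 1.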
In the special case where the rating type is “decision,” the averages correspond to frequencies, and for ease of comparison with other features these are most naturally replaced by their log odds ratios, so I did this.
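As a small sketch of that transformation: the clipping constant `eps` below is my assumption for this illustration, added because a frequency of exactly 0 or 1 has no finite log odds; some such guard is needed, but this particular one is not necessarily what the original code did.

```python
import math

def log_odds(frequency, eps=0.01):
    """Map a 'yes' frequency in [0, 1] to log(p / (1 - p)).

    eps is an illustrative clipping constant that keeps the result finite
    when the frequency is exactly 0 or 1.
    """
    p = min(max(frequency, eps), 1 - eps)
    return math.log(p / (1 - p))
```

A frequency of 50% maps to 0, and frequencies above and below 50% map to positive and negative values respectively, which puts the feature on a scale that is symmetric around “chosen half the time.”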
I normalized these averages by subtracting off the average of all ratings that participants of B’s gender would have received at the event had the surrogates of A and B attended the event in lieu of A and B. This washes out heterogeneity in raters’ rating scales from event to event.
Distinguishing noise from signal: my criteria for including a feature
In order to avoid overfitting the dataset in a way that reduces the generalizability of the findings, I imposed a high threshold for features to meet to be included in the model. From the point of view of discovery, this was very helpful insofar as it helped me discover the core phenomena that I used.
One could argue that the filters are collectively too strict, but I’ve chosen to use them for several reasons:
- The tendency to see signal in noise is so strong that it seems that it’s nearly always the case that when people make effort to avoid it, they’re not doing enough, so it seems better to err on the conservative side.
- I wanted to make an unambiguous case for the features that I did include adding incremental predictive power. I’m fairly confident that to the extent that the factors that influenced the participants at the event reflect general human behavioral tendencies, the predictive power of the features that I identified also generalizes. My main source of uncertainty is that nobody’s checked my work in detail.
- From an expository point of view, the effect sizes of the features that I excluded are arguably too small for them to warrant comment. If I were strictly focused on optimizing for predictive power, I would have included features that improve predictive power by a tiny margin with 60% confidence, but I had no reason to do so: even in aggregate, the resulting difference in predictive power wouldn’t have been striking, it’s unlikely that anyone will actually use the model, and even if someone does, there will be opportunities to collect more data and make a better model.
What’s interesting is not so much exactly how predictive the model is, but what the main driving factors are and how they interact.
I’ve enumerated the criteria below. In practice, there’s a fair amount of redundancy between them: if a feature didn’t pass through one of them, it usually failed to pass through at least one other. But this fact only emerged gradually, and I used each individually at different times.
I tried to keep the number of features that I used small
The dataset that I’ve been working with is derived from 9 speed dating events involving ~160 people of each gender, for a total of ~3000 dates. The size is sufficiently large so that we can hope to get a broad sense for what’s going on, but not sufficiently large so that we can determine the influence of individual idiosyncrasies in great detail. If we hope for too much, we’re apt to base our model on patterns that don’t generalize, regardless of how much cross checking we do.
My final model uses only 5 features to predict men’s decisions and only 3 features to predict women’s decisions.
I only included a feature when the fact that it increased the model’s performance was in consonance with my intuitions
For example, I found that empirically, people who expressed a preference for people who share their interests were considered to be undesirable, but given the small size of the dataset and the absence of evidence for the phenomenon coming from other sources, using this to make predictions seemed ill-advised.
I restricted myself to using features that were derived from a relatively large number of examples, both of speed dates and of people.
The female engineering graduate students in the sample showed a very strong preference for male engineering graduate students over other men. They were also far more receptive to dating the male engineering students than other women were. The engineering/engineering cross feature passed through all other filters that I used aside from this one, but though there were 40 dates between engineering graduate students, they involved only 6 women, so I dropped the feature.
I used cross validation
Suppose there were 20 people who have some trait X, and that most of them were considered about as desirable as usual, but 2 of them were rejected by everyone. In this case it might look like people with trait X are a little less likely to be chosen. We don’t want to base our model on participants’ responses to only two people.
If we split the dataset into two subsets, train our model on one, and test it on the other, then if one of the unpopular people is in the train set and one is in the test set, including the feature could increase the model’s performance on the test set. With a dataset of this size, the boost in performance could be large enough so that one would be inclined to include the feature based on the increase in performance.
The standard method used to avoid this problem is cross-validation: instead of using a single train/test split, use many train/test splits. If including a feature in the model improves performance for a large fraction of train/test splits of sufficiently low redundancy between them, that can provide much stronger evidence that the predictive power of the feature will generalize.
For each event, I split the data into a test set consisting of the event, and a train set consisting of all other events. With this setup:
- When both of the unpopular people are in the train set, including trait X as a feature makes the model’s predictions for the test set worse.
- In the instances where one of the people is in the train set and the other is in the test set, including the feature may improve performance. But there are at most 2 such instances out of 9 train/test splits.
- Should it happen that both people were at the same event, including the feature won’t improve performance for any of the events, because when the two people are in the test set, there’s no pattern in the train set for the model to pick up on. The fact that the model never does better in this case is helpful, because flukish occurrences are more likely to be concentrated in a single event than they are to be split up over different events: for example, maybe the two unusual people are friends who have a lot in common and signed up for the same event together.
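The splitting scheme described above (one train/test split per event) can be sketched like this. The record layout, with each date stored as a dictionary carrying an `"event"` key, is an assumption made for the example.

```python
from collections import defaultdict

def leave_one_event_out(dates):
    """Yield (event, train, test): one split per event.

    dates: iterable of dicts, each with an "event" key (hypothetical layout).
    The test set is every date from one event; the train set is the rest.
    """
    by_event = defaultdict(list)
    for date in dates:
        by_event[date["event"]].append(date)
    for event in sorted(by_event):
        test = by_event[event]
        train = [d for e, rows in by_event.items() if e != event for d in rows]
        yield event, train, test
```

Splitting by event rather than at random means that no participant appears in both the train set and the test set of the same split, which is what makes the three cases above play out as described.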
I required that when predictions are generated in this way (with one train/test split for each event), every feature that I include improve performance
- When we average all predictions made across the whole dataset.
- For a majority of events when we look at the data by event.
- For a majority of raters when we look at the data by rater.
- For a majority of ratees when we look at the dataset by ratee.
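As a rough sketch of how these four checks could be computed, suppose each date has an out-of-sample loss under the model with and without the candidate feature. The field names (`loss_without`, `loss_with`, `event`, `rater`, `ratee`) are made up for this illustration.

```python
from collections import defaultdict

def fraction_improved(rows, key):
    """Share of groups (events, raters or ratees) whose mean loss drops."""
    gains = defaultdict(list)
    for r in rows:
        gains[r[key]].append(r["loss_without"] - r["loss_with"])
    improved = sum(1 for g in gains.values() if sum(g) / len(g) > 0)
    return improved / len(gains)

def check_feature(rows):
    """Return (passes, avg_boost, shares) for one candidate feature.

    rows: one dict per date with its out-of-sample loss with and without
    the feature, plus the event, rater and ratee it belongs to.
    """
    avg_boost = sum(r["loss_without"] - r["loss_with"] for r in rows) / len(rows)
    shares = {k: fraction_improved(rows, k) for k in ("event", "rater", "ratee")}
    passes = avg_boost > 0 and all(s > 0.5 for s in shares.values())
    return passes, avg_boost, shares
```

Requiring a majority at each level of grouping guards against a feature whose average gain comes from a handful of events or people.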
Having spent a long time with the dataset, it was more or less clear to me that the train/test splits that I used were enough, but I realized this may not be a priori clear, so I did a final check in which before forming the train/test splits, I removed each individual from the dataset in turn, and each wave from the dataset in turn. This is in the spirit of leave-one-out cross validation. It turns out to be overkill: (1)-(4) are never violated for any feature that I used, except for one that occasionally fell short of meeting criterion (4) by a single ratee.
I measured performance using “log loss,” which is a technical measure of the quality of probabilistic predictions. I omit a description of it because I figure that readers either already know it or don’t have the time/energy to absorb an explanation, but I can write about it if someone would like.
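For readers who want something concrete to hold onto, the standard definition for yes/no outcomes can be sketched in a few lines. This is the textbook formula rather than an excerpt from my code, and the clipping constant `eps` is an assumption that keeps the logarithm finite.

```python
import math

def log_loss(decisions, probabilities, eps=1e-15):
    """Mean negative log-likelihood of 0/1 decisions under predicted probabilities.

    decisions: sequence of 0/1 outcomes.
    probabilities: predicted probability of a 1 for each outcome.
    eps: illustrative clipping constant to avoid log(0).
    """
    total = 0.0
    for y, p in zip(decisions, probabilities):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(decisions)
```

Lower is better. A model that assigns 50% to every decision scores log 2 ≈ 0.693, and confident wrong predictions are penalized much more heavily than hesitant ones.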
The tables below show how much predictive power increases when we include a given feature, starting from a base consisting of all other features that we used. Here the columns correspond to criteria (1)-(4), and the numbers in the “Avg boost” column are drops in log loss. Since I haven’t defined the features, I’ve left them unlabeled, but I’ll label them once I’ve written my next post.
|Feature||Avg boost||% events||% of raters||% ratees|
|Feature||Avg boost||% events||% of raters||% ratees|
The tables and the criteria that I described don’t tell the whole story as far as overfitting goes: the features depend on numerical parameters, which are themselves overfit to the model, in the sense that to some extent I picked them with a view toward maximizing the numbers in the table.
But this sort of overfitting corresponds to optimizing the expected performance of the model on hypothetical future datasets, which is the opposite of picking features that are likely to be predictive only in the context of the dataset. It overstates the predictive power of the model in more general contexts, but it’s simultaneously the case that not doing it would produce a model that performs worse in general settings.
The choices that I made seem fairly natural, and to the extent that they overstate the model’s predictive power, the effect seems likely to be minor. If one had more data, one could obtain improved estimates for the numerical parameters. The more serious distortion in potential predictive power comes from the absence of data on participants across multiple events.