You’re in an ML Engineer interview at Stripe.
The interviewer asks:
"People often dispute transactions they actually made.
How would you build a supervised model that predicts fake disputes?
There’s no labeled data."
You: "I'll flag cards with high dispute rates."
Interview over.
Here's what you missed:
Active learning is a relatively easy and inexpensive way to build supervised models when you don’t have annotated data to begin with.
As the name suggests, the idea is to build the model with active human feedback on examples it is struggling with.
The visual below summarizes this.
1) Begin by manually labeling a tiny percentage of your data.
2) Build a model on this small labeled dataset. This won’t be a good model, but that’s fine.
3) Next, generate predictions on the dataset you did not label.
Since the dataset is unlabeled, we cannot determine if these predictions are correct.
That’s why we train a model that can implicitly or explicitly provide a confidence level with its predictions.
Probabilistic models are a good fit since their outputs give a proxy for confidence, such as the gap (margin) between the highest and second-highest predicted probabilities: a small gap means the model is unsure.
4) Rank all predictions by this confidence score.
5) Have a human label the low-confidence predictions and feed them back to the model along with the seed dataset. There’s no point in labeling predictions that the model is already confident about.
6) Repeat the process a few times (train → generate predictions and confidence → label low-confidence predictions) and stop when you are satisfied with the performance.
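The steps above can be sketched in a few lines with scikit-learn. This is a minimal illustration on synthetic data, not a production setup: the dataset, the round/query sizes, and the "human annotator" (which here just reveals the true labels) are all placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 1) Manually label a tiny seed set; the rest is treated as unlabeled.
labeled = rng.choice(len(X), size=40, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

N_ROUNDS, QUERY_SIZE = 5, 40  # illustrative values
for _ in range(N_ROUNDS):
    # 2) Train on the current labeled pool.
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # 3) Predict on the unlabeled pool and compute a confidence proxy:
    #    the margin between the top-2 class probabilities.
    probs = model.predict_proba(X[unlabeled])
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]  # small margin = low confidence

    # 4-5) Pick the least-confident examples and "ask a human" for labels
    #      (simulated here by revealing the true labels).
    query = unlabeled[np.argsort(margin)[:QUERY_SIZE]]
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
```

After 5 rounds, only 240 of the 2000 examples ever needed a human label; the rest of the annotation effort is saved.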
Active learning is a huge time-saver in building supervised models on unlabeled datasets.
The only thing you have to be careful about is generating reliable confidence measures.
If you mess this up, it will affect every subsequent training step.
Also, when combining the human-labeled low-confidence data with the seed data, we can include the high-confidence data as well, using the model’s own predictions as its labels.
This variant of active learning is called cooperative learning.
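A sketch of that twist, under the same assumptions as before: predictions whose top probability clears a threshold (0.95 here, purely illustrative) are added back as pseudo-labels, while the rest would go to a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
seed, pool = np.arange(50), np.arange(50, 1000)  # tiny seed set + unlabeled pool

model = LogisticRegression(max_iter=1000).fit(X[seed], y[seed])
probs = model.predict_proba(X[pool])
confident = probs.max(axis=1) >= 0.95  # high-confidence mask

# Pseudo-labels: the model's own predictions on the confident examples.
X_pseudo = X[pool][confident]
y_pseudo = probs[confident].argmax(axis=1)

# Retrain on seed + pseudo-labeled data; low-confidence examples
# would be routed to a human annotator instead.
X_new = np.vstack([X[seed], X_pseudo])
y_new = np.concatenate([y[seed], y_pseudo])
model = LogisticRegression(max_iter=1000).fit(X_new, y_new)
```

The threshold trades label quality against quantity: set it too low and wrong pseudo-labels reinforce the model's mistakes.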
👉 Over to you: Have you used active learning before?