Presentation: Startup ML: bootstrapping a fraud detection system

Duration: 10:35am - 11:25am

Key Takeaways

  • Understand how Stripe got started with Machine Learning, including the motivations and methods.
  • Learn the basic steps to go from 0 to 60% of an ML solution very quickly.
  • Walk away with actionable knowledge to load your data, decide which ML methods to employ, and, ultimately, build a model.

Abstract

Stripe processes billions of dollars a year for businesses around the world. To protect its users from fraud, Stripe employs machine learning to detect potentially fraudulent transactions. In this talk, I'll describe how we bootstrapped this system and some of the most important aspects of industrial machine learning. We'll talk about how to choose, train, and evaluate models, how to bridge the gap between training and production systems, and how to address common pitfalls, using the problem of fraud detection as our motivation. By the end of the talk, you should be familiar with many of the core concepts in practical machine learning: regression, random forests, training and validation sets, ROC curves and AUC, and production scoring, monitoring, and evaluation.

Interview

Question: 
QCon: What is your role today?
Answer: 
Michael: I manage the Machine Learning Products team at Stripe.
I joined Stripe as an engineer and built the first ML systems here. We ended up forming an ML team, and I then moved into management.
I don’t spend that much time directly building anymore, but I work with the engineers on my team on the bigger data science and architectural decisions.
Question: 
QCon: Have you always been ML-focused as an engineer?
Answer: 
Michael: I was at Google before Stripe. At Google, I was doing back-end infrastructure engineering. I have a math background and did the whole PhD and postdoc thing. So the combination of being a mathy person and having engineering experience made me suited to do ML when I joined Stripe. But I actually had not done any ML before arriving at Stripe.
Question: 
QCon: What is the motivation for your talk?
Answer: 
Michael: This talk is about how you go from zero to 60% with ML pretty quickly. It’s about how you can start from nothing and end up building the first pieces to get you on the right track for having a robust machine learning system.
Question: 
QCon: What are the main takeaways for the talk?
Answer: 
Michael: First, the audience will become familiar with how Stripe got started in ML. I’ll answer questions like: what motivated the move to ML, what we did to start along the path, and what went well (or didn't go well).
Another takeaway is that generalists will walk away understanding the basic ML techniques that get them most of the way to a solution.
If you have some problem (say you want to build a fraud detection system), how do you decide between all these buzzwords in ML? Do you want to use a neural net? Probably not. Do you want to use deep learning? Probably not (at least not at first). You want to start with the staples--regression and random forests.
Part two of the talk is about the techniques that one wants to know when they get started in any ML area.
I hope that the talk will be accessible to generalists who are interested in moving into ML, but I also think that there will be useful information here for ML practitioners.
Question: 
QCon: How deep do you dive into ML concepts like random forests and regression?
Answer: 
Michael: For both of those, I want to cover a couple of ideas.
I’ll talk about the mental model behind them (without going deeply into the theory of how things work)--I want people to understand what a logistic regression is and how you think about it, what a random forest is, etc. I will talk about the pros and cons of each of them and give some specific implementation examples (in code). By implementation, I mean how you can build a random forest model using off-the-shelf open source tools (not how to implement the algorithm).
So I hope to cover those three things: the mental model, pros and cons, and enough code to get started. The idea is that one can go home, load a CSV into a Python REPL, and build a model.
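
Editor's sketch: the "load a CSV and build a model" workflow described above might look roughly like the following. This is a minimal illustration, not code from the talk. It assumes pandas and scikit-learn as the off-the-shelf tools (the interview does not name specific libraries), and the column names (amount, card_age_days, is_fraud) and the synthetic data are hypothetical stand-ins for a real transaction export.

```python
import io

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Build a small synthetic CSV so the example runs on its own.
# The columns and the rule that generates labels are made up for illustration.
rng = np.random.default_rng(0)
n = 2000
amount = rng.exponential(scale=50.0, size=n)
card_age_days = rng.integers(0, 1000, size=n)
logit = 0.02 * amount - 0.004 * card_age_days - 1.0
is_fraud = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
csv_text = pd.DataFrame(
    {"amount": amount, "card_age_days": card_age_days, "is_fraud": is_fraud}
).to_csv(index=False)

# With a real file this would be: df = pd.read_csv("transactions.csv")
df = pd.read_csv(io.StringIO(csv_text))

X = df[["amount", "card_age_days"]]
y = df["is_fraud"]

# Hold out a validation set so the models are scored on unseen rows.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The two "staples": a logistic regression and a random forest.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

val_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the predicted probability of the fraud class.
    scores = model.predict_proba(X_val)[:, 1]
    val_auc[name] = roc_auc_score(y_val, scores)
    print(f"{name}: validation AUC = {val_auc[name]:.3f}")
```

Comparing the two validation AUC values gives a first, rough read on which model to iterate on. On real data, the feature columns and the label would come from your own transaction history rather than a generated table.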
