Presentation: Policing The Stock Market with Machine Learning

Duration:
4:10pm - 5:00pm

Key Takeaways

  • Learn how terabyte-scale data can be processed on a single machine
  • Analyze how big data visualisations can highlight illegal trading
  • Understand how data storage and computing power are growing faster than problem sizes

Abstract

Neurensic has built SCORE, a solution for trade surveillance using H2O (an open-source, pure-Java big data machine learning tool), machine learning, and a whole lot of domain expertise and data munging. SCORE pulls in private and public market data and in a few minutes searches it for all sorts of bad behavior: spoofing, wash-trading and cross-trading, pinging, and a lot more. It then filters the billions of rows of data down to human scale with some great visualizations - good enough to use as hard legal evidence. Indeed, SCORE and its underlying tech are not just used by companies to police themselves; they are being used by the public sector to find and prosecute the Bad Guys. I'll close with a demo of a real-life bad guy - he was defrauding the markets out of tens of millions - who got caught via an early alpha version of SCORE. All data is anonymized, of course, so you'll have to go hunt through last year's Wall Street Journal to find his name for real.

Interview

Question: 
What is the focus of your work today?
Answer: 

I’m working on SCORE, which is a tool for trade surveillance. In the capital markets (stock markets, futures trading and the like) there’s an obligation to ensure that the trading is legal, that you’re not trying to be fraudulent in some way.

So most large firms that do trading engage a large number of traders who are not necessarily employees of the company but who are freelance experts. The companies who host them and who give them access to an exchange have a legal obligation to do surveillance to understand what those traders are doing. SCORE is a new tool for doing that trade surveillance.

Question: 
What’s the motivation for your talk?
Answer: 

The technology that’s in place right now is 20 years old, and SCORE is state-of-the-art. It’s H2O on a Java process (instead of an old-school Windows DLL), it uses machine learning instead of a rules-based engine, and it has a GUI in the browser that’s friendlier to work with. It’s also hugely faster and more accurate than what’s gone before.

The other side of it is that we’re solving a very human problem. We’re finding people who are attempting to cheat (or successfully cheating) the stock markets today. There’s a steady stream of people attempting to defraud the markets, and they are by and large successful because the tools for detecting them are really old. Plus, they know how to work around those tools, and they do so on a regular basis.

So as soon as we turn on SCORE in a new trading house, we immediately find people who have been cheating for a long time. It’s very obvious that they’re cheating as soon as you look at the visualisations coming out of the tool. We are catching people who are doing big-dollar cheating.

Question: 
How would you describe the persona of the target audience of this talk?
Answer: 

Probably data architects, some general architects, and some JVM developers. It also applies to CCOs/CROs (Chief Compliance Officers, Chief Risk Officers), but I’m not expecting too many of those in the audience.

There are two types of audience persona. The first is people who are interested in big data: using H2O as a big-data Java product, mining big data, and hearing a success story of how big data and machine learning work together.

The other part is the human-interest side: it’s fun to hear about someone who is cheating, how they cheat, and how you catch them. There will be examples shown, including people who have subsequently gone to jail.

Question: 
What tools and techniques am I going to be able to take away from this talk?
Answer: 

I took the existing toolchain behind SCORE and threw out the database (MariaDB and Hadoop). I’m running on a single structured file system with a single JVM process running H2O. This combination goes a long way towards solving problems that scale up to tens of terabytes (although not to petabytes and beyond).

There’s a GUI built with Elm - which will be covered during the talk - but the key takeaway is that a single JVM on a single machine with a structured file system can scale to handle terabytes’ worth of data.

Question: 
How would you rate the level of this talk?
Answer: 

Mid-to-expert.

Question: 
What do you feel is the most disruptive tech in IT right now?
Answer: 

I’ll say that the constant shrinking of memory cells and the constant increase in the amount of memory on a single node mean that a lot more problems don’t need clustering in order to be solved. Originally I used H2O for its potential clustering ability, but unless you’re Google you can solve a lot of problems on a single fat node. In addition, nodes are getting fatter faster than the problems are getting larger. So today I can buy a 512GB machine and tomorrow I can buy a 1TB machine, which is sufficient to solve most problems.

There’s a huge market opportunity for terabyte-scale big data problems using non-clustered technology. Hadoop is a solution for a giant filesystem or a giant MapReduce. Since disks are getting so big, for a lot of problems I don’t need a giant distributed filesystem just to store the data.

Question: 
QCon targets advanced architects and senior development leads; what actionable takeaways do you feel that type of persona will walk away from your talk with?
Answer: 

Simplify your architecture! No database (unless you really need atomic updates; append-only does NOT count). No Hadoop (unless you really need data scales in the high tens of terabytes and up). Single machines are not hugely faster… but memory on one node has jumped to low-terabyte counts. So: a single machine, in-memory.

Speaker: Cliff Click

CTO @Neurensic

Cliff Click is the CTO of Neurensic, and before that was the CTO and Co-Founder of h2o.ai, the makers of H2O, an open-source math and machine learning engine for Big Data. Cliff wrote his first compiler when he was 15 (Pascal to TRS Z-80!), although Cliff’s most famous compiler is the HotSpot Server Compiler (the Sea of Nodes IR). That compiler showed the world that JIT’d high-quality code was possible, and it was at least partially responsible for bringing Java into the mainstream. Cliff helped Azul Systems build an 864-core pure-Java mainframe that keeps GC pauses on 500GB heaps to under 10ms, and he worked on all aspects of that JVM. Cliff is regularly invited to speak at industry and academic conferences and has published many papers about HotSpot technology. He holds a PhD in Computer Science from Rice University and about 20 patents.

