Conference: March 6-8, 2017
Workshops: March 9-10, 2017
Presentation: Policing The Stock Market with Machine Learning
Location:
- Churchill, G flr.
Duration:
Day of week:
- Tuesday
Level:
- Advanced
Persona:
- Developer
Key Takeaways
- Learn how terabyte-scale data can be processed on a single machine
- Analyze how big data visualisations can highlight illegal trading
- Understand how data storage and computing power are growing faster than problem sizes
Abstract
Neurensic has built a solution, SCORE, for doing Trade Surveillance using H2O (an open-source pure Java Big Data ML tool), Machine Learning, and a whole lot of domain expertise and data munging. SCORE pulls in private and public market data and in a few minutes will search it for all sorts of bad behavior: spoofing, wash-trading and cross-trading, pinging, and a lot more. It then filters the billions of rows of data down to human scale with some great visualizations - well enough to use as hard legal evidence. Indeed SCORE and its underlying tech is not just used by companies to police themselves; it is being used by the public sector to find and prosecute the Bad Guys. I'll close with a demo of a Real Life bad guy - he was defrauding the markets out of tens of millions - who got caught via an early alpha version of SCORE. All data anonymized of course, so you'll have to go hunt last year's Wall Street Journal to find his name for real.
Interview
I’m working on SCORE, which is a tool for trade surveillance. In the capital markets (stock markets, futures trading and the like) there’s an obligation to ensure that the trading is legal, that you’re not trying to be fraudulent in some way.
So most large firms that do trading engage a large number of traders who are not necessarily employees of the company but who are freelance experts. The companies who host them and who give them access to an exchange have a legal obligation to do surveillance to understand what the traders are doing. SCORE is a new tool for doing trade surveillance.
The technology that’s in place right now is 20 years old, and SCORE is state-of-the-art. It’s built on H2O running in a Java process (instead of an old-school Windows DLL), it uses machine learning instead of a rules-based engine, and it has a GUI in the browser that’s friendlier to work with. It’s also hugely faster and more accurate than what came before.
The other side of it is: we’re solving a very human problem. We’re finding people who are attempting to cheat (or successfully cheating) the stock markets today. There’s a steady stream of people attempting to defraud others in the markets, and they are by and large successful because the tools for detecting them are really old. Plus, they know how to work around those tools and they do so on a regular basis.
So as soon as we turn on SCORE in a new trading house, we immediately find people who have been cheating for a long time. It’s very obvious that they’re cheating as soon as you look at the visualisations that are coming out of the tool. We are catching people who are doing big dollar cheating.
Probably data architects, some general architects, and some JVM developers. It also applies to CCOs/CROs (Chief Compliance Officers, Chief Risk Officers), but I’m not expecting too many of those in the audience.
There are two audience personas. The first is people interested in big data: using H2O as a big-data Java product, mining data at scale, and hearing a success story of how big data and machine learning work together.
The other is the human-interest side: it’s fun to hear about someone who is cheating, how they are cheating, and how you catch them. I’ll show examples, including people who have subsequently gone to jail.
I took the prior toolchain for SCORE and threw out the database (MariaDB and Hadoop); I’m running on a single structured file system with a single JVM process running H2O. This combination goes a long way toward solving problems that scale up to tens of terabytes (although not to petabytes and beyond).
There’s a GUI written in Elm - which will be covered during the talk - but the key takeaway is that a single JVM on a single machine with a structured file system can scale to handling terabytes of data.
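As a rough illustration of that takeaway (a minimal sketch, not the actual SCORE code), here is what scanning a large structured file of trade records on a single JVM can look like: memory-map the file in chunks and walk the records in place. The record layout, the field offsets, and the toy "rule" are assumptions for illustration only.

    // Minimal sketch, NOT the SCORE implementation: scan a large fixed-width
    // binary file of trade records on one JVM by memory-mapping it in ~1 GB
    // chunks. RECORD_BYTES, the field offsets, and the check are hypothetical.
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class SingleNodeScan {
        static final int RECORD_BYTES = 32;        // hypothetical fixed-width record
        static final long CHUNK_BYTES = 1L << 30;  // each mapping must fit an int-indexed buffer

        public static void main(String[] args) throws IOException {
            Path file = Path.of(args[0]);          // e.g. a multi-TB trades file (hypothetical)
            long flagged = 0;
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = ch.size();
                long pos = 0;
                while (pos < size) {
                    long len = Math.min(CHUNK_BYTES, size - pos);
                    len -= len % RECORD_BYTES;     // only map whole records per chunk
                    if (len == 0) break;           // trailing partial record: ignore
                    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                    for (int off = 0; off + RECORD_BYTES <= len; off += RECORD_BYTES) {
                        long quantity = buf.getLong(off + 8);    // hypothetical field offset
                        double price  = buf.getDouble(off + 16); // hypothetical field offset
                        if (quantity == 0 || price <= 0) {       // stand-in for a real surveillance check
                            flagged++;
                        }
                    }
                    pos += len;
                }
            }
            System.out.println("records flagged: " + flagged);
        }
    }

The point is not the specific check; it's that a flat, structured file plus one fat node's memory and the OS page cache can stand in for a database or a Hadoop cluster at this scale.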
Mid-to-expert.
I’ll say that the constant shrinking of memory cells and the constant increase in the size of memory on a single node mean that a lot more problems don’t need clustering in order to be solved. Originally I used H2O for its potential clustering ability, but unless you’re Google you can solve a lot of problems on a single fat node. In addition, nodes are getting fatter faster than the problems are getting larger. So today I can buy a 512GB machine and tomorrow I can buy a 1TB machine, which is sufficient to solve most problems.
There’s a huge market opportunity for TB-scale big data problems using non-clustered technology. Hadoop is a solution for a giant filesystem or giant MapReduce. Since disks are getting so big, for a lot of problems I don’t need a giant filesystem to store the data in a distributed fashion.
Simplify your architecture! No DB (unless you really need atomic updates; append-only does NOT count). No Hadoop (unless you really need data scales in the high tens of TB and up). Single machines are not hugely faster… but memory on one node has jumped up to low-TB counts. So: single machine, in-memory.
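To make the "append-only does NOT count" point concrete, here is a minimal sketch (my illustration, not SCORE's storage layer) of an append-only trade log written as a flat file with plain Java NIO; the class name and record layout are assumptions.

    // Minimal sketch: append-only storage without a database. New trade records
    // are appended to a flat file; readers just re-scan or tail it. The record
    // layout (timestamp, quantity, price) is hypothetical.
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class AppendOnlyTradeLog implements AutoCloseable {
        private final FileChannel channel;

        public AppendOnlyTradeLog(Path file) throws IOException {
            // APPEND writes always go to the end of the file; no DB needed for this pattern.
            this.channel = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        public void append(long timestampNanos, long quantity, double price) throws IOException {
            ByteBuffer record = ByteBuffer.allocate(24);   // 8 + 8 + 8 bytes
            record.putLong(timestampNanos).putLong(quantity).putDouble(price).flip();
            while (record.hasRemaining()) {
                channel.write(record);
            }
        }

        public void flush() throws IOException {
            channel.force(false);   // push appended records to disk
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

If you never update records in place, a filesystem append is all the "transaction" you need; the DB only earns its keep when you genuinely need atomic in-place updates.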
Tracks
- Architecting for Failure
  Building fault-tolerant systems that are truly resilient
- Architectures You've Always Wondered About
  QCon classic track. You know the names. Hear their lessons and challenges.
- Modern Distributed Architectures
  Migrating, deploying, and realizing modern cloud architecture.
- Fast & Furious: Ad Serving, Finance, & Performance
  Learn some of the tips and techniques of high-speed, low-latency systems in Ad Serving and Finance
- Java - Performance, Patterns and Predictions
  Skills embracing the evolution of Java (multi-core, cloud, modularity) and reinforcing core platform fundamentals (performance, concurrency, ubiquity).
- Performance Mythbusting
  Performance myths that need busting and the tools & techniques to get there
- Dark Code: The Legacy/Tech Debt Dilemma
  How do you evolve your code and modernize your architecture when you're stuck with part legacy code and technical debt? Lessons from the trenches.
- Modern Learning Systems
  Real-world use of the latest machine learning technologies in production environments
- Practical Cryptography & Blockchains: Beyond the Hype
  Looking past the hype of blockchain technologies; alternate title: Weaselfree Cryptography & Blockchain
- Applied JavaScript - Atomic Applications and APIs
  Angular, React, Electron, Node: the hottest trends and techniques in the JavaScript space
- Containers - State Of The Art
  What is the state of the art, what's next, & other interesting questions on containers.
- Observability Done Right: Automating Insight & Software Telemetry
  Tools, practices, and methods to know what your system is doing
- Data Engineering: Where the Rubber Meets the Road in Data Science
  Science does not imply engineering. Engineering tools and techniques for Data Scientists
- Modern CS in the Real World
  Applied, practical, & real-world dive into industry adoption of modern CS ideas
- Workhorse Languages, Not Called Java
  Workhorse languages not called Java.
- Security: Lessons Learned From Being Pwned
  How Attackers Think. Penetration testing techniques, exploits, toolsets, and skills of software hackers
- Engineering Culture @{{cool_company}}
  Culture, Organization Structure, Modern Agile War Stories
- Softskills: Essential Skills for Developers
  Skills for the developer in the workplace