Presentation: Automatic Clustering At Snowflake
Abstract
For partitioned tables, maintaining good clustering properties for frequently filtered dimensions is critical for partition pruning and query performance. Naive methods of maintaining good clustering are usually expensive, especially when the clustering dimensions differ from the natural dimension along which the data is loaded. The benefit to query time usually tapers off after a certain point, while the cost of reorganizing the data keeps growing. Approximate clustering is cheaper to maintain while still resulting in good pruning performance. In this talk, I will present Snowflake’s clustering capabilities, including our algorithm for incremental maintenance of approximate clustering of partitioned tables, as well as our infrastructure to perform such maintenance automatically. I will also cover some real-world problems we have run into and our solutions.
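As a rough illustration of why clustering matters for pruning, the toy model below counts how many partitions a range filter has to scan under three layouts. It is a sketch under stated assumptions, not Snowflake's algorithm: it assumes each partition keeps only a (min, max) pair for the filtered column, and the partition size, bucket width, and the names `partition_ranges` and `partitions_scanned` are all hypothetical. The "approximate" layout orders rows only by a coarse bucket of the value, leaving load order intact within each bucket.

```python
import random

ROWS_PER_PARTITION = 100   # hypothetical partition size
BUCKET_WIDTH = 1_000       # hypothetical coarseness of the approximate layout


def partition_ranges(values):
    """Split values into fixed-size partitions and keep (min, max) per partition."""
    chunks = [values[i:i + ROWS_PER_PARTITION]
              for i in range(0, len(values), ROWS_PER_PARTITION)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def partitions_scanned(ranges, lo, hi):
    """Count partitions whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for pmin, pmax in ranges if pmin <= hi and pmax >= lo)


rng = random.Random(0)  # fixed seed so the run is repeatable
load_order = [rng.randrange(10_000) for _ in range(10_000)]

layouts = {
    # Rows stay in arrival order, unrelated to the filtered column.
    "load order": load_order,
    # Rows are ordered by coarse bucket only; Python's sort is stable,
    # so load order is preserved inside each bucket.
    "approximate": sorted(load_order, key=lambda v: v // BUCKET_WIDTH),
    # Rows are fully sorted on the filtered column.
    "exact": sorted(load_order),
}

# Filter equivalent to: WHERE col BETWEEN 2000 AND 2099
scanned = {name: partitions_scanned(partition_ranges(rows), 2_000, 2_099)
           for name, rows in layouts.items()}

for name, count in scanned.items():
    print(f"{name}: {count} of {len(load_order) // ROWS_PER_PARTITION} partitions scanned")
```

The model only shows the pruning side of the tradeoff. It does not measure reorganization cost, which is the other half of the argument in the abstract.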