Conference: March 6-8, 2017
Workshops: March 9-10, 2017
Presentation: Effective Data Pipelines: Data Management from Chaos
Location:
- Churchill, Ground floor
Duration:
Day of week:
- Monday
Level:
- Intermediate
Persona:
- Data Scientist
Key Takeaways
- Analyze best practices for approaching data pipeline architecture
- Decide what should be automated and what shouldn't be
- Determine how to prepare for inevitable failure when building data pipelines
Abstract
Creating automated, efficient and accurate data pipelines out of the (often) noisy, disparate and busy data flows used by today's enterprises is a difficult task. Data science teams and engineering teams may be asked to work together to create a management platform (or install one) that helps funnel these streams into the company's so-called data lake. But how are these pipelines managed? Who is in charge of maintaining services and reducing costs? How do we ensure data is not lost, not duplicated and is factually accurate? These concerns, among others, will be discussed alongside implementation decisions for those looking for a practical recommendation on the what and how of data automation workflows.
Interview
Primarily, I help clients with a variety of data science problems. I’ve worked with them on everything from implementing complex automated workflow solutions, similar to what I’ll be talking about at the conference, to simple analytics reporting. I am passionate about ethical machine learning, natural language processing, and building data validation and cleaning tools that ease the workload of data scientists and allow them to do their job more easily (rather than waste time wrangling data).
In the course of my career, I’ve had the opportunity to work on data pipeline tools and solutions that were well thought out, as well as ones that were haphazardly put together and prone to breakage. I remember one job where I had to keep Munin up all day, every day, to monitor when groups of servers would go down due to cron load. Having seen both sides gives me a good perspective on what we can do properly and poorly when it comes to data pipelines; and when talking with other data engineers, I always learn new things. I enjoy speaking at conferences because it helps me meet other folks who care about similar topics, and it often sparks interesting conversations about how to use code to solve problems.
I’m hoping the talk will have some good takeaways for anyone who touches the data ingestion or workflow process at a company. The goal is to give a broad overview of the problems, so the takeaways aren’t specific to one particular architecture or language. That said, there will be some more concrete examples to help discuss implementation details and how to apply best practices.
If you have never touched a data pipeline, the talk probably isn’t for you. That said, if you have ever helped build some data automation, you’ll likely have a frame of reference for the material.
As data engineers, we find ourselves either in a reactive state (i.e. everything is on fire and I’m the person who has to put the fires out) or in a preparation state (i.e. what can I do now to prevent future fires?). The goal of this talk is to offer some perspective that helps ease the reactive elements of the role and allows for optimizations that are actually useful (rather than premature). Thinking about the problems at a larger scale, and in the scope of how others solve them, can help spark new ideas and approaches, or at least give you some perspective on how your automation is going at present.
I think the containerization of the world is producing some interesting side effects and growth that might not have been expected. I know it’s definitely had an impact on data science, allowing data scientists with little operations or systems experience to quickly spin up and organize clusters for large-scale data analysis. The interesting thing is that many of these containers are not necessarily written by data experts; so, for example, attempting to run a Hadoop cluster on Docker instances may have a series of unintended performance issues. That said, I really like that the community is responding to issues with networking and security to help truly democratize large-scale container clusters for data science.
Tracks
- Architecting for Failure
  Building fault-tolerant systems that are truly resilient
- Architectures You've Always Wondered About
  QCon classic track. You know the names. Hear their lessons and challenges.
- Modern Distributed Architectures
  Migrating, deploying, and realizing modern cloud architecture.
- Fast & Furious: Ad Serving, Finance, & Performance
  Learn some of the tips and techniques of high-speed, low-latency systems in Ad Serving and Finance
- Java - Performance, Patterns and Predictions
  Skills embracing the evolution of Java (multi-core, cloud, modularity) and reinforcing core platform fundamentals (performance, concurrency, ubiquity).
- Performance Mythbusting
  Performance myths that need busting and the tools & techniques to get there
- Dark Code: The Legacy/Tech Debt Dilemma
  How do you evolve your code and modernize your architecture when you're stuck with part legacy code and technical debt? Lessons from the trenches.
- Modern Learning Systems
  Real-world use of the latest machine learning technologies in production environments
- Practical Cryptography & Blockchains: Beyond the Hype
  Looking past the hype of blockchain technologies; alternate title: Weasel-free Cryptography & Blockchain
- Applied JavaScript - Atomic Applications and APIs
  Angular, React, Electron, Node: the hottest trends and techniques in the JavaScript space
- Containers - State of the Art
  What is the state of the art, what's next, and other interesting questions on containers.
- Observability Done Right: Automating Insight & Software Telemetry
  Tools, practices, and methods to know what your system is doing
- Data Engineering: Where the Rubber Meets the Road in Data Science
  Science does not imply engineering. Engineering tools and techniques for Data Scientists
- Modern CS in the Real World
  Applied, practical, & real-world dive into industry adoption of modern CS ideas
- Workhorse Languages, Not Called Java
  Workhorse languages not called Java.
- Security: Lessons Learned From Being Pwned
  How attackers think: penetration testing techniques, exploits, toolsets, and skills of software hackers
- Engineering Culture @{{cool_company}}
  Culture, organization structure, modern agile war stories
- Softskills: Essential Skills for Developers
  Skills for the developer in the workplace