Track:

Data Engineering : Where the Rubber meets the Road in Data Science

Location:

Churchill, G flr.

Duration:

10:35am - 11:25am

Day of week:

Monday

Level:

Intermediate

Persona:

Data Scientist

Key Takeaways

Analyze best practices for approaching data pipeline architecture
Decide what you should attempt to automate and what shouldn’t be automated
Determine how to prepare for inevitable failure when building data pipelines

Abstract

Creating automated, efficient and accurate data pipelines out of the (often) noisy, disparate and busy data flows used by today's enterprises is a difficult task. Data science teams and engineering teams may be asked to work together to create a management platform (or install one) that helps funnel these streams into the company's so-called data lake. But how are these pipelines managed? Who is in charge of maintaining services and reducing costs? How do we ensure data is not lost, not duplicated and is factually accurate? These concerns, among others, will be discussed alongside implementation decisions for those looking for a practical recommendation on the what and how of data automation workflows.

Interview

Question:

What is the focus of your work today?

Answer:

Primarily, I help clients with a variety of data science problems. I’ve worked with them on everything from complex automated workflow solutions, similar to what I’ll be talking about at the conference, to simple analytics reporting. I am passionate about ethical machine learning, natural language processing, and creating data validation and cleaning processes that ease the work of data scientists and let them do their jobs more easily (rather than waste time wrangling data).

Question:

What’s the motivation for your talk?

Answer:

In the course of my career, I’ve had the opportunity to work on data pipeline tools and solutions that were well thought out, as well as ones that were haphazardly put together and prone to breakage. I remember one job where I had to keep Munin up all day, every day, to monitor when groups of servers would go down due to cron load. Having seen both sides gives me a good perspective on what we can do well and poorly when it comes to data pipelines, and I always learn new things when talking with other data engineers. I enjoy speaking at conferences because it helps me meet other folks who care about similar topics and often sparks interesting conversations about how to use code to solve problems.

Question:

How would you describe the persona of the target audience of this talk?

Answer:

I’m hoping the talk will have some good takeaways for anyone who has to touch the data ingestion or workflow process at a company. The goal is to give a broader overview of the problems, so that the talk isn’t specific to one particular architecture or language. That said, there will be some more concrete examples to help talk about implementation details and how to apply best practices.

Question:

How would you rate the level of this talk?

Answer:

If you have never touched a data pipeline, it probably isn’t for you. That said, if you have ever helped build some data automation, you’ll likely have some frame of reference to access the materials of the talk.

Question:

QCon targets advanced architects and senior development leads. What do you feel will be the actionable takeaways that type of persona will walk away from your talk with?

Answer:

As data engineers, we find ourselves either in a reactive state (i.e., everything is on fire and I’m the person to put it out) or in a preparation state (i.e., what can I do now to prevent future fires?). The goal of this talk is to create some perspective to help ease the reactive elements of the role and allow for optimizations that are actually useful (rather than premature). Thinking about the problems at a larger scale and in the scope of how others solve them can help spark new ideas and approaches, or at least give you some perspective on how your automation is going at present.

Question:

What do you feel is the most disruptive tech in IT right now?

Answer:

I think the containerization of the world is producing some interesting side effects and growth that might not have been expected. I know that it’s definitely had an impact on data science, allowing data scientists with little operations or systems experience to quickly spin up and organize clusters for large-scale data analysis. The interesting thing is that many of these containers are not necessarily written by data experts; so, for example, attempting to run a Hadoop cluster on Docker instances may cause a series of unintended performance issues. That said, I really like that the community is responding to issues with networking and security to help truly democratize large-scale container clusters for data science.

Speaker: Katharine Jarmul

Python engineer, Founder @kjamistan

Katharine Jarmul is a Python engineer and educator based in Berlin, Germany. She runs a data science consulting company, Kjamistan, and offers several private and public courses on data automation, cleaning and acquisition. She has worked on data extraction and analysis since 2008. She offers several data science and engineering workshops and courses via Safari and other online partnerships. Her passions include natural language processing, ethical machine learning and data unit testing.

Similar Talks

Core Kafka team @Confluent

Ben Stopford

Causal Consistency For Large Neo4j Clusters

Chief Scientist @Neo4j

Jim Webber

Deliver Docker Containers Continuously on AWS

Lead Software Developer @AutoScout24

Philipp Garbe

Creating Space To Be Awesome

CTO who understands the science around helping people do their best

Meri Williams

Thinking Strategically About IoT

Senior Software Engineer @IBM, Committer on Apache Aries

Holly Cummins

In-Memory Caching: Curb Tail Latency with Pelikan

Distributed Systems Engineer Working on Cache @Twitter

Yao Yue

Observability, Event Sourcing and State Machines

Gold Badges Java, JVM, Memory, & Performance @StackOverflow / Lead developer of the OpenHFT project

Peter Lawrey

Assuring Crypto Code with Automated Reasoning

Research Lead, Software Correctness @Galois

Aaron Tomb

The Hitchhiker's Guide to Serverless Javascript

Director of Engineering @Bustle

Steve Faulkner

Tracks

Architecting for Failure

Building fault-tolerant systems that are truly resilient

Architectures You've Always Wondered about

QCon classic track. You know the names. Hear their lessons and challenges.

Modern Distributed Architectures

Migrating, deploying, and realizing modern cloud architecture.

Fast & Furious: Ad Serving, Finance, & Performance

Learn some of the tips and techniques of high speed, low latency systems in Ad Serving and Finance

Java - Performance, Patterns and Predictions

Skills embracing the evolution of Java (multi-core, cloud, modularity) and reinforcing core platform fundamentals (performance, concurrency, ubiquity).

Performance Mythbusting

Performance myths that need busting and the tools & techniques to get there

Dark Code: The Legacy/Tech Debt Dilemma

How do you evolve your code and modernize your architecture when you're stuck with part legacy code and technical debt? Lessons from the trenches.

Modern Learning Systems

Real world use of the latest machine learning technologies in production environments

Practical Cryptography & Blockchains: Beyond the Hype

Looking past the hype of blockchain technologies, alternate title: Weasel-free Cryptography & Blockchain

Applied JavaScript - Atomic Applications and APIs

Angular, React, Electron, Node: The hottest trends and techniques in the JavaScript space

Containers - State Of The Art

What is the state of the art, what's next, & other interesting questions on containers.

Observability Done Right: Automating Insight & Software Telemetry

Tools, practices, and methods to know what your system is doing

Data Engineering : Where the Rubber meets the Road in Data Science

Science does not imply engineering. Engineering tools and techniques for Data Scientists

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS ideas

Workhorse Languages, Not Called Java

Workhorse languages not called Java.

Security: Lessons Learned From Being Pwned

How Attackers Think. Penetration testing techniques, exploits, toolsets, and skills of software hackers

Engineering Culture @{{cool_company}}

Culture, Organization Structure, Modern Agile War Stories

Softskills: Essential Skills for Developers

Skills for the developer in the workplace

Conference for Professional Software Developers

Presentation: Effective Data Pipelines: Data Mngmt from Chaos