Presentation: Effective Data Pipelines: Data Management from Chaos



10:35am - 11:25am




Key Takeaways

  • Analyze best practices for approaching data pipeline architecture
  • Decide what should be automated and what shouldn't be
  • Determine how to prepare for inevitable failure when building data pipelines


Creating automated, efficient and accurate data pipelines out of the (often) noisy, disparate and busy data flows used by today's enterprises is a difficult task. Data science teams and engineering teams may be asked to work together to create a management platform (or install one) that helps funnel these streams into the company's so-called data lake. But how are these pipelines managed? Who is in charge of maintaining services and reducing costs? How do we ensure data is not lost, not duplicated and is factually accurate? These concerns, among others, will be discussed alongside implementation decisions for those looking for a practical recommendation on the what and how of data automation workflows.
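One way to address the duplication concern raised above — a minimal sketch under assumed conditions, not the implementation the talk presents — is to key each incoming record by a hash of its content, so that replayed batches are ingested idempotently (the function and field names here are hypothetical):

```python
import hashlib
import json

def record_key(record: dict) -> str:
    """Derive a stable key from a record's content, so a replayed
    message hashes to the same key and can be dropped as a duplicate."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def ingest(records, seen=None):
    """Idempotent ingestion: re-processing the same batch adds nothing."""
    seen = set() if seen is None else seen
    accepted = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            accepted.append(record)
    return accepted, seen

batch = [{"id": 1, "value": "a"}, {"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
first, seen = ingest(batch)        # duplicate inside the batch is dropped
second, seen = ingest(batch, seen) # replaying the whole batch yields nothing new
```

In a real pipeline the `seen` set would live in a durable store (a database or key-value cache) rather than in process memory, but the principle — make writes safe to repeat — is the same.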


What is the focus of your work today?

Primarily, I help clients with a variety of data science problems, from implementing complex automated workflow solutions, similar to what I’ll be talking about at the conference, to simple analytics reporting. I am passionate about ethical machine learning, natural language processing, and building data validation and cleaning tools that ease the workload of data scientists and let them do their jobs more easily (rather than wasting time wrangling data).
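As an illustration of the kind of data validation mentioned here — a minimal sketch with made-up field names, not code from the talk or the speaker's courses — a pipeline stage can reject malformed records before they ever reach an analyst:

```python
def validate(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []
    if not record.get("user_id"):
        errors.append("missing user_id")
    ts = record.get("timestamp")
    if not isinstance(ts, (int, float)) or ts <= 0:
        errors.append("bad or missing timestamp")
    return errors

rows = [
    {"user_id": "u1", "timestamp": 1519862400},
    {"user_id": "", "timestamp": 1519862401},  # rejected: empty user_id
    {"user_id": "u2", "timestamp": -5},        # rejected: bad timestamp
]
clean = [r for r in rows if not validate(r)]
```

Running checks like this at ingestion time, and routing the failures somewhere visible instead of silently dropping them, is what turns data cleaning from an ad-hoc chore into part of the pipeline.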

What’s the motivation for your talk?

In the course of my career, I’ve had the opportunity to work on data pipeline tools and solutions that were well thought out, as well as ones haphazardly put together and prone to breakage. I remember one job where I had to keep Munin open all day, every day, to monitor when groups of servers would go down under cron load. Having seen both sides gives me a good perspective on what we can do properly, and poorly, when it comes to data pipelines; and when talking with other data engineers, I always learn new things. I enjoy speaking at conferences because it helps me meet other folks who care about similar topics and often sparks interesting conversations about how to use code to solve problems.

How would you describe the persona of the target audience of this talk?

I’m hoping the talk will have some good takeaways for anyone who touches the data ingestion or workflow process at a company. The goal is to give a broader overview of the problems, so that they aren’t specific to one particular architecture or language. That said, there will be some more concrete examples to help discuss implementation details and how to apply best practices.

How would you rate the level of this talk?

If you have never touched a data pipeline, it probably isn’t for you. That said, if you have ever helped build some data automation, you’ll likely have a frame of reference for the material in the talk.

QCon targets advanced architects and senior development leads. What actionable takeaways do you feel that type of persona will walk away from your talk with?

As data engineers, we find ourselves either in a reactive state (i.e., everything is on fire and I’m the person putting the fires out) or in a preparatory state (i.e., what can I do now to prevent future fires?). The goal of this talk is to offer perspective that helps ease the reactive elements of the role and allows for optimizations that are actually useful (rather than premature). Thinking about the problems at a larger scale, and in the scope of how others solve them, can help spark new ideas and approaches, or at least give you some perspective on how your automation is going at present.

What do you feel is the most disruptive tech in IT right now?

I think the containerization of the world is producing some interesting side effects and growth that might not have been expected. I know it’s definitely had an impact on data science, allowing data scientists with little operations or systems experience to quickly spin up and organize clusters for large-scale data analysis. The interesting thing is that many of these containers are not necessarily written by data experts; so, for example, attempting to run a Hadoop cluster on Docker instances may surface a series of unintended performance issues. That said, I really like that the community is responding to issues with networking and security to help truly democratize large-scale container clusters for data science.

Speaker: Katharine Jarmul

Python engineer, Founder @kjamistan

Katharine Jarmul is a Python engineer and educator based in Berlin, Germany. She runs a data science consulting company, Kjamistan, and offers several private and public courses on data automation, cleaning and acquisition. She has worked on data extraction and analysis since 2008. She offers several data science and engineering workshops and courses via Safari and other online partnerships. Her passions include natural language processing, ethical machine learning and data unit testing.



