Presentation: The mechanics of testing large data pipelines

Duration: 11:50am - 12:40pm

Abstract

Applied machine learning data pipelines are being developed at a very fast pace and often exceed traditional web/business application codebases in scale and complexity. The algorithms and processes these data workflows implement power business-critical applications, which require robust and scalable architectures. But how do we make these data pipelines robust? When the number of developers and data jobs grows while the underlying data changes at the same time, how do we test that everything works as expected?

In software development, we divide systems into clean, independent modules and use unit and integration testing to prevent bugs and regressions. So why is it more complicated with big data workflows? Partly because these workflows usually pull data from dozens of sources outside our control and involve a large number of interdependent data processing jobs, and partly because we don't yet know how to do it or lack the proper tools.

This talk will explore the mechanics of testing large, complex data workflows and identify the most common challenges developers face. We'll look at good practices for developing unit, integration, data, and performance tests for data workflows. In terms of tools, we'll look at what exists today for Hadoop, Pig, and Spark, with code examples.
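To give a flavor of the kind of code examples the talk refers to, below is a minimal sketch of unit-testing a Spark transformation against a local SparkSession with PySpark. The transformation clean_events and the test data are illustrative assumptions, not material from the talk itself.

# Minimal illustrative sketch (not from the talk): unit-testing a Spark
# transformation with a local SparkSession. `clean_events` is a hypothetical
# transformation; run with pytest.
from pyspark.sql import SparkSession, functions as F

def clean_events(df):
    # Transformation under test: drop rows with a null user_id and
    # normalize event_type to lower case.
    return (df.filter(F.col("user_id").isNotNull())
              .withColumn("event_type", F.lower(F.col("event_type"))))

def test_clean_events():
    spark = (SparkSession.builder
             .master("local[2]")   # run Spark inside the test process
             .appName("pipeline-unit-test")
             .getOrCreate())
    try:
        input_df = spark.createDataFrame(
            [("u1", "CLICK"), (None, "VIEW"), ("u2", "View")],
            ["user_id", "event_type"])
        result = clean_events(input_df).collect()
        # The null user_id row is dropped and event types are lower-cased.
        assert [(r.user_id, r.event_type) for r in result] == [
            ("u1", "click"), ("u2", "view")]
    finally:
        spark.stop()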

