Presentation: Data Cleansing and Understanding Best Practices



11:50am - 12:40pm





Any data scientist who works with real data will tell you that the hardest part of any data science task is the data preparation. From cleaning dirty data to understanding where your data is missing and how it is shaped, the care and feeding of your data is a prime task for the working data scientist.

I will describe my experiences in the field and present some useful open source software to automate some of the necessary but insufficient things that I do every time I'm presented with new data. In particular, we'll talk about discovering missing values, values with skewed distributions, and likely errors within your data, as well as a novel approach to finding data interconnectedness based on usage, using unsupervised learning.
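As a rough illustration of the first three checks named above (missing values, skewed distributions, likely errors), here is a minimal sketch in plain Python. It is not the software presented in the talk: the function names, the sample `ages` column, and the choice of a median-absolute-deviation rule with a 3.5 cutoff are all assumptions made for this example.

```python
import math
from statistics import mean, median


def _is_missing(value):
    """Treat None and NaN as missing (an assumption; real data has more sentinels)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def present(values):
    """Return only the non-missing entries."""
    return [v for v in values if not _is_missing(v)]


def missing_fraction(values):
    """Fraction of entries in a column that are missing."""
    if not values:
        return 0.0
    return sum(1 for v in values if _is_missing(v)) / len(values)


def skewness(values):
    """Population skewness (third standardized moment) of the non-missing values."""
    xs = present(values)
    n = len(xs)
    if n < 3:
        return 0.0
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def likely_errors(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute
    deviation) exceeds the threshold. The 3.5 cutoff is a common convention,
    not a rule from the talk."""
    xs = present(values)
    if not xs:
        return []
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    if mad == 0:
        return []
    return [x for x in xs if 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical column: ages with two missing entries and one data-entry error.
ages = [34, 29, None, 41, 38, 33, 420, float("nan"), 36]
print(missing_fraction(ages))
print(skewness(ages))
print(likely_errors(ages))  # prints [420]
```

Checks like these are cheap to run on every new column, which is why they are worth automating; they remain necessary but insufficient, since a value can pass all three and still be wrong.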

I will also describe the impact of these lessons on team construction and how to avoid some of the most painful pitfalls.

Speaker: Casey Stella

Committer and PMC member on the Apache Metron project

I am a committer and PMC member on the Apache Metron project, on the engineering team at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle, and as a Research Geophysicist in the Oil & Gas industry. I specialize in writing software and solving problems where there are scalability concerns due to either large amounts of traffic or large amounts of data. I have a particular passion for data science problems and anything mathematical.



