It’s the Journey — An Iterative Process for Data Cleaning

Data cleaning is most powerful when treated as an iterative process: one that starts with project goals, proceeds through inspection and cleaning, and arrives at insights that feed the next cycle. In this article, you’ll learn:

  1. The fundamental steps in the data cleaning process
  2. Why treating the process as iterative can make your projects more successful
  3. How you can tailor the process to fit your organization’s structure and needs

A Framework for an Iterative Data Cleaning Process

Guidelines for Each Step

  1. Set Goals
  • Set data quality goals for validity, accuracy, completeness, consistency, and uniformity. Above all, keep the end application in mind so you can distinguish a must-achieve benchmark from an acceptable range of results.
  • At the outset, create a data cleaning rulebook for the project. This guide will begin with goals, then capture detailed process guidelines and findings from each step in the cleaning cycle. This book becomes the heart of your iterative data cleaning methodology and a key component of the project documentation.
  2. Select a Representative Subset
  • Choose a subset of data to work on and make sure it is representative of the live data, at least during the first iterations. Choosing a representative data set requires careful thought; common methods include simple random, systematic, stratified, and cluster sampling (see the sketch following this step’s bullets). A truly random sample can be compute-intensive, while a non-random subset may be sufficient for early hypothesis exploration.
  • Explicitly allocate time to consider how biases may be introduced into your data, your model, or even your application goals. Bias identification requires much more of a right-brain, holistic exploration than a traditional data sampling review. As with each step in this process, document your findings.
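To make the sampling step concrete, here is a minimal sketch of stratified sampling with pandas and scikit-learn. The file name raw_data.csv and the label column are hypothetical placeholders, not details from the article.

```python
# Minimal sketch of stratified sampling: draw a small working subset
# that preserves the class balance of the full data set.
# "raw_data.csv" and the "label" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("raw_data.csv")

# Keep 10% of the rows, stratified on the label so class proportions
# in the subset mirror the live data. A fixed seed keeps iterations
# reproducible, which matters when you repeat the cleaning cycle.
subset, _ = train_test_split(
    df,
    train_size=0.10,
    stratify=df["label"],
    random_state=42,
)

print(subset["label"].value_counts(normalize=True))
```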
  3. Inspect the Data
  • Exploratory Data Analysis, or EDA, is the process of visualizing datasets from multiple angles to quickly get a holistic understanding of the data. It is an effective way to analyze, and then respond to, the statistical characteristics your dataset presents. For example, you may discover outlier data points that are either erroneous and should be removed from the data set, or that should be explored further to uncover new, real-world situations reflected in the data (see the EDA sketch below).
  • As you probe for errant data points, keep the five most common data cleaning needs top of mind: formatting, corruption, duplication, missing data, and outliers. Make each of the five categories a sub-section in your rulebook.
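As one possible starting point for EDA, the sketch below prints summary statistics and flags outliers with the common 1.5 × IQR rule; the sensor_value column is a hypothetical placeholder.

```python
# Minimal EDA sketch: summary statistics plus a simple IQR-based
# outlier flag. "sensor_value" is a hypothetical numeric column.
import pandas as pd

df = pd.read_csv("raw_data.csv")

print(df.describe())    # ranges, means, and quartiles at a glance
print(df.isna().sum())  # missing values per column

q1, q3 = df["sensor_value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["sensor_value"] < q1 - 1.5 * iqr)
              | (df["sensor_value"] > q3 + 1.5 * iqr)]

# Flag, don't auto-delete: as noted above, an outlier may be an error
# or a real-world situation worth keeping in the data.
print(f"{len(outliers)} potential outliers flagged for review")
```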
  4. Clean the Data
  • Request the eBook from Innotescus on techniques for resolving each of the 5 Common Machine Learning Data Cleaning Problems (listed in the previous bullet).
  • Assess the available code libraries and APIs that turn data cleaning into a more efficient, higher-level task, and document your data cleaning technology choices in the rulebook (a pandas-based sketch follows this step’s bullets).
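To illustrate, here is a hedged pandas sketch touching three of the five categories (formatting, duplication, and missing data); every column name is a hypothetical placeholder, not from the article.

```python
# Minimal sketch of common pandas cleaning idioms. All column names
# ("city", "recorded_at", "label", "sensor_value") are hypothetical.
import pandas as pd

df = pd.read_csv("raw_data.csv")

# Formatting: normalize string casing/whitespace and parse dates
# consistently; unparseable dates become NaT rather than bad values.
df["city"] = df["city"].str.strip().str.title()
df["recorded_at"] = pd.to_datetime(df["recorded_at"], errors="coerce")

# Duplication: drop exact repeats, keeping the first occurrence.
df = df.drop_duplicates()

# Missing data: drop rows with no label; impute a numeric feature
# with its median so downstream statistics are not skewed.
df = df.dropna(subset=["label"])
df["sensor_value"] = df["sensor_value"].fillna(df["sensor_value"].median())
```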
  5. Verify the Results
  • Confirm that the data has been thoroughly cleaned by comparing the data set to the validation conditions established in your rulebook during the goal definition phase. It’s possible to automate simple test cases for formats, boundary conditions, value ranges, etc. (see the validation sketch below).
  • Ensure that your data set is clean enough to be fed to a “naive model,” which resembles your end product at its outset. Naive classification models start with no tuning (feature engineering) related to the underlying real-world situation; they are essentially a blank slate (see the baseline sketch below).
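The automated test cases mentioned above can be as simple as a handful of assertions derived from the rulebook; the conditions below are hypothetical examples, not rules from the article.

```python
# Minimal sketch: rulebook validation conditions as assertions.
# Every range and column name is a hypothetical placeholder for the
# goals set in step 1.
import pandas as pd

df = pd.read_csv("cleaned_data.csv")  # hypothetical cleaned output

def validate(df: pd.DataFrame) -> None:
    assert df["sensor_value"].between(0.0, 100.0).all(), "value out of range"
    assert df["label"].isin({"cat", "dog"}).all(), "unexpected label"
    assert not df.duplicated().any(), "duplicate rows remain"
    assert df["recorded_at"].notna().all(), "unparsed timestamps remain"

validate(df)  # raises AssertionError on the first failed condition
```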
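One way to realize the naive-model check, assuming a classification task, is scikit-learn’s DummyClassifier as an accuracy floor plus an untuned linear model; the feature and label columns are hypothetical.

```python
# Minimal sketch of a "naive model" sanity check on the cleaned data.
# Feature and label columns are hypothetical placeholders.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cleaned_data.csv")
X = df[["sensor_value"]]
y = df["label"]

# A majority-class dummy sets the accuracy floor; an untuned model
# (no feature engineering, a blank slate) should beat it if the
# cleaned data carries real signal.
floor = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y).mean()
naive = cross_val_score(LogisticRegression(max_iter=1000), X, y).mean()
print(f"dummy accuracy={floor:.3f}, naive baseline={naive:.3f}")
```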
  6. Document and Report
  • Document essential instructions in your data cleaning rulebook, including a list of erroneous data sources that should be avoided.
  • Much as any scientist would, keep a record of insights and hypotheses you’ve developed during the current iteration, then plan for the steps you intend to take in the next round.
  7. Scale the Effort
  • After completing several iterative cycles, your team should have settled on a well-documented data cleaning rulebook. The near-final book should contain a section describing the data cleaning effort in terms of types of work, skill sets needed, intensity, and duration.
  • Your ultimate goal is to scale the project by delegating tasks from the most senior team members, who are breaking new ground, to other more cost-effective resources. These may include the annotation team or a third-party data cleaning service provider.

Improving the Process Through Iteration, Prioritization and Customization
