It’s the Journey — An Iterative Process for Data Cleaning

Data cleaning is most powerful when treated as an iterative process: one that starts with project goals, proceeds through inspection and cleaning, and arrives at insights. This article covers:

  1. The fundamental steps in the data cleaning process
  2. Why treating the process as iterative can make your projects more successful
  3. How you can tailor the process to fit your organization’s structure and needs

A Framework for an Iterative Data Cleaning Process

Guidelines for Each Step

  1. Set Goals
  • Set data quality goals for validity, accuracy, completeness, consistency, and uniformity. Above all, keep the end application in mind so you can distinguish a must-achieve benchmark from an acceptable range of results.
  • At the outset, create a data cleaning rulebook for the project. This guide will begin with goals, then capture detailed process guidelines and findings from each step in the cleaning cycle. This book becomes the heart of your iterative data cleaning methodology and a key component of the project documentation.
  2. Inspect
  • Choose a subset of data to work on and make sure that it is representative of the live data, at least during the first iterations. Choosing a representative data set requires careful thought; methods include Simple Random, Systematic, Stratified, and Clustered sampling. A truly random sample can be computationally intensive, while a non-random subset may be sufficient for early hypothesis exploration (see the sampling sketch after this step’s bullets).
  • Explicitly allocate time to consider how biases may be introduced into your data, your model, or even your application goals. Bias identification requires much more of a right-brain, holistic exploration than a traditional data sampling review. As with each step in this process, document your findings.
  • Exploratory Data Analysis, or EDA, is the process of visualizing datasets from multiple angles to quickly build a holistic understanding of the data. It is an effective way to analyze, and then respond to, the statistical characteristics your dataset presents. For example, you may discover outlier data points that are erroneous and should be removed from the data set, or that should be explored further to uncover new, real-world situations reflected in the data (see the EDA sketch after this step’s bullets).
  • Keep the 5 most common data cleaning needs top of mind as you probe for errant data points: formatting, corruption, duplication, missing data, and outliers. Make each of the five categories a sub-section in your rulebook.
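The sampling sketch referenced above is a minimal illustration of two of the methods named in the subset bullet, simple random and stratified, using pandas and scikit-learn. The file name, the "label" column, and the 10% subset size are assumptions made for the example, not part of the original guidelines.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input; "label" is an assumed class column for stratification.
df = pd.read_csv("dataset.csv")

# Simple random sampling: every row has an equal chance of selection.
random_subset = df.sample(frac=0.10, random_state=42)

# Stratified sampling: the subset preserves the class balance of "label".
stratified_subset, _ = train_test_split(
    df, train_size=0.10, stratify=df["label"], random_state=42
)
```

Fixing random_state keeps the subset reproducible, so later cleaning iterations compare like with like.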
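The EDA sketch is similarly brief: a few summary statistics and plots, assuming the same hypothetical DataFrame, go a long way toward a first holistic view.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")  # same hypothetical file as above

# Summary statistics and missing-value counts for a quick overview.
print(df.describe(include="all"))
print(df.isna().sum())

# Histograms of every numeric column; outliers show up as isolated bars.
df.hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
```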
  3. Clean
  • Request the eBook from Innotescus on techniques for resolving each of the 5 Common Machine Learning Data Cleaning Problems (listed in the Inspect step above).
  • Assess the available code libraries and API sets that turn data cleaning into a more efficient, higher-level task; the pandas sketch below touches each of the five categories once. Document your data cleaning technology choices in the rulebook.
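As a minimal sketch of that higher-level style, the snippet below addresses each of the five common problems exactly once; the column names ("city", "price") and the 3-sigma threshold are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical input

# Formatting: strip whitespace and normalize case in a text column.
df["city"] = df["city"].str.strip().str.title()

# Corruption: coerce to numeric; unparseable entries become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Duplication: drop exact duplicate rows.
df = df.drop_duplicates()

# Missing data: fill numeric gaps with the column median (one common policy).
df["price"] = df["price"].fillna(df["price"].median())

# Outliers: flag rows more than 3 standard deviations from the mean.
z_scores = (df["price"] - df["price"].mean()) / df["price"].std()
suspected_outliers = df[z_scores.abs() > 3]
```

Whether flagged outliers are removed or investigated further is exactly the kind of decision the rulebook should record.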
  4. Verify
  • Confirm that the data has been thoroughly cleaned by comparing the data set to the validation conditions established in your rulebook during the goal-definition phase. It is possible to automate simple test cases for formats, boundary conditions, value ranges, and the like (see the validation sketch after this step’s bullets).
  • Ensure that your data set is clean enough to be fed to a “naive model,” a version of your end product as it exists at the outset. Naive classification models possess no tuning or feature engineering related to the underlying real-world situation; they are essentially a blank slate (see the baseline sketch after this step’s bullets).
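The validation sketch below shows how the simple test cases mentioned above can be plain assertions run after every cleaning pass. The specific rules (no duplicates, no missing prices, a plausible price range) are hypothetical examples of conditions a rulebook might set, continuing the column names from the cleaning sketch.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Check cleaned data against rulebook conditions; fail loudly otherwise."""
    assert not df.duplicated().any(), "duplicate rows remain"
    assert df["price"].notna().all(), "missing prices remain"
    assert df["price"].between(0, 10_000).all(), "price outside expected range"

# Run against the cleaned output (a hypothetical file saved after cleaning).
validate(pd.read_csv("cleaned_dataset.csv"))
```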
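For the baseline sketch, one minimal stand-in for a naive model (not necessarily your end architecture) is scikit-learn's DummyClassifier: it applies no tuning or feature engineering at all, so failures here point at the data rather than the model. The feature and label column names are assumed placeholders.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cleaned_dataset.csv")  # hypothetical cleaned output
X = df[["price", "quantity"]]            # assumed feature columns
y = df["label"]                          # assumed label column

# An untuned baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, y, cv=5)
print(f"baseline accuracy: {scores.mean():.3f}")
```

Any real model trained later should clear this baseline; if it cannot, revisit the cleaning goals before blaming the model.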
  5. Report
  • Document essential instructions in your data cleaning rulebook, including a list of erroneous data sources that should be avoided.
  • Much as any scientist would, keep a record of insights and hypotheses you’ve developed during the current iteration, then plan for the steps you intend to take in the next round.
  • After completing several iterative cycles, your team should have settled on a well-documented data cleaning rulebook. The near-final book should contain a section describing the data cleaning effort in terms of types of work, skill sets needed, intensity, and duration.
  • Your ultimate goal is to scale the project by delegating tasks from the most senior team members, who are breaking new ground, to more cost-effective resources such as the annotation team or a third-party data cleaning service provider.

Improving the Process Through Iteration, Prioritization and Customization
