Many more ideas/projects are forthcoming. Let’s build some cool stuff together!

Improve EDA with Data Comparator

Once upon a time, my team and I were faced with the task of optimizing a critical section of our data pipeline. The ultimate objective was to convert a series of SAS scripts to Python, specifically for use with PySpark. In the course of this conversion, I found myself constantly comparing the old SAS-driven outputs with the Spark-driven outputs in order to validate the new code.

I repeated a similarly clunky comparison workflow for several other projects before my lazy-spidey senses told me that this was ripe for automation. The existing data profiling tools that I researched at the time were largely pandas-based and “heavier” than I preferred, so I developed the Data Comparator tool to account for those preferences. Let’s get to some action first, and then I’ll explain a bit of my rationale behind the tool.

Loading Data

We first load some historical Olympic data scraped from www.sports-reference.com and set it up to compare all identically-named columns:

load_data.gif

We can drag the data into the columns list or use the Select File dialog to load the data. Data Comparator currently supports parquet, csv, sas7bdat, and json files (as well as Spark and Pandas DataFrames via the command line). The various data types are converted into a Pandas DataFrame. Smaller data files are ideal, but for larger data (a few GB or more), it’s best to limit the scope of the dataset by playing around with the Input Params dialog or by using a subset of that data.
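For readers who want to see the idea outside the GUI, pairing up identically-named columns can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Data Comparator's own code: the `shared_columns` helper and the two tiny CSV extracts are made up for the example.

```python
import csv
import io


def shared_columns(header_a, header_b):
    """Return the column names present in both headers, in header_a's order."""
    names_b = set(header_b)
    return [name for name in header_a if name in names_b]


# Two tiny, made-up CSV extracts standing in for the real Olympic files.
winter_csv = io.StringIO("ID,Name,Weight,Event\n1,A,70,Luge\n")
summer_csv = io.StringIO("ID,Name,Weight,Sport\n2,B,154,Judo\n")

# Only the header row is needed to decide which columns can be compared.
winter_header = next(csv.reader(winter_csv))
summer_header = next(csv.reader(summer_csv))

pairs = shared_columns(winter_header, summer_header)
print(pairs)
```

Columns that appear in only one dataset (`Event` and `Sport` here) are simply left out of the comparison.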

Profiling

In the Data Details tab, we take a quick look at the data in tabular form and it looks reasonable. We click into the Dataset Details dialog and notice that the ID column was read as a numeric value — we may want to convert that to a string at some point.

 
profiling.png
 

Comparison

We make a few comparisons of the data between the 2014 Winter games and 2016 Summer games and everything seems consistent so far… until we get to the Weight column.

 
comp2.png
 

Are 2016 Summer participants really that much heavier than 2014 Winter participants or are the datasets using different units of measurement?

 
comp3.png
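One quick way to probe the units question is to check whether the ratio of the two column means sits close to the pounds-per-kilogram factor. The sketch below is my own illustration rather than a feature of the tool: the function name, the 10% tolerance, and the sample weights are all assumptions.

```python
from statistics import mean

KG_PER_LB = 0.45359237  # exact by definition of the international pound


def likely_unit_mismatch(values_a, values_b, tolerance=0.1):
    """Guess whether values_b looks like values_a re-expressed in pounds.

    Compares the ratio of the two means against the lb-per-kg factor
    (about 2.2) and allows a relative tolerance around it.
    """
    ratio = mean(values_b) / mean(values_a)
    lb_per_kg = 1 / KG_PER_LB
    return abs(ratio - lb_per_kg) / lb_per_kg <= tolerance


# Made-up samples: the first is assumed to be in kg, the second in lb.
winter_weights = [62.0, 75.0, 80.0, 68.0]
summer_weights = [137.0, 165.0, 176.0, 150.0]

print(likely_unit_mismatch(winter_weights, summer_weights))
```

A ratio near 2.2 is only a hint. Two groups of athletes can genuinely differ in weight, so a flagged column still deserves a look at the source documentation.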
 

We’ve set up and performed validations for the Names column and the results look pretty messy: some names are in all caps, some include nicknames, and some are valid but long (shoutout to the gloriously-named Hubertus Rudolph von Fürstenberg-von Hohenlohe-Langenburg).

 
Screen Shot 2021-01-01 at 7.51.29 PM.png
 

With Data Comparator, we can customize the validation criteria for each data type to suit our curiosity.

 
Screen Shot 2021-01-01 at 8.00.56 PM.png
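To make the kinds of string checks described above concrete, here is a minimal sketch in plain Python. It is not the tool's implementation: the `validate_name` function, the issue labels, the nickname pattern (text in double quotes or parentheses), and the 40-character limit are all assumptions chosen for illustration.

```python
import re


def validate_name(name, max_length=40):
    """Return a list of issue labels found in a single name string."""
    issues = []

    # All caps: every alphabetic character is uppercase.
    letters = [ch for ch in name if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        issues.append("all_caps")

    # Nickname: assumed to appear in double quotes or parentheses.
    if re.search(r'"[^"]+"|\([^)]+\)', name):
        issues.append("nickname")

    # Long: valid, but worth a second look.
    if len(name) > max_length:
        issues.append("too_long")

    return issues


samples = [
    "JOHN SMITH",
    'Robert "Bob" Jones',
    "Hubertus Rudolph von Fürstenberg-von Hohenlohe-Langenburg",
    "Ann Lee",
]
for sample in samples:
    print(sample, validate_name(sample))
```

Adjusting `max_length` or swapping the nickname pattern is the code equivalent of tuning the validation criteria in the dialog.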
 

Conclusion

Data Comparator was designed to be a lightweight EDA tool that complements traditional EDA techniques rather than replacing them. The customizability and on-the-fly convenience of this tool are the selling points that keep me using it regularly. Used intelligently, Data Comparator can save a lot of time by catching these trivial data issues before they surface downstream and by sparing you from blindly auditing the data.

I intend to add some of the following updates in the future:

  • Export a “Highlights” report to HTML which can be shared with team members

  • Better search algorithm for StringType validation checks

  • A menu bar complete with a series of cool options

  • Etc.

GitHub Repo: https://github.com/culight/data_comparator


Automation Survey Visualization