Once upon a time, my team and I were tasked with optimizing a critical section of our data pipeline. The ultimate objective was to convert a series of SAS scripts to Python, specifically for use with PySpark. During this conversion, I found myself constantly comparing the old SAS-driven outputs with the Spark-driven outputs to validate the new code.

I repeated a similarly clunky comparison workflow for several other projects before my lazy-spidey senses told me that this was ripe for automation. The data profiling tools I researched at the time were largely pandas-based and “heavier” than I preferred, so I developed the Data Comparator tool to suit those preferences. Let’s get to some action first, and then I’ll explain a bit of my rationale behind the tool.

Loading Data

We first load some historical Olympic data scraped from www.sports-reference.com and set it up to compare all identically-named columns:

We can drag the data into the columns list or use the Select File dialog to load it. Data Comparator currently supports parquet, csv, sas7bdat, and json files (as well as Spark and Pandas DataFrames via the command line). Whatever the input type, the data is converted into a Pandas DataFrame. Smaller data files are ideal; for larger data (a few GB or more), it’s best to limit the scope of the dataset by adjusting the settings in the Input Params dialog or by loading a subset of the data.
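To make the loading step concrete, here is a minimal sketch of how the supported file types could be funneled into a Pandas DataFrame and how identically named columns could be paired up. This is not Data Comparator's actual loader: `READERS`, `load_to_pandas`, `common_columns`, and the `nrows` parameter are hypothetical names made up for illustration, and only the pandas readers themselves are real.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
# An illustration only, not Data Comparator's internal loader.
READERS = {
    ".parquet": pd.read_parquet,
    ".csv": pd.read_csv,
    ".sas7bdat": pd.read_sas,
    ".json": pd.read_json,
}


def load_to_pandas(path, nrows=None):
    """Load a supported file into a DataFrame, optionally keeping only the first nrows rows."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    df = READERS[suffix](path)
    # Note: the whole file is still read first; head() only trims the result.
    return df if nrows is None else df.head(nrows)


def common_columns(left, right):
    """Return the identically named columns, in the order they appear in left."""
    return [col for col in left.columns if col in right.columns]
```

Trimming rows after the read keeps the sketch short, but for files of a few GB it would be better to limit the read itself, for example with the `nrows` argument of `pd.read_csv`.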

Profiling

In the Data Details tab, we take a quick look at the data in tabular form and it looks reasonable. We click into the Dataset Details dialog and notice that the ID column was read as a numeric value — we may want to convert that to a string at some point.
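If we were doing that ID conversion by hand in pandas, it might look like the sketch below. The `athletes` DataFrame and its values are invented stand-ins for the loaded dataset, and this is plain pandas rather than anything Data Comparator does internally.

```python
import pandas as pd

# Made-up sample standing in for the loaded Olympic data.
athletes = pd.DataFrame({"ID": [101, 102, 103], "Name": ["A", "B", "C"]})

# An ID is a label, not a quantity, so store it as text instead of a number.
athletes["ID"] = athletes["ID"].astype("string")
```

When reading from a csv, the same effect can be had up front by passing `dtype={"ID": "string"}` to `pd.read_csv`, which also preserves any leading zeros in the IDs.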

Comparison

We make a few comparisons of the data between the 2014 Winter games and 2016 Summer games and everything seems consistent so far… until we get to the Weight column.

Are 2016 Summer participants really that much heavier than 2014 Winter participants or are the datasets using different units of measurement?
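One quick way to probe the units question is to check whether the ratio of the two medians sits close to the kilogram-to-pound factor. The sketch below is a rough heuristic, not a Data Comparator feature; `looks_like_unit_mismatch`, the 10% tolerance, and the sample weights are all assumptions made up for illustration (the numbers are not taken from the Olympic dataset).

```python
import pandas as pd

KG_TO_LB = 2.20462  # pounds per kilogram


def looks_like_unit_mismatch(left, right, tolerance=0.1):
    """Return True if the ratio of medians is within tolerance of the kg/lb factor."""
    ratio = right.median() / left.median()
    if ratio < 1:
        # Make the check symmetric: it should not matter which side is in pounds.
        ratio = 1 / ratio
    return bool(abs(ratio - KG_TO_LB) / KG_TO_LB <= tolerance)


# Made-up weights: the first series in kilograms, the second in pounds.
weights_a = pd.Series([60.0, 70.0, 80.0])
weights_b = pd.Series([132.0, 154.0, 176.0])
mismatch = looks_like_unit_mismatch(weights_a, weights_b)
```

A positive result is only a hint. Two groups of athletes can genuinely differ in weight, so the source documentation for each dataset is still the final word on units.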

We’ve set up and performed validations for the Names column and they look pretty messy: Some names in all caps, some with nicknames, and some names are valid but long (shoutout to the gloriously-named Hubertus Rudolph von Fürstenberg-von Hohenlohe-Langenburg).

With Data Comparator, we can customize the validation criteria for each data type to suit our curiosity.
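For a sense of what string checks like these involve, here is a small pandas sketch that flags the three patterns mentioned above: all-caps names, names with a quoted or parenthesized nickname, and unusually long names. The function `flag_names`, the 40-character threshold, and the first two sample names are hypothetical; Data Comparator's own validations may work quite differently.

```python
import pandas as pd


def flag_names(names, max_length=40):
    """Flag all-caps names, names with a nickname, and names longer than max_length."""
    names = names.astype("string")
    return pd.DataFrame(
        {
            "name": names,
            "all_caps": names.str.isupper(),
            # A nickname is assumed to appear in double quotes or parentheses.
            "has_nickname": names.str.contains(r'"[^"]+"|\([^)]+\)', regex=True),
            "too_long": names.str.len() > max_length,
        }
    )


sample = pd.Series(
    [
        "JANE DOE",  # made-up example
        'John "Johnny" Smith',  # made-up example
        "Hubertus Rudolph von Fürstenberg-von Hohenlohe-Langenburg",
    ]
)
flags = flag_names(sample)
```

Note that a long name is flagged for review, not rejected: as the example shows, a flagged name can be perfectly valid.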

Conclusion

Data Comparator was designed to be a lightweight EDA tool that complements traditional EDA techniques rather than replacing them. Its customizability and on-the-fly convenience are the selling points that keep me using it regularly. Used intelligently, Data Comparator can save a lot of time by catching these trivial data issues before they surface downstream and by sparing us from blindly auditing the data.

I intend to add some of the following updates in the future:

Export a “Highlights” report to HTML which can be shared with team members
Better search algorithm for StringType validation checks
A menu bar complete with a series of cool options
Etc.

GitHub Repo: https://github.com/culight/data_comparator

Many more ideas/projects are forthcoming. Let’s build some cool stuff together!

Jan 2 Improve EDA with Data Comparator

I’ve collaborated with some amazing teams to deliver:

High-quality mobile/web applications for a diverse range of clients

Robust firmware for telecommunications equipment

Implementation and support for ETL processes for an enterprise-level data pipeline

(Whatever the requirements call for)