A Practical Framework for Profiling Messy Business Data

Most small and medium-sized businesses are sitting on more data than they realize – exports from operational systems, spreadsheets passed between teams, flat files pulled “just in case”. The problem isn’t a lack of data or even a lack of technical skill. It’s not knowing where to start, or how to turn a messy dataset into something usable without over-engineering the solution.

This article is written for analysts who are in that in-between space: you may be comfortable in spreadsheets, or you may have dabbled in SQL, but applying those skills to a brand-new, imperfect dataset feels overwhelming. You know the data has value, but you’re missing a practical framework for assessing its quality, understanding its limitations, and creating a solution that eliminates manual processing.

In this post, I’ll walk through what a real-world data profiling exercise looks like, using a dataset that closely mirrors what you’d encounter in practice. I’ll show what I check for, why those checks matter, and how I approach a brand-new dataset to help a business make use of neglected data. Along the way, I’ll demonstrate how to perform the analysis and produce a clear, shareable report at the same time – using only basic SQL.

To make the process repeatable, I’ve also included a downloadable data profiling checklist you can use as a reference – whether you’re exploring a dataset for the first time or formalizing your own approach.

You can view a sample of our final result here: Real Estate Data Profiling Report Sample – BaroodyDataSolutions – a great example of how helpful a data profiling report can be for gaining a thorough understanding of a dataset in preparation for future analysis.

Problem

  • Raw datasets are often messy, incomplete, and hard to understand
  • Analysts need a framework for systematically assessing the quality of the data
    • understanding the flaws of the dataset
    • starting the data discovery process so the data can be put to use
  • Findings should be assembled into an easily digestible report
    • shareable, understandable, repeatable
  • Minimal coding skills required
    • basic SQL skills
    • flexibility of tooling

Approach

Before building dashboards or reports, it’s important to understand how well the underlying data can support meaningful analysis. Exploratory data analysis and profiling help us identify gaps, inconsistencies, and missing data early, before they turn into misleading metrics and other issues down the road.
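Even a few aggregate queries can quantify how complete a table is before any deeper work begins. Here is a minimal sketch, assuming the extract has been loaded into a table called leases – the table and column names are illustrative, not from the actual dataset:

```sql
-- Quantify completeness of a few key columns in one pass.
-- COUNT(column) skips NULLs, so the total minus the column
-- count gives the number of missing values.
SELECT
    COUNT(*)                                     AS total_rows,
    COUNT(*) - COUNT(lease_id)                   AS missing_lease_id,
    COUNT(*) - COUNT(start_date)                 AS missing_start_date,
    COUNT(*) - COUNT(end_date)                   AS missing_end_date,
    ROUND(100.0 * COUNT(end_date) / COUNT(*), 1) AS pct_end_date_populated
FROM leases;
```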

The process we will follow to perform this analysis is:

  1. Follow the BDS data profiling checklist (download above)
  2. Perform an assessment of each table (or file)
  3. Create a wire-frame (based on the checklist) specifically tailored to the data under evaluation
  4. Use the wire-frame to create an Evidence.dev notebook and perform the data profiling exercise
    • As the analysis is performed in Evidence, a report is created alongside it, with values, tables, and charts that paint a full picture of the quality of the data (see the sketch below)
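Evidence builds its pages from markdown files that embed named SQL queries, which is why the analysis and the report end up being the same document. A minimal sketch of such a page, again using the illustrative leases table:

````markdown
---
title: Data Profiling Report
---

```sql lease_counts
select count(*) as total_rows
from leases
```

Total lease rows: <Value data={lease_counts} column=total_rows />

<DataTable data={lease_counts} />
````

Every time the page is rebuilt, the queries re-run, so the report stays in step with the underlying data.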

Data

The example used in this project consists of fictitious flat-file extracts from a real estate management system. The dataset was intentionally chosen to reflect the kinds of conditions commonly encountered in real operational data – useful, but imperfect.

  • Tables include leases, properties, maintenance requests, tenants, etc.
  • The dataset mimics real-world data:
    • missing data
    • disorganized records
    • inconsistent formatting
    • missing links between tables

Techniques

Logical Duplicate Detection Using Business Rules

A table can pass a primary-key uniqueness check and still contain duplicates. Logical duplicates are rows that are technically distinct records but describe the same real-world event – for example, two lease rows for the same property and tenant with overlapping active dates. Key constraints won’t catch them; business rules will. The checks below outline the approach, and a SQL sketch follows the list.

  • Checking primary key uniqueness
  • Confirming entities are logically unique
    • e.g. overlapping active dates for the same property_id and tenant_id
    • catching scenarios where:
      • a lease record was re-entered instead of updated
      • multiple versions of the same lease exist
      • system issues created overlapping records
A minimal sketch of both checks, assuming a leases table keyed by lease_id with property_id, tenant_id, start_date, and end_date columns (names are illustrative):
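```sql
-- Step 1: confirm the declared primary key is actually unique.
SELECT lease_id, COUNT(*) AS copies
FROM leases
GROUP BY lease_id
HAVING COUNT(*) > 1;

-- Step 2: find logical duplicates the key check misses – two
-- "distinct" lease rows for the same property and tenant whose
-- active date ranges overlap.
SELECT
    a.lease_id   AS lease_a,
    b.lease_id   AS lease_b,
    a.property_id,
    a.tenant_id,
    a.start_date AS start_a,
    a.end_date   AS end_a,
    b.start_date AS start_b,
    b.end_date   AS end_b
FROM leases a
JOIN leases b
  ON  a.property_id = b.property_id
  AND a.tenant_id   = b.tenant_id
  AND a.lease_id    < b.lease_id     -- report each pair once
  AND a.start_date <= b.end_date     -- standard interval-overlap test
  AND b.start_date <= a.end_date;
```

Rows returned by either query are candidates for a conversation with whoever owns the source system, not just for a silent de-dup.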

Takeaway

  • Duplicates are not always obvious in standard reports
  • Verifying the logic behind the data leads to accurate reporting and can also surface problems in upstream processes

Segment Data Into Key Status Groups

Not every record carries the same weight. An expired lease missing a rent amount is a historical footnote; an active lease with the same gap is an operational problem. Segmenting records into key status groups – active, expired, future, unknown – lets you assess quality where it matters most and keeps summary metrics from blending very different populations.
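A minimal sketch of this segmentation, deriving a status from the lease dates (column names are again illustrative):

```sql
-- Derive a status for each lease from its dates, then profile
-- row counts and completeness within each status group.
WITH segmented AS (
    SELECT
        *,
        CASE
            WHEN start_date IS NULL OR end_date IS NULL THEN 'unknown'
            WHEN CURRENT_DATE < start_date              THEN 'future'
            WHEN CURRENT_DATE > end_date                THEN 'expired'
            ELSE 'active'
        END AS lease_status
    FROM leases
)
SELECT
    lease_status,
    COUNT(*)                                         AS leases,
    ROUND(100.0 * COUNT(monthly_rent) / COUNT(*), 1) AS pct_rent_populated
FROM segmented
GROUP BY lease_status
ORDER BY leases DESC;
```

The same pattern extends to any status the source system stores or that you can derive – the point is to profile each group separately before trusting any aggregate.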
