The Dirty Data Problem

3 Challenges Companies Face as They Strive for Clean Data

“There is more data now than there has ever been at any point in history.” You can make this statement today, a week from now or two years from now, and it will always be true. At our current pace, it is estimated that we generate 2.5 quintillion bytes of data every day, and with the continued growth of IoT (Internet of Things) technologies constantly recording data, that number will multiply significantly in the years ahead.

From sensor data to financials, to social media analytics, point-of-sale data and more, companies’ data vaults are exploding with more data than they know what to do with. Cataloging and organizing this data for analysis has historically been a back-burner project for many organizations, which has led to the problem these companies now face as they look to build more sophisticated analysis capabilities: dirty data.

Fortunately, AI and analysis tools are catching up, and cleaning up data is less of a headache now than it was a couple of years ago. Still, it’s better to target the issues as early as possible, especially as your company onboards new technologies.

When people think of data cleansing, they usually think of common issues like weeding out duplicate records and fixing errors. However, even the most fastidiously kept records can suffer from innocent mistakes and oversights in the data recording process. This is especially common as companies scale in size.

Here are three of the most common issues we see as companies clean their data to achieve more meaningful analysis.


Label & Measure Consistency

Minor inconsistencies like spelling, date formatting, currency and imperial vs. metric measurements can be a pain when it comes to analyzing your data. You may end up merging one data set that is formatted MM/DD/YYYY with a data set that is formatted DD-MM-YYYY. While it’s easy enough for a human to tell the difference, the dates will need to be standardized before your database can treat them as a single field.
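As a minimal sketch of that standardization step (the two source formats here are the hypothetical ones from the example above), you can try each known format in turn and emit a single canonical one:

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Try each known source format and emit ISO 8601 (YYYY-MM-DD)."""
    # Hypothetical formats from the two merged data sets above.
    # Note: a value like "03/04/2021" is ambiguous between them, so
    # knowing which format each *source* uses is part of the cleanup.
    known_formats = ["%m/%d/%Y", "%d-%m-%Y"]
    for fmt in known_formats:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("03/14/2021"))  # MM/DD/YYYY source -> 2021-03-14
print(normalize_date("14-03-2021"))  # DD-MM-YYYY source -> 2021-03-14
```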

Similarly, minor inconsistencies in naming conventions can cause your analysis tool to read one item as two separate items. One example would be “Midwest Region” and “MIDWEST REGION”. Or perhaps it’s a minor quirk in a customer record: “Tom’s” vs. “Tom’s Foods”. Simple differences in capitalization, punctuation or innocent typos can lead to bad data outputs if you aren’t careful with the inputs entering your database. Garbage in, garbage out, as they say.
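One common approach, sketched here with the hypothetical labels from the paragraph above, is to collapse case and whitespace and then map known aliases to a single canonical name:

```python
def canonical_label(raw: str, aliases: dict) -> str:
    """Collapse case/whitespace, then map known aliases to one canonical name."""
    key = " ".join(raw.split()).casefold()
    return aliases.get(key, key.title())

# Hypothetical alias table maintained during the cleansing process
ALIASES = {"tom's": "Tom's Foods", "tom's foods": "Tom's Foods"}

print(canonical_label("MIDWEST REGION", ALIASES))   # -> Midwest Region
print(canonical_label("Midwest  Region", ALIASES))  # -> Midwest Region
print(canonical_label("Tom's", ALIASES))            # -> Tom's Foods
```

The alias table does the heavy lifting for names that simple case-folding can’t fix; it typically grows as you discover new variants in the data.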

Multi-national companies often have to deal with additional issues such as weights measured in pounds vs. kilos and international exchange rates. These are all issues that can be easily overcome in the data cleansing process, but alignment as data is being recorded is much better from a cost and efficiency standpoint.
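A weight-normalization step of the kind described above can be as small as this sketch (the conversion constant is standard; the unit labels are assumptions about how the source data is tagged):

```python
LB_PER_KG = 2.20462  # standard pounds-per-kilogram conversion

def to_kg(value: float, unit: str) -> float:
    """Normalize a weight reading to kilograms, whatever unit it arrived in."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value / LB_PER_KG
    raise ValueError(f"Unknown unit: {unit}")

print(round(to_kg(10, "lb"), 2))  # 10 lb is roughly 4.54 kg
```

Currency works the same way in principle, except the “conversion constant” is an exchange rate that changes daily, which is exactly why aligning units at recording time is cheaper than converting later.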


Unique Identifiers

Another easily overlooked data issue is the lack of a unique product ID that can be used to blend different data sources together. For many companies, especially smaller companies or those with just a few products, this is a common pain point, but it happens to companies of all shapes and sizes.

For example, a CPG company might manufacture shampoo. Let’s call the shampoo “ReVital”, for the sake of this example. Perhaps they’ve recently decided to replace the shampoo with a new shampoo + conditioner formula called “ReVital Plus”. It will still occupy the same place in the product lineup. Does the new ReVital Plus formula need a new SKU?

The answer is 100% absolutely, yes.

Without a unique identifier for the product, how could you tell the difference between the new product and the old? If you wanted to analyze how the new ReVital Plus shampoo + conditioner affected your conditioner sales, you’d need to compare conditioner sales alongside the old SKU vs. alongside the new SKU. It’s much better to discontinue the old SKU and introduce a new one than to reuse an old SKU for a new product.
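To see why the distinct SKU matters, here is a minimal sketch (the SKU codes and sales figures are hypothetical): with separate identifiers, splitting the old formula’s sales from the new one’s is a single pass over the data; with a shared SKU it would be impossible.

```python
from collections import defaultdict

# Hypothetical sales rows: (sku, units sold). Distinct SKUs let us
# separate the old ReVital formula from the new ReVital Plus.
sales = [("REVITAL-001", 120), ("REVITAL-PLUS-001", 90), ("REVITAL-001", 30)]

totals = defaultdict(int)
for sku, units in sales:
    totals[sku] += units  # aggregate per product, not per name

print(dict(totals))  # {'REVITAL-001': 150, 'REVITAL-PLUS-001': 90}
```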

Comparing Apples to Apples

The last clean data pain point arises when there is a disconnect between two sources of truth. How does this happen? Across different business units, product may be tracked differently, which can result in consistency issues that take some time and effort to correct.

For example, let’s say we make coffee beans. Sales might track by units (24 oz bags), while logistics tracks by weight (lbs). If you’re trying to figure out cost per pound in an automated analysis, you have a problem, since you aren’t comparing apples to apples. Some conversion will need to take place before the data can be analyzed.
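The conversion itself is simple once it’s made explicit. A sketch of the coffee example, with hypothetical volume and cost figures:

```python
OZ_PER_LB = 16  # standard ounces per pound

def cost_per_pound(bags_sold: int, bag_oz: float, total_cost: float) -> float:
    """Convert a unit count (bags) into pounds before dividing cost."""
    total_lbs = bags_sold * bag_oz / OZ_PER_LB
    return total_cost / total_lbs

# Hypothetical figures: 1,000 bags of 24 oz beans, $6,000 total cost
print(round(cost_per_pound(1000, 24, 6000), 2))  # 4.0 dollars per pound
```

The point is that the conversion factor (24 oz per bag) has to live somewhere the automated analysis can reach it, not just in an analyst’s head.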

Another example would be looking at inventory across an organization. If one warehouse counts your product by case, and one counts your product by item, a human can easily look at that and say, “There are 24 items per case, we shipped 100 cases, that’s 2,400 units.” When you’re trying to automate the analysis, the system can’t make that call. You have to create the data for the system that will be analyzing it. It just takes a little reverse engineering.
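“Creating the data for the system” here usually means a pack-size lookup table that encodes what the human already knows. A minimal sketch, with a hypothetical SKU and pack size matching the example above:

```python
UNITS_PER_CASE = {"REVITAL-001": 24}  # hypothetical pack-size table

def to_units(sku: str, qty: float, uom: str) -> float:
    """Express an inventory count in individual units regardless of the
    unit of measure (UOM) each warehouse reports in."""
    if uom == "unit":
        return qty
    if uom == "case":
        return qty * UNITS_PER_CASE[sku]
    raise ValueError(f"Unknown unit of measure: {uom}")

# 100 cases at 24 items per case, as in the example above
print(to_units("REVITAL-001", 100, "case"))  # 2400 units
```

The “reverse engineering” is mostly building and maintaining that pack-size table, after which every warehouse’s counts roll up into the same unit.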

If you’re starting to sweat, don’t worry. You’re not alone. Practically every company deals with these issues at one level or another at some point in time. Our team works with clients every day to solve these types of data science issues as they bring new information into Catalyst. We use a variety of tools to ensure data is clean, verified and tied out as our clients get up and running with our software. If you’re currently working through data issues and need an experienced partner to help you out, get in touch with us.

What kind of challenges are you facing when it comes to getting clean data? Do any of these challenges hit home for you? Sound off on social media now and join the conversation.
