More Than Reports

Dirty Data

In this episode, we are going to be talking about dirty data. This is a topic that I think every company on the planet must deal with in one way or another. It’s something that’s critical when it comes to capturing the value that we’re going after with Reportopia solutions. We need to face this head on. First, what is dirty data? How do we define it? What are the different types of it? After we answer those questions, we’re going to move into what “dirty data” means for reporting. What’s the impact of dirty data on your reporting solution and ultimately on the ability to capture value? Finally, we’re going to talk about how we deal or decide not to deal with dirty data issues.

I want to start by saying that every company out there has some level of data quality issues. It’s across the board. Having done this for a couple of decades now, I can confidently say that there is not a company on the planet of any significant size is going to have some type of dirty data to deal with unless you’re just getting started out and you haven’t started recording any data. We can’t let dirty data stop us from capturing value opportunities or improving our business processes. And I’ve seen that happen. I think it was last year, we were working with a prospect and getting ready to move forward with building their solution. The company decided to not move forward, unfortunately. Their reason was that they didn’t feel like their data source was clean enough. They just felt like they had data quality issues and didn’t want to move forward with the solution. That’s really a shame, because this is all about capturing value.

If we see that there is a value opportunity out there and we can quantify that value, then that amount of value is what determines how much investment is appropriate to capture the value, right? It’s not as if we should say there’s a hundred thousand dollars in value out there, but I’ve got dirty data over here. It’s going to cost me $5,000 to clean this thing up. So, I’m not going to go after the hundred-thousand-dollar opportunity, that doesn’t make sense. It also doesn’t mean you should turn your head to the dirty data issue because it is a real thing.

The other part of the value opportunity capturing scenario is that sometimes dirty data could prevent you from capturing a hundred percent of that value. Maybe we just simply decide we’re going to capture 80% or 90% of the value, and we’re going to let the other 10% go because it would be too expensive to do the things that need to be done in order to capture that final 10%. So, you can look at this in several different ways.

Another way to look at the cost of dealing with dirty data is it will often have some impact on multiple, maybe many value opportunities. If you fix this problem, it could cascade into helping you capture value opportunities in several different ways. The point is don’t let dirty data stop you from building your Reportopia. That would be a catastrophe. Dirty data is a problem, but it’s not an insurmountable problem.

A couple of years ago, I wrote a blog post just as an aside, it was called the dirtiest of all dirty jobs, data warehousing. I’m still in the trenches. I still work with clients every day and I still see these problems time and time again. I wrote this post and outlined a few of the typical issues that we see. It’s kind of a rant, I guess you can say. It uses Mike Rowe’s dirty jobs show, which I love that show by the way, it went off air, but there’s now a new season, I believe. I haven’t seen it yet, but I believe there’s a new season that’s starting in 2022. Anyway, this blog post goes through a few of the common types of problems that we see in these business applications when it comes to dirty data, you can go check it out, but I’m going to outline a few of those things now.

First, what is dirty data? Dirty data can take several different forms, there’s technical definitions, but I’m not going to go into that all. I’m talking to the business owners and the process managers. I want to talk about this from your standpoint, what exactly are we saying when we say your tech guy comes back to you and says, I’m sorry, but I can’t pull this report together the way you want, because it has dirty data. What exactly are they saying? They could be saying a few different things.

First, we have a category of dirty data that’s related to what I’m going to call unvalidated data entry. This is where you have your business application. It provides you with some free form text entry, it could be anything, it could be typing in somebody’s name. It could be typing in an address. It could be typing in a note, it could any of those things. That free form type of entry where someone can just type whatever they want will result in people typing whatever they want. That’s obvious. When you get enough people involved, you’re going to see that free form types of entry are going to have everything you can imagine. Some of them will have nothing. Some of them will have just a dash. Some will have a space. Some will have all kinds of weird characters. People will copy and paste stuff there. A lot of it will be what you would expect, but there will be unvalidated data. If we try to use that information in reporting, we’re going to see all those dirty data values.

Another type of dirty data is what I call the multiple business system problem. This is a very common issue. You have multiple business systems and they’re tracking something in common, but they’re not integrated. A typical example is if you have your ERP system, which has a list of all your customers in it, and then you have a CRM (customer relationship management system), and it also has a list of customers in it or prospects in it, which turn into customers. But the two are not technically integrated. In other words, someone’s at some point typing in a person’s information, both into the CRM and into the ERP system. Well, when that happens, you don’t have any real reliable way of tying these two systems together because there’s no common key. Instead, you’re left to try to match them using other types of attributes, like the person’s first name plus last name. And there’s all sorts of problems with that. This ends up being what I would call a dirty data issue. You could also call it a data integration issue, but in the end, you’ve got information that’s going to be very difficult with a hundred percent reliability to solve that problem.

Another category that results in dirty data is what I am going to call business process changes. For example, you can have a business process like appointment setting. Your business sets an appointment with someone. If you just have an appointment that you’re setting and you’re naming those appointments, you may even have a validated dropdown list in your application that allows you to select the appointment name. That works great. People can only select the appointment name that is available in the dropdown. It’s a valid value. A good business application is tracking those values with a unique identifier in the background. It works wonderfully. If at some point your business process changes such that you change those appointment type names, maybe you started with three and now you want to have 10. Now you’ve got a point in time where value means one thing, and then another point in time where that same value means something else. It’s a business process change that is not uncommon at all. Often, we can identify when these changes occur and deal with it, but it does create a dirty data issue. You can imagine trying to report on top of this information and you’re going to see very large changes at the time that these business processes were implemented.

Another type of dirty data issue is what I’ll call a data migration issue. This happens when you see acquisitions and mergers. The acquisitions result in data migrations and internal business application changes. Let’s talk about this for a second. If your business is acquiring another entity, there’s a good chance that other entity already has a business application in place. You’re going to want that information, hypothetically. You’re going to want that information migrated over to your business application, so you can deal with them in the same way you deal with your current business. It just makes sense to do that migration. What happens in these migrations is you end up having to map the data that was captured in the original business application, to the elements that are available in the new application. Ultimately that’s not a one-to-one mapping. It just isn’t going to be because these are two different business applications in this case. Even if it’s the same business application, you’re still going to have problems because they will be configured differently in most cases. So, you end up making compromises. You decide that you’re going to take a subset of the source data and push it into your business application. You try to perform transformations so that the source data aligns to your current business applications values. That all ends up with some level of dirty data issues.

That same situation exists when you change business applications. We were talking about mergers and acquisitions, but the same situation will happen if you try to move from the system that you’re on. If your system is insufficient to deal with the problems and the challenges that you’re faced with, you decide to move to another system. You need to move your original data into this new system. So that’s going to end up with some level of dirty data.

Another category is what I’m just going to call the logical gaps. You could call this the poorly designed business application database section of dirty data issues. It’s not so much that the business-facing people have done something wrong or made some compromise in how data is being entered. It’s more that the business application itself has some challenges. It wasn’t designed to deal with certain issues. One of those things is simply overriding values when you would expect that values are being retained through time.

For example, you may have a customer that lives in New York, and then you sell that customer something. Then later that customer moves to California. So now this customer at one point in time, lived in New York when they bought the original item from you, and then later, they moved to California, and they bought something else from you. If you want to know how much sales volume you have coming out of New York and how much you have coming out of California, you will have to know point in time that person lived in New York and bought widget A and then moved to California and bought widget B. To know that your business application would have to record this address information, in this example, at point in time.

Oftentimes we find that business applications do not store that level of history, instead, they just override the values in place so that customer address is represented as our current address. If you ignore that problem, and we start looking at sales by location, well, now all the New York sales look like they happen in California. There are solutions to this as well, but that is an example of a logical gap.

Another example is, what I would call poor design. That’s just not handling keys properly. This is a bit technical, but not keying things properly, such that we must end up relying on text to join between different entities within a database and overall, poorly designed application. We’ll wrap that all up into poorly designed applications. Most applications are not this way. Most applications from my view, and we work with a lot of applications, have a decent handle on how to design their database so that relationships are handled properly, and constraints are in place to make sure that the overall database has integrity. Nevertheless, there are certain cases where you run into logical gaps. So, that’s not every possible scenario where dirty data can happen, but those are the big ones.

What does this mean for reporting? We know we have data quality issues or dirty data issues. What does that mean? It may be obvious, but I’m going to say it anyway. It could mean anything from nothing to a very critical business impacting error. If we’re talking about that example that I gave, where we’re trying to look at sales and figure out where the sales occurred and we’re only tracking current address, well, what’s the actual impact? How do we quantify the impact of that? First, while it is serious, that problem is only going to rise when someone buys from point A and moves to point B. And how many of the customers that you’re dealing with have that situation?

If we say that 20% of your customer base has that situation where they’ve bought from more than one location. Well, now we’re starting to understand how critical this is from a reporting standpoint. Then we ask ourselves, how important is that if 20% of the people moved from point A to point B? How critical is that? What is the impact on the reporting? I don’t know the answer to that question. It’s only you that can answer that question on how important it is to your reporting. That’s one way.

Another example, I might have mentioned this last week, but I had an issue with a client. It is still kind of an ongoing issue. We have a process where we were posting credit card payments for a customer, we get the payments in, we get some evidence of a payment, and then we post that information. We end up posting to this client’s financial system. What happened is interesting, we ended up getting dirty data. In this case, someone that’s making a payment is entering the order number that they’re paying on. So, they’re entering an order number. They should have this order number on their invoice, they type it in and then provide their credit information to make the payment. What happens a lot of times is the person will only enter their account number. It is hard to determine what patient account are we talking about versus which order they’re paying on. Also, in some cases, we have order numbers that have an equivalent account number. So, if someone has an account number 5437, there happens to also be an order number 5437.

We know that people are putting in the wrong information. They’re just typing it in. This is the unvalidated data entry type of situation. How do we know if they’re providing us with an account number or invoice number? So, we could post directly to the provided order number. That’s the example of the problem that this dirty data entry can create.

What is the actual implication on the value opportunity? Well, it’s insignificant. We’re only talking about a couple hundred dollars. This is a very edge case. In one way you could say this looks bad. We don’t want to post a payment to the wrong account, on the wrong order. But on the other side of things, if you remove the automation and you start taking on all the overhead required to manage this process internally, you might be talking about a 10 X difference. In other words, if you just let this error continue, which is not what we decided to do, by the way, we were able to find a way to resolve this. But let’s just say, we let this error continue and you were ending up posting payments to the wrong account. What is the implication of that? Well, ultimately somebody is going to tell you that they made a payment, and that payment was not posted to their account. This means someone’s going to have to go research and figure it out. They will see a payment from that person, but it was not posted to their account.

Let’s just say, they’re going to write that amount of money off. Well, in this case, the amount of money we’re talking about is insignificant. It’s just a credit card payment for things like copays and that kind of stuff. If that was the cost to the business for this error, the value opportunity being captured outweighs that cost. It just makes sense to go ahead and accept it if that was the situation. What this means to reporting? It could be all sorts of things.

It could be from nothing to very serious problems. I would say we always got to look at what exactly means to the value opportunity we’re trying to capture. You’re going to find problems. Inevitably, there are going to be problems of all the different types that we mentioned. You’re going to have to deal with some of them. It doesn’t make sense to just say, “We’re not going to go after this, a hundred-thousand-dollar value opportunity because we have this $10,000 problem.” You want to capture the $90,000 and then decide that $10,000 problem is worth dealing with.

How do we deal with dirty data issues? Well, I’ve broken this down into few different categories. The first category is what I’m going to call, stick your head in the sand method. We just don’t deal with it. That may be the right answer. We’re not going to deal with it. We’re going to accept the impact of this dirty data. From a business user standpoint, if you’re trying to run a process and make something more efficient, you may knowingly accept that the report data you’re getting. The data you’re getting on your reports is not 100% accurate or 100% complete. It may be acceptable to live with.

Another way of dealing with this, if you’re getting this report that is inaccurate is to simply correct data in the source. We call business applications “systems of record” on my side of the world. A business application is a system of record and data originates in systems of record. In any data element, if you find that you have a report that’s showing a value that’s unexpected and is known to be incorrect, you can go back into the source system, correct it there and then the next day, your report is accurate. That’s one way of dealing with it. Basically, letting the end users deal with it via the system of record.

Another way of dealing with it is just to avoid the problem. If we know there’s dirty data in a data element, and that data element is not critical to capturing the value opportunity we’re working with. Let’s just ignore it. Let’s not deal with it at all. Let’s avoid that element all together. This whole section is about not dealing with the problem. We’re not going to deal with it basically. This can be not dealing with it because we’re going to ignore it. We know it exists, but we’re going to ignore it or we’re going to deal with it via the system of record. We’re going to correct the system of record, which is wonderful. We love it when that’s a possibility or it could mean that we didn’t even know it existed.

It gets to be a little sticky, because if you don’t know that dirty data exists and you just have one bad element of data in your reports, that creates a serious problem for the value opportunity. If you remember last week, we talked about trust and it’s important that users can trust these reports. When they find that there’s some incorrect value in a report, that diminishes their trust. That can have a direct impact on the ability to capture the value we’re faced. So, we need to be careful when we say we’re just going to not deal with the problem.

The next category of how we deal with dirty data is what I’m going to call the knee jerk reaction. This is where a business user will look at a report and find that there’s some unexpected value. They’re wondering why the report is saying this. We know this is incorrect. Then they ask for it to be corrected. And that creates this fire drill situation. The report is wrong. It must be a technical problem. Send it back to the data team, get them to correct this report. The data team investigate it and decide it is a data quality issue. They can trace it back to the source system. The business user wants this information to look a certain way, not the way that the system of record’s storing it. They want it to look at a different way, the way they want it to look. That is the fire drill approach or the knee jerk approach.

This is not a good way to solve the problem, because it’s not sustainable. It doesn’t solve the core problem. Whenever we just say, oh, there’s unexpected values here. It’s not something that we want to see in reports. Let’s go deal with it on a one-by-one basis in our data repository. That’s not a good approach. It does happen though. The knee jerk reaction approach falls into two categories as well. You do have people that do the knee jerk approach and they’re genuinely surprised by these data values that are unexpected. Then you have people that do the knee jerk reaction that are aware of the problems that exist in the source system. But for some reason they just felt like, probably from not knowing how the process works, their problem would be solved somehow in the reporting layer or in the data solution layer.

The knee jerk reaction is quite common, unfortunately, it does happen. When it happens, it’s important that we not avoid the problem. We do need to take care of the immediate problem, but also reset. We need to go look at what is a sustainable solution. What is the expected result here? And let’s design something that works.

Another way of handling this problem is through exception reporting. The third way on how we deal with dirty data. The exception report is like what I was talking about in the knee jerk reaction, but instead of being knee jerk, it’s saying we know we have a problem with data. We know we have some data values, and we know how to deal with them. We’re going to do so in some systematic way, but in order to prevent the loss of trust in our reporting system, we need to get ahead of this. In other words, we need to flag these dirty data issues in such a way that we can impact them before the end users that use these reports make bad decisions or lose faith in reports due to these problems. This is done often through exception reporting.

This can take several different forms, but just like the name implies, we’re going to find these exceptions. It could be unexpected values; it could be anything. It’s dirty data. It just really depends on the certain situation, but we’re going to take these unexpected scenarios, and put them into a report of their own. We’re going to put them into a separate report and we’re going to get that report into the hands of the people that know how to deal with those problems. Who are those people? It may be data stewards. It might be data entry folks. It could be a technical person. Whomever it is, we’re going to identify these exceptions and we’re going to knowingly deal with them, so that we don’t cause our end users to lose faith in reporting. That’s exception reporting.

Another way is what I’ll call the technical solution. This is a whole domain of work. A whole level of expertise that can be the discussion of a whole podcast series. But this technical solution, it’s just that. The scenario is that we know we have data quality issues. We’ve got dirty data and we want to find a way to solve those as problems. There’s a lot of approaches to this, all the way from taking your source system and improving it, so it does have the proper validation. It does have the proper design. It has the proper configuration to support the things that your business wants it to do in a reliable way. That’s the technical solution. Getting a business application that supports what we’re trying to do. Or maybe your business application does support it, but it’s not configured in a way that is doing so. That would be a good scenario. We could just change a configuration and make things work the way we want.

The source system may be one way of solving the problem. Another way of solving this is through some type of data quality solution. This gets into the master data management, maybe reference data management and some type of data quality solution. These are solutions where we can take in this dirty data, apply some set of rules and those rules could be of any different form. It could be mappings where if a field equals A, then we translate it to B. OR if this field equals B, let’s leave it as B. It’s just some set of mapping rules can be in place.

It could be more complicated. If we know an address needs to be conformed to some type of standard, we can programmatically take an address that has been provided to us and bounce it off some type of a programmatic interface to correct that address. There’s a lot of different things we can do, but this whole area is the MDM or master data management/data quality technical solution area.

All these are ways of dealing with dirty data. I want to go back to the value opportunity. What exactly are we talking about here? We’re talking about capturing some value opportunity and for my business owner, business process manager, from your perspective, we’re trying to capture some value. We’ve trying to make something run more efficiently. We’re trying to cut costs. We’re trying to increase revenue. We’re trying to increase quality. We’re trying to reduce risk. All these dirty data issues need to be kept in perspective of that value opportunity we’re trying to capture.

The worst thing that can happen is not trying to use our data to capture a value opportunity because we know there’s problems. And again, like I said, when we started this episode, every company has data quality problems. It’s just the way things are. We’ve got to accept that. Some are more severe than others. Sometimes we create big messes and we’ve got to deal with that. But sometimes it’s not that big of a deal.

It’s important to look at the value opportunity, figure out the size of that opportunity. Figure out the importance of the data quality issues that we’re looking at and make reasonable decisions on how to deal with them. In the end, we can deal with these dirty data issues. Some of them are bigger than others, but we always have a way of dealing with them. It’s important that we keep that value opportunity insight and focus on that.

About the Podcast

In More Than Reports, Paul and Alex explore the many ways in which data technology can be used by small and midsize businesses to promote transformational growth. We talk openly about the opportunities and challenges companies face as they try to make sense of an increasingly complex environment. Together with our expert guests we break the facts away from the hype and never lose sight of our target: value.