Big data? Big deal. Just another means to make poor decisions from ‘garbage-in, garbage-out’ data quality, except on a grander scale. Now, more than ever, it’s vitally important to ensure data quality to support business decisions.
Data Quality is Essential for Good Decision Making
What good are your analyses and conclusions to support decision-making if data quality is crummy? Poor data quality can produce decisions that are lemons. Here are nine ways to improve data quality, and hence the quality of your business decisions.
Data Quality Tip #1. Understand the purpose and context of your data.
Data are never free. It takes resources to collect and maintain data. Don’t squander your investment. Take the time to understand why you’re collecting each piece of data. Make sure the data has a use. Why? Unused data are unloved data. And unloved data are ignored. That means there’s a good chance that no one is paying attention to keeping the data error free, which will lead to poor or decreasing data quality. Even though a lot of data today are generated to satiate our ever-increasing appetite for bureaucracy, rules, and regulations, data ultimately should exist for a business or mission purpose. If there is no reason or justification for using the data, then why collect it in the first place?
Data Quality Tip #2. Create, maintain, and use a data dictionary.
A data dictionary is a valuable tool for documenting your data. It is your bible for ensuring data quality. Don’t have one? Get one! Here are some essential elements to include in a good data dictionary to improve data quality:
- Identify ALL data elements. Data elements are like barnacles on a ship. They accumulate over time, especially in informal data systems like spreadsheets. You can save a lot of time and money on training and on IT projects by having an up-to-date list of all your data elements. Also, what happens when Mary Jo, who originally developed the database and its associated data fields, departs the organization? Who really understands what’s in the database?
- Provide definitions of data elements. Just what does ‘pflangular momentum’ mean? I have no idea and neither will anyone else. When left to our own imagination, it’s amazing what we might think up to understand the data. That leads to interpretation errors, which indirectly reduce data quality. Provide thorough and exact definitions of your data elements so there’s no room for misinterpretation.
- Specify validation rules. What are the expected values for each data field?
- Data type (Number, Integer, Text, Date, etc)
- Mins and maxes – Is it possible to have negative sales on your project in a given month?
- Acceptable, unacceptable values – Is ‘YY’ a valid code for a state in the U.S.?
- Codification rules – What are those columns of ones and zeros in your spreadsheet data table supposed to represent? Is Invoice #245-678 randomly generated or does each number segment represent some additional information?
- Classification rules – What are the rules to decide if the data belongs in one group or another? Will this be left up to the subjective interpretation of users?
- Units – Who could forget the classic mistake made by engineers designing the Mars Climate Orbiter? Instead of using metric units for thruster performance data, the designers used imperial units. Take nothing for granted. Identify units.
- Identify authority: Where does the data originate? Perhaps this is not essential in simpler systems, but in large complex organizations, understanding the ultimate source of a data element is critical to data quality.
- Note the frequency of update – Some data elements are static and updated infrequently. Other data are dynamic, driven by events or transactions in underlying business processes.
- Describe proper uses and access – Are all the data elements equally accessible, or should some elements be treated carefully? The answers to these questions will help uncover possible user roles for access to the data (e.g., “This data field contains proprietary information not for release to a guest user”). I realize that this issue of propriety and access isn’t a data quality issue per se, but it is an important issue in data systems with different populations of users.
- Specify scope. Some data are aggregated. For example, we don’t necessarily want to know the underlying data behind the price of purchasing our new home. But we most certainly want to know what’s included in the data element labeled “price”, from a big picture perspective. Our data dictionary should provide us with information on what is included or excluded from the data element’s definition of price. Without this information, we’re left with lower data quality and the risk of comparing apples with oranges.
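The dictionary elements above can be sketched in code. Here is a minimal sketch in Python; the field names, validation rules, and the example entry are hypothetical illustrations of the idea, not a standard schema:

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class DictionaryEntry:
    """One data-dictionary entry: definition, validation rules, authority, scope."""
    name: str
    definition: str
    data_type: type
    min_value: Optional[float] = None      # mins and maxes
    max_value: Optional[float] = None
    allowed_values: Optional[Set] = None   # acceptable / unacceptable values
    units: str = ""                        # take nothing for granted
    source: str = ""                       # authoritative origin of the data
    update_frequency: str = ""             # static vs. transaction-driven

    def validate(self, value):
        """Return a list of rule violations for a candidate value."""
        problems = []
        if not isinstance(value, self.data_type):
            problems.append(f"{self.name}: expected {self.data_type.__name__}")
            return problems
        if self.min_value is not None and value < self.min_value:
            problems.append(f"{self.name}: {value} is below minimum {self.min_value}")
        if self.max_value is not None and value > self.max_value:
            problems.append(f"{self.name}: {value} is above maximum {self.max_value}")
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} is not an acceptable value")
        return problems

# Hypothetical entry: monthly sales can never be negative.
monthly_sales = DictionaryEntry(
    name="monthly_sales",
    definition="Gross sales for the project in one calendar month",
    data_type=float,
    min_value=0.0,
    units="USD",
    source="Accounting system of record",
    update_frequency="monthly",
)

print(monthly_sales.validate(-500.0))   # the negative value is flagged
print(monthly_sales.validate(1200.0))  # [] -- passes every rule
```

The payoff is that the rules live in one documented place instead of in Mary Jo’s head.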
Data Quality Tip #3. Take snapshots of your ‘static’ data.
Take snapshots of static data over time to reveal changes to data or emergent data quality issues. Have records been deleted from one update to the next? Why? Did the status of one of our employees change from single to married? Sometimes, the only way to know that something has changed is to compare the data at one point in time versus another to uncover differences. Sometimes those differences are legitimate. Other times, they’re the result of an error somewhere in the system.
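The comparison itself can be automated. A sketch, assuming snapshots are dictionaries keyed by record ID (the records and IDs below are made up):

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts of record_id -> record) and report differences."""
    deleted = sorted(set(old) - set(new))      # records that vanished -- why?
    added = sorted(set(new) - set(old))
    changed = {rid: (old[rid], new[rid])
               for rid in set(old) & set(new) if old[rid] != new[rid]}
    return deleted, added, changed

january = {101: {"name": "Pat", "status": "single"},
           102: {"name": "Lee", "status": "married"}}
february = {101: {"name": "Pat", "status": "married"}}  # record 102 is gone

deleted, added, changed = diff_snapshots(january, february)
print(deleted)  # [102] -- a record disappeared between updates
print(changed)  # record 101's status changed; legitimate, or an error?
```

Each flagged difference is a question to answer: legitimate change, or error in the system?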
Data Quality Tip #4. Clean your data.
Okay, so you didn’t take my advice in Tip #2 about building and using a data dictionary. One of the consequences is that you’ll likely get data that is factually correct, but potentially difficult to use and manipulate correctly in analytics systems. Data quality should also be concerned with usability in analytics. Here are some common issues to address to keep your data clean.
- SENtenCe caps sometimes cause searches and filters to give incorrect results
- Zip Codes (+4?)
- SSN#s and Telephone #’s (include the dashes? how about dots, how about no separators?)
- Dates. Who hasn’t enjoyed the wonderful surprise of Excel converting our date data to a floating point number when we export the data to another system? And of course, in my DoD world, we loved to specify dates in different orders, like 28 November 2012 instead of November 28, 2012.
- Abbreviations. They should be consistent or converted. 1234 Evergreen Rd and 1234 Evergreen Road should be equivalent. How much programmer time is spent on making our software understand these two addresses are the same? Why not ensure the data are consistent as it comes into our systems?
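Normalizing data as it enters the system is usually a few lines of code. A sketch of the address and phone-number cases above; the abbreviation table is a hypothetical example, not a complete standard:

```python
import re

# Hypothetical abbreviation table; a real system would use a fuller one.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def normalize_address(address):
    """Lowercase, strip punctuation, and expand abbreviations so that
    '1234 Evergreen Rd' and '1234 Evergreen Road' compare equal."""
    words = re.sub(r"[.,]", "", address).lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def normalize_phone(phone):
    """Strip dashes, dots, and spaces down to digits only."""
    return re.sub(r"\D", "", phone)

print(normalize_address("1234 Evergreen Rd") ==
      normalize_address("1234 Evergreen Road"))  # True
print(normalize_phone("555-123.4567"))           # '5551234567'
```

Running this once at intake beats every downstream programmer re-solving the same equivalence problem.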
Data Quality Tip #5. Aim for objectivity.
Here’s a classic from my days of building databases of completed software development projects. The databases contained a field to track ‘Programmer Capability’. The possible values ranged from ‘Extra Low’ to ‘Extremely High’. Imagine a software program manager populating the database exclaiming, “We’re ACME Software, our programmer capability is Extra Low!” My point? Beware of using subjective categories to stratify data. When we categorize our data, it’s especially important to use classification rules that are as objective as possible. Otherwise, there’s a great risk of human biases, inconsistent perceptions, and misinterpretations ruining data quality and rendering the data worthless.
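One way to take the self-assessment out of the loop is to derive the category from a measured quantity. A sketch: the metric (defects per thousand lines of code) and the cut points are hypothetical, but the rule, not a manager’s opinion, decides the band:

```python
def capability_band(defects_per_kloc):
    """Classify from a measured defect rate rather than a subjective rating.
    Cut points are illustrative, not an industry standard."""
    if defects_per_kloc < 1.0:
        return "High"
    if defects_per_kloc < 5.0:
        return "Nominal"
    return "Low"

print(capability_band(0.4))  # 'High'
print(capability_band(7.2))  # 'Low'
```

Two people applying this rule to the same project get the same answer, which is the whole point.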
Data Quality Tip #6. Originate data from one source.
Remember the phone game? You tell a friend one thing. Your friend relays that info to another, and that person relays it to another. At each step, error is introduced into the story. The same process can happen with your data and consequently harm data quality. Replicating, maintaining, and re-distributing local copies of data increases convenience and utility for users of the data, but it also risks increasing the errors. A similar problem occurs when the same data are generated more than once from different sources. That is inviting data conflict and data error. There’s almost always only one authoritative source or business process that is responsible for the authentic origination and creation of the data. Find that source and ensure that the data it generates are accessible by others in the organization. To adapt the carpenter’s cliché: “Generate data once, let it be used by many.”
Data Quality Tip #7. Consider removal of computed (or derived) data.
If my database contains Sales and Profit, then Profit Margin is something that is computed using the two other values. But storing the resulting computed value is redundant. It increases our storage and maintenance costs, and it potentially hides critical business logic. If, for some reason, our computation of profit margin changes, we have to re-compute and re-store that value in our databases, introducing a greater chance of injecting error. So, store the algorithm or formula, not the result.
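In code, that means the margin formula lives in exactly one function and is computed on demand. A minimal sketch using the Sales and Profit example (the figures are made up):

```python
def profit_margin(sales, profit):
    """The single place the margin formula lives: change it once,
    not in thousands of stored rows."""
    if sales == 0:
        return None  # margin is undefined with no sales
    return profit / sales

# The database row stores only the raw facts, never the derived value.
row = {"sales": 200_000.0, "profit": 50_000.0}
print(profit_margin(row["sales"], row["profit"]))  # 0.25
```

If the formula changes, no stored value goes stale, because no derived value was stored.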
Data Quality Tip #8. Be vigilant about missing data.
Call me a purist, but I think there’s something sinister about imputing data: the method of filling in missing values by sophisticated statistical guesstimating of what each value should be, based on observation of other data. It’s what your gas company does if it ‘misses’ a reading from your home’s gas meter. They’ll impute what your meter reading should have been based on readings from the same month in past years. That’s fine, because the gas company used the estimate of your missing meter reading, built from your past meter data, to make a decision about how much to charge you. However, if the gas company wanted to compute an overall average of your gas meter readings, it would be unsound to include the imputed readings in the computation. Be careful about mixing imputed data and real data in analyses that draw conclusions from the data. If the data are that important, then go back and ensure you get the observations in the first place.
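If you must impute, at least keep the provenance so analyses can exclude the guesses. A sketch of the gas-meter example; the readings and the same-month-average imputation rule are illustrative:

```python
readings = [
    {"month": "2010-11", "value": 140.0, "imputed": False},
    {"month": "2011-11", "value": 152.0, "imputed": False},
    {"month": "2012-11", "value": None,  "imputed": False},  # missed reading
]

# Impute the missing value from past same-month readings (fine for billing).
past = [r["value"] for r in readings if r["value"] is not None]
for r in readings:
    if r["value"] is None:
        r["value"] = sum(past) / len(past)  # the guesstimate
        r["imputed"] = True                 # keep the provenance

# Summary statistics that draw conclusions should not mix in the guesses.
observed_only = [r["value"] for r in readings if not r["imputed"]]
print(observed_only)  # [140.0, 152.0] -- real observations only
```

The flag costs one column and prevents imputed values from silently contaminating later analyses.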
Data Quality Tip #9. Perform regular reviews of your data to uncover anomalies.
There’s no way around it. If you want to really understand your data and ensure data quality, you have to roll up your sleeves, dive in, and get dirty in the data. Reviewing your data is like a doctor reviewing their patients. They develop an understanding of what ‘normal’ looks like for a given patient. When the patient’s readings change drastically, the doctor knows something is not right and will either make sure the readings are correct (i.e., NOT a data error) or take appropriate action based on those aberrant readings. Without a baseline understanding of the data, the doctor does not have the right level of context or sense of normality.
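A regular review can be backed by a simple baseline check. A sketch using the doctor analogy: flag readings far from ‘normal’, then a human decides whether each is a data error or a real change (the threshold and readings are illustrative):

```python
import statistics

def flag_anomalies(baseline, new_values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the
    baseline mean. Flagged values need human review, not automatic deletion."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) > threshold * stdev]

baseline = [98.4, 98.6, 98.7, 98.5, 98.6]       # the patient's usual readings
print(flag_anomalies(baseline, [98.5, 103.2]))  # [103.2]
```

The code only surfaces the aberrant reading; deciding whether it is a typo or a fever is still the reviewer’s job.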
Final Thoughts on Data Quality
Whether you’re collecting ad-hoc data for supporting a management decision, building a past performance database for future business development, or harvesting results from completed projects to innovate and improve business, it all starts with data quality. Put some forethought into your next data collection project or, if you already maintain data in your day-to-day job, consider taking a moment to pause and start building that data dictionary.
Need help defining, collecting or improving data quality? Contact Valerisys for a free no-obligation consultation.