{"id":308,"date":"2021-10-18T00:00:00","date_gmt":"2021-10-18T00:00:00","guid":{"rendered":"https:\/\/tac.debuzzify.com\/?p=308"},"modified":"2023-06-24T12:58:13","modified_gmt":"2023-06-24T12:58:13","slug":"data-quality-assessment","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/data-quality-assessment\/","title":{"rendered":"How to Improve Data Quality With Data Quality Assessment?"},"content":{"rendered":"\n\n\n

Data quality assessment is the continuous, systematic process of evaluating whether your data meets the required standards. These standards may be tied to your business or project goals.

The need to ensure data quality has grown as the ways we acquire data have multiplied.

Handling even a single data source can be challenging. Take a customer survey: it’s often difficult to normalize respondents’ information, even with online survey tools. Now imagine integrating and standardizing data from ERPs, CRMs, HR systems, not to mention the many different sensors we use these days. Without data quality assessments, these problems never go away.

But there is good news: our practices have evolved along with the complexities of data acquisition and management.

Data quality assessments play a crucial role in data governance. They help us identify data issues at various levels of a data pipeline. They also help us quantify the business impact and take corrective measures as soon as possible.

Poor-quality data can have serious consequences

Take, for instance, a data quality issue in the healthcare industry. Suppose a data-entry clerk duplicates a patient’s record; the patient could receive two doses of a drug instead of one. The consequences can be disastrous.

Quality issues like this can have terrible effects in any industry. But duplication is only one kind of data quality problem; there is a whole spectrum of others to worry about.

Let’s imagine you’re working on an inventory optimization problem, with stock monitored through an automated system. What happens if one of your sensors sends values at twice the intended frequency? Unreliable data will lead you to stock up on items already in the warehouse while missing out on the high-demand items.

See your data acquisition and management processes from different angles

Data quality has six dimensions: Accuracy, Completeness, Consistency, Timeliness, Validity, and Uniqueness. We can also think of accountability and orderliness as additional critical characteristics.

[Figure: the six dimensions of data quality]

The dimensions discussed here are the scales against which we evaluate the quality of our data. Maintaining 100% quality in a vast data lake is nearly impossible, so data quality tolerance is a strategic decision we must make as early as possible. But that’s for a future post.

Data accuracy

Data accuracy is the quality dimension everyone battles to get right. But what does data accuracy mean anyway?

Data accuracy is the extent to which the data at hand captures reality. The most apparent cause of inaccuracy is data entry: typos in names, wrong values for age.

But there are more disastrous issues.

NASA once lost a $125 million spacecraft. Lockheed Martin, the engineering contractor working with NASA on the mission, supplied thruster data in English (imperial) units while NASA’s navigation software expected metric units. The mismatch in measurement units led to the loss of the spacecraft.

Mismatched measurement units are a surprisingly common cause of data inaccuracy.
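
One way to guard against this class of error is to make units explicit in the data and normalize them at ingestion. Here is a minimal sketch in Python; the field names, readings, and the set of accepted units are illustrative, not taken from any specific system.

```python
# A minimal sketch of unit normalization as an accuracy safeguard.
# Field names and values are hypothetical.

LBF_S_TO_N_S = 4.4482216153  # pound-force seconds to newton seconds

def normalize_impulse(value: float, unit: str) -> float:
    """Convert an impulse reading to newton seconds regardless of source unit."""
    conversions = {"N*s": 1.0, "lbf*s": LBF_S_TO_N_S}
    if unit not in conversions:
        raise ValueError(f"Unknown unit: {unit!r}")
    return value * conversions[unit]

readings = [
    {"source": "team_a", "impulse": 12.4, "unit": "N*s"},
    {"source": "team_b", "impulse": 2.79, "unit": "lbf*s"},
]

normalized = [normalize_impulse(r["impulse"], r["unit"]) for r in readings]
print(normalized)  # both values are now comparable in newton seconds
```

Rejecting unknown units outright, rather than guessing, is the design choice that surfaces this kind of inaccuracy before it propagates.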

Completeness of data

Data completeness means your datasets have all the required information on every record. What counts as required depends on the application and business needs. For instance, a phone number is of little use to a machine learning model, whereas it’s critical for a delivery system.

Form validations and database constraints go a long way toward reducing completeness errors. Yet planning mistakes can still have a huge impact on data quality.

Data completeness is a tradeoff: the stricter you are about the fields, the fewer records you collect.

This tradeoff applies to both manual and automated data acquisition. If you make every field in a survey mandatory, you won’t get as many responses as you intend. On the automated side, say you put a constraint on GPS coordinates for a data stream coming from a remote camera, then install a set of new devices that don’t support GPS; their data will never be accepted into your data lake.

Completeness is a challenging dimension to score well on, and the complexity grows as you acquire data from more sources.
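
Measuring completeness itself is usually simple: count the share of non-null values per field and compare it against a target. A minimal sketch using pandas, with hypothetical field names and thresholds:

```python
import pandas as pd

# Hypothetical customer records with missing values.
df = pd.DataFrame({
    "name":  ["Ann", "Ben", None, "Dee"],
    "phone": ["555-0101", None, None, "555-0104"],
    "email": ["ann@example.com", "ben@example.com", "cat@example.com", None],
})

# Completeness per field: share of non-null values, as a percentage.
completeness = df.notna().mean().mul(100).round(1)
print(completeness)

# Flag fields that fall below illustrative targets.
targets = {"name": 90, "phone": 90, "email": 80}
for field, target in targets.items():
    status = "OK" if completeness[field] >= target else "BELOW TARGET"
    print(f"{field}: {completeness[field]}% (target {target}%) -> {status}")
```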

Consistency in data

Data consistency means there is no contradiction between data received from different sources. Because each source may have its own way of measuring the information, the numbers sometimes don’t match.

Say you want to find the daily sales volume of a particular product. Your inventory management system tracks sales based on remaining stock, while your POS tracks the same figure based on items sold. Returned items may sneak back into the inventory system without a corresponding record in the POS.

When the two systems are integrated, they report different numbers for daily sales volume.

In an ideal world, both systems would account for returns. In practice that’s rarely the case, given the complexities of large-scale organizations.
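
A consistency check can be as simple as joining the two feeds on a common key and flagging the rows where they disagree. A minimal sketch, with made-up numbers for the POS and inventory systems:

```python
import pandas as pd

# Hypothetical daily sales figures derived from two systems.
pos = pd.DataFrame({
    "date": ["2021-10-01", "2021-10-02"],
    "units_sold": [120, 95],
})
inventory = pd.DataFrame({
    "date": ["2021-10-01", "2021-10-02"],
    "units_sold": [120, 90],   # returns sneaked back into stock
})

merged = pos.merge(inventory, on="date", suffixes=("_pos", "_inv"))
merged["difference"] = merged["units_sold_pos"] - merged["units_sold_inv"]
inconsistent = merged[merged["difference"] != 0]
print(inconsistent)  # rows where the two systems disagree
```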

Timeliness in data quality

Data should be available at the time the system requires it. Suppose you generate a report every Friday and not all of your data has arrived yet; the gaps can seriously skew your organization’s decisions and direction.

Several factors affect the timeliness of data:

1. Network issues. Read about edge computing if you think cities have decent internet connections and there is nothing to worry about; the whole concept exists to reduce network latency.
2. Operational issues. The product returns and daily sales calculation above is a good example of a timeliness problem too.
3. Problems at the point of data collection, such as wrong data entry or malfunctioning sensors.
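
A basic timeliness (or freshness) check compares the latest record received from each source against an allowed lag. A minimal sketch; the feed names, timestamps, and the 24-hour target are all illustrative:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical "latest record received" timestamps per source.
last_seen = {
    "pos_feed":       datetime(2021, 10, 15, 23, 55, tzinfo=timezone.utc),
    "inventory_feed": datetime(2021, 10, 14, 6, 10, tzinfo=timezone.utc),
}
max_lag = timedelta(hours=24)                             # illustrative freshness target
now = datetime(2021, 10, 16, 0, 0, tzinfo=timezone.utc)   # time the report runs

for source, seen_at in last_seen.items():
    lag = now - seen_at
    status = "fresh" if lag <= max_lag else "stale"
    print(f"{source}: last record {lag} ago -> {status}")
```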

Invalid data

I finished high school about 12 years ago, but I still receive brochures from institutions targeting school children. It’s an excellent example of invalid data.

Invalid data are records that no longer carry any meaning. They fill up space with no use, and acting on them can be dangerous too.

Invalid data costs a lot, yet the rules for invalidating records are blurry in some cases. For example, how do we know whether a patient has fully recovered from a disease unless we are the doctor or the patient? Some diseases have a typical recovery window, but not all. In such cases, invalid data lingers in the data store and leads to painful (sometimes harmful) decisions.
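
Where a clear rule does exist, it can be encoded and scored directly. A minimal sketch for the school-brochure example above; the table, field names, and cutoff are hypothetical:

```python
import pandas as pd

# Hypothetical mailing list with the year each contact finished school.
contacts = pd.DataFrame({
    "name": ["Ann", "Ben", "Cat"],
    "graduation_year": [2009, 2020, 2021],
})

# Illustrative rule: a school-focused campaign only makes sense for contacts
# who have not yet finished school.
CURRENT_YEAR = 2021
contacts["still_valid"] = contacts["graduation_year"] >= CURRENT_YEAR

validity_score = contacts["still_valid"].mean() * 100
print(contacts)
print(f"Validity score: {validity_score:.0f}% of records are still relevant")
```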

Uniqueness

Uniqueness means the same information is not stored twice or more. Violations appear in two forms: duplicate records, and the same information duplicated in multiple places.

Duplicate records are often easy to spot. They simply appear more than once in the same dataset and are relatively straightforward to remove automatically.

A good practice is to impose the uniqueness constraint on a key column (or a combination of a few columns) rather than on the whole record, because repeated entries may contain fields that are no longer identical. Timestamps on transactional entries are a perfect example: the copies don’t show up as duplicates unless we de-dupe on one or a few chosen fields.

Information duplication is storing the same information in different places. For example, a patient’s age may appear in both the admissions table and the surgery table. That is simply not good design.

Duplicated information is a gateway to other quality issues. Failing to update every copy creates inconsistencies, and at that point at least one of the copies is inaccurate.

Another less apparent form of duplication is derived information. Take age and date of birth: one is enough to work out the other, and storing both creates ambiguity.
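
The key-column idea above translates directly into code. A minimal sketch with a hypothetical transaction log, using pandas drop_duplicates:

```python
import pandas as pd

# Hypothetical transaction log; the second and third rows describe the same
# order but carry different timestamps, so whole-record comparison misses them.
transactions = pd.DataFrame({
    "order_id":  ["A1", "A2", "A2"],
    "customer":  ["Ann", "Ben", "Ben"],
    "amount":    [25.0, 40.0, 40.0],
    "logged_at": ["2021-10-01 09:00", "2021-10-01 09:05", "2021-10-01 09:06"],
})

# De-duplicating on the whole record keeps the repeated order.
print(len(transactions.drop_duplicates()))                  # 3

# De-duplicating on the key column removes it.
deduped = transactions.drop_duplicates(subset=["order_id"], keep="first")
print(len(deduped))                                         # 2
```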

How to perform a data quality assessment?

We need to perform data quality assessments for every critical area of our data store. The most granular level you can go to is the field level, but you can also assess all the way up to the database level.

[Figure: steps of a data quality assessment]

Data quality assessment is an iterative process that verifies whether your data meets the required standards. Each iteration has the following six phases.

1. Define data quality targets

In the ‘define’ phase, we translate business goals into data quality targets and decide what level of quality is acceptable. These targets should be set against each of the six dimensions of data quality.

Reaching 100% is unlikely in large-scale applications, but if you’re working with smaller datasets, you can be stricter.

If you run a healthcare app that sends dosage alerts, you need to maintain a log of every dose the patient took. The timestamp field in every record is a crucial piece of information for scheduling the next dose, so it should have a threshold of nearly 100% against all six dimensions.

But if you own a cake shop and want to send a birthday card every year, your rules can be far more flexible. The address or phone number field should have a high threshold (say, about 90%) for accuracy, yet a moderate target (around 60%) for uniqueness, because people sometimes give an alternative phone number when they buy.

These thresholds also depend on the domain. As the last two examples show, the cost of a mistake at the cake shop is minuscule compared to healthcare.

These rules can apply at multiple granularities. For instance, the address column can have its own uniqueness threshold, while a record-level completeness rule can require every record to carry either a phone number or a mailing address.
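
One lightweight way to capture such targets is a plain configuration structure that later phases can read. A minimal sketch; the field names and numbers echo the two examples above and are purely illustrative:

```python
# Illustrative quality targets, expressed as acceptance thresholds (%)
# per field and dimension.
quality_targets = {
    "dose_log.timestamp": {
        "completeness": 99.9, "accuracy": 99.9, "uniqueness": 99.9,
        "consistency": 99.9, "timeliness": 99.9, "validity": 99.9,
    },
    "customers.phone": {
        "accuracy": 90.0,    # must be reachable numbers
        "uniqueness": 60.0,  # alternative numbers are tolerated
    },
}

# Record-level rules can sit alongside field-level ones, e.g. "every record
# needs a phone number or a mailing address".
record_rules = {
    "customers": lambda row: bool(row.get("phone") or row.get("address")),
}

print(quality_targets["customers.phone"]["uniqueness"])  # 60.0
```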

2. The data quality assessment

In the ‘assessment’ phase, we evaluate our datasets against the rules we defined on the six data quality dimensions. Each rule ends up with an acceptance score: the percentage of records that satisfy its conditions.

On a small dataset or database, it’s fairly easy to run these checks manually. In a vast data warehouse, however, you need some automation to verify data quality.
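
Automating the scoring mostly means expressing each rule as a function and counting the records that pass. A minimal sketch with a hypothetical customer table and two illustrative rules:

```python
import pandas as pd

# Hypothetical customer table.
customers = pd.DataFrame({
    "phone":   ["555-0101", None, "555-0103", "555-0101"],
    "address": ["1 Main St", "2 Oak Ave", None, "4 Elm Rd"],
})

# Each rule returns a boolean mask over the records.
rules = {
    "has_contact_details": lambda df: df["phone"].notna() | df["address"].notna(),
    "phone_is_unique":     lambda df: ~df["phone"].duplicated(keep=False) & df["phone"].notna(),
}

# Acceptance score = percentage of records satisfying each rule.
for name, rule in rules.items():
    score = rule(customers).mean() * 100
    print(f"{name}: {score:.0f}%")
```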

3. Analyse the assessment score

Data quality assessment does not end with scoring. The aim is to identify the business impact as early as possible and implement corrective measures, and estimating that impact is the goal of this phase.

It’s a complex exercise, and it isn’t domain-agnostic; the way one organization summarizes its assessment scores differs from the next.

But the goal of the phase is clear: find the most significant holes where data quality leaks, and fix them.
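
There is no single formula for this, but one possible sketch is to weight each rule’s shortfall against its target by an assumed business-impact factor. The checks, scores, and weights below are entirely made up for illustration:

```python
# Illustrative prioritization: shortfall below target, weighted by an
# assumed business-impact factor.
results = [
    {"check": "dose_log.timestamp completeness", "score": 97.0, "target": 99.9, "impact": 10},
    {"check": "customers.phone accuracy",        "score": 85.0, "target": 90.0, "impact": 2},
    {"check": "customers.phone uniqueness",      "score": 55.0, "target": 60.0, "impact": 1},
]

for r in results:
    r["priority"] = max(r["target"] - r["score"], 0) * r["impact"]

for r in sorted(results, key=lambda r: r["priority"], reverse=True):
    print(f'{r["check"]}: priority {r["priority"]:.1f}')
```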

4. Brainstorm for improvements

In the ‘brainstorming’ phase, we collaboratively develop ideas that could fix the gaps we found. It is best to have a unit that includes members from every team so that the plans are