It is widely reported that over 80% of a data scientist’s time is spent cleaning and engineering data. Great effort is put into preparing the information that will feed a data scientist’s models. Including irrelevant information or messy data in the modeling cycle can lead to models that are inaccurate or show false insights.
As such, one of the first steps of data cleaning is understanding what features or attributes will be available for the work and what type of attribute each one is. Data can be measured on four main scales: Interval, Ratio, Nominal, and Ordinal. Knowing the difference between these scales allows the data scientist to correctly structure the available data and tells them which statistical methods can be applied to said data.
Data classified on the Interval and Ratio scales are numeric, and can be either continuous or discrete. On both scales, the units of measure are separated by fixed, known distances. The key difference between the two is that Interval data have no “true zero”.
Interval data are ordered data. There is also a known distance between data points that has credible meaning. An example of this is time: we know that 30 minutes separate 1:00 and 1:30, and 30 minutes also separate 1:30 and 2:00, which satisfies the requirement of consistent distances between units of measure. It is also widely agreed that 1:00 comes before 1:30, satisfying order. However, there is no absolute zero to a variable like time. The lack of a “true zero” limits the arithmetic operations that can be applied to interval data to addition and subtraction. We can measure the number of days since a customer last made a purchase, or the number of days we’ve been dealing with the coronavirus pandemic (at least an eternity, right?), but we would not divide November by March.
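To make the clock example concrete, here is a minimal sketch using Python's standard `datetime` module. The date and the variable names are my own placeholders, not anything from a real data set. It shows that subtraction and ordering work on points in time, while division of one point in time by another is simply not defined.

```python
from datetime import datetime, timedelta

# Three clock times on the same (arbitrary, illustrative) day.
one = datetime(2020, 11, 1, 13, 0)
one_thirty = datetime(2020, 11, 1, 13, 30)
two = datetime(2020, 11, 1, 14, 0)

# Subtraction is meaningful: equal gaps on the clock are equal durations.
first_gap = one_thirty - one
second_gap = two - one_thirty
gaps_equal = first_gap == second_gap == timedelta(minutes=30)

# Order is meaningful too: 1:00 comes before 1:30, which comes before 2:00.
in_order = one < one_thirty < two

# Dividing one point in time by another is not defined at all.
try:
    one_thirty / one
    division_allowed = True
except TypeError:
    division_allowed = False

print(gaps_equal, in_order, division_allowed)
```

Note that the gaps themselves (the `timedelta` values) are durations with a real zero, so they can be divided by one another even though the clock times cannot.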
Another example of an interval variable is a student’s SAT score; numbers between 0 and 200 are not used when the College Board scales raw scores. Now, interval data can contain a zero; it is just that this “0” does not carry the meaning of a “true zero”. Temperature is a great example of this. Consider temperature in Fahrenheit: we can have 0°F, but that does not mean there is no temperature, it just means you better be wearing a winter jacket.
When you have ordered numeric data with a clear definition of zero, you are dealing with Ratio data. Classic examples of ratio data are height/weight and price. If a person has zero height or weight, they don’t exist; a customer can’t pay less than $0.00 for an item at a store.
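A quick way to see why the true zero matters is to try taking ratios. The sketch below is illustrative only: the temperatures and prices are made up, and the Kelvin conversion is something I am bringing in for contrast, since Kelvin is a temperature scale that does start at absolute zero. Dividing two Fahrenheit readings gives a number, but not a meaningful one.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Made-up readings for illustration.
cold_f, warm_f = 40.0, 80.0

# The arithmetic runs, but 80°F is not "twice as hot" as 40°F,
# because 0°F is an arbitrary point rather than "no temperature".
naive_ratio = warm_f / cold_f

# On a scale with a true zero, the same two readings are much closer.
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cold_f)

# Price has a true zero, so this ratio means what it says:
# a $20.00 item really does cost twice as much as a $10.00 item.
price_ratio = 20.00 / 10.00

print(naive_ratio, round(kelvin_ratio, 3), price_ratio)
```

The Fahrenheit ratio comes out as exactly 2, while the Kelvin ratio for the same two readings is only slightly above 1, which is the whole point: ratios only carry meaning when zero means "none".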
Measures of central tendency (mean, median, and mode) and measures of spread such as standard deviation can be computed for these numerical data measures, but it is only with the presence of the true zero in ratio measures that a wide range of descriptive and inferential statistics can be applied. Interval and Ratio levels of measurement are considered to be much more exact than their qualitative counterparts.
Qualitative data are either nominal or ordinal. Qualitative data that are ordered are considered Ordinal level and those that are unordered are considered Nominal level.
Ordinal data are usually numbered to represent rank or order in a list, but the numbers typically reflect opinion or observation and are not mathematical measures. Ordinal data are often seen as ranked responses, like from a survey. Unlike interval data, the difference between the unit responses is not necessarily uniform as gaps between non-numeric concepts can vary — e.g. the cognitive difference between neutral and agree is potentially wider than the distance between agree and strongly agree.
Ranked placement in a race is another strong example of how the gaps between ordinal ranks are not uniform. The difference between 1st and 2nd place might be a matter of 1/10th of a second while the difference between 2nd and 3rd place could be 4/10ths of a second. Because the gaps between placings are inconsistent, the distance between one placement and the next carries little meaning.
Because the gaps between ranks vary, there is only so much information that can be extracted from ordinal data. Central tendency can be captured through mode and median, but a purist will insist that a mean cannot be defined from an ordinal set. There are some measures that can be taken during data collection to maximize information gain for ordinal data, e.g. designing a customer satisfaction survey that aligns to the Likert scale, but as data shift away from continuous numerical measurement, their usefulness can be debated.
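As a rough sketch of what "mode and median, but not mean" looks like in practice, here is one way to summarize ranked survey responses with Python's standard `statistics` module. The five response labels and the sample answers are invented for illustration. The integer ranks are used only to sort the responses, never as quantities to average.

```python
from statistics import median_low, mode

# Assumed five-point agreement scale, listed from lowest to highest.
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
RANK = {label: position for position, label in enumerate(LEVELS)}

# Made-up survey answers.
responses = [
    "agree", "neutral", "agree", "strongly agree",
    "disagree", "agree", "neutral",
]

# Median: sort by rank and take the middle response.
# median_low always returns a value that actually occurs in the data,
# so it never lands "between" two categories.
ranks = sorted(RANK[response] for response in responses)
median_response = LEVELS[median_low(ranks)]

# Mode: the most frequent response, no ordering needed.
modal_response = mode(responses)

print(median_response, modal_response)
```

I chose `median_low` over `median` here because with an even number of responses the plain median would average the two middle ranks, which quietly assumes the gaps between categories are equal.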
Nominal measures are often considered to be the lowest level of data classification as they provide the least information. Nominal data is often seen in binaries (sale was made online or not), categories (a customer’s identified race), and sets of things (e.g. types of tomatoes — cherry, Roma, San Marzano…). Nominal data can have numbers assigned to them — e.g. 1 for Female, 2 for Male, 3 for Other, 4 for Unspecified — but unlike with ordinal data, the numbers do not reflect that one category is better than another. Mode is the measure of central tendency for nominal data; other quantitative descriptions are not appropriate for measuring these unordered categorical features.
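Here is a small sketch of why the mode (along with counts and proportions) is the safe summary for nominal data. The sample values are invented, and the second coding scheme is a hypothetical alternative to the 1 to 4 coding mentioned above. Because the numeric codes are arbitrary labels, the "mean" changes when the labels are reassigned, while the mode does not.

```python
from collections import Counter
from statistics import mean

# Made-up sample of a nominal attribute.
genders = ["Female", "Male", "Female", "Other", "Unspecified", "Female", "Male"]

# Frequency, mode, and proportions are all well defined.
counts = Counter(genders)
modal_category, modal_count = counts.most_common(1)[0]
proportions = {category: n / len(genders) for category, n in counts.items()}

# Two equally valid ways to assign numbers to the same categories.
coding_a = {"Female": 1, "Male": 2, "Other": 3, "Unspecified": 4}
coding_b = {"Female": 4, "Male": 3, "Other": 2, "Unspecified": 1}

# The "mean" depends entirely on which arbitrary coding was chosen,
# so it says nothing about the data themselves.
mean_a = mean(coding_a[g] for g in genders)
mean_b = mean(coding_b[g] for g in genders)

print(modal_category, modal_count, mean_a, mean_b)
```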
Nominal data is interesting in that it has a few subcategories. One subcategory is “nominal with order” — this describes data that has some order but is not ranked — hot/warm/cold. This is difficult to separate from ordinal. Then there is “nominal without order” — this would be something like eye color or race where there is grouping into unique categories. Finally, there are “dichotomous” nominal data. These data are further broken down into either binary data or discrete or continuous dichotomous variables. Binary variables are variables assigned either a 0 or 1, e.g. Heads(0) or Tails(1) for a coin flip. Discrete dichotomous variables are where there is no possible outcome other than one of two options, e.g. whether the dress was blue and black or white and gold. Continuous dichotomous variables represent outcomes where there are possibilities in between, e.g. if a house is within 5 miles of the town center it is considered to be in a metropolitan area, but if it is more than 5 miles from town center, it is listed as a rural property.
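The "continuous dichotomous" case can be sketched as a simple threshold on a continuous measurement. The function name, the sample distances, and the choice to treat exactly 5 miles as "within" are my own assumptions for illustration.

```python
def classify_property(miles_from_center: float, cutoff_miles: float = 5.0) -> str:
    """Collapse a continuous distance into one of two categories."""
    return "metropolitan" if miles_from_center <= cutoff_miles else "rural"


# Made-up distances from the town center, in miles.
distances = [0.8, 4.9, 5.1, 12.0]
labels = [classify_property(d) for d in distances]

print(labels)
```

Notice how much information is lost: houses at 4.9 and 5.1 miles end up in different categories, while houses at 5.1 and 12.0 miles end up in the same one.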
While numeric data can have arithmetic measures applied, measurements for nominal and ordinal data types are largely reduced to such measures as proportionality and frequency of distribution of the attributes they inform us on. Even with this reduced capacity, there are tests that can be applied if the data scientist has a sound understanding of the data they are working with. I will aim to dive into that in my next post!