Classifying data for successful modeling
This tutorial discusses the natural characteristics of data in general. Understanding these characteristics helps you classify data appropriately when doing data modeling or data mining. It is written for data modelers and data miners.
What is data?
Let us begin by defining what data is. Data are values of qualitative or quantitative variables belonging to a set of items. Simply put, a piece of data is an attribute, property or characteristic of an object. Note that data can be both qualitative (brown eye color) and quantitative (20 cm long).
A common way of representing a set of related data is a table-like structure made up of rows and columns. In such structures, the columns generally represent attributes (also called characteristics or features), and each row (tuple) represents the set of related feature values belonging to one single item.
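As a minimal sketch of this row-and-column view, the Python snippet below builds a tiny table by hand. The column names and the two people in it are made up purely for illustration.

```python
# Hypothetical table: each column is an attribute, each row describes one item.
columns = ("name", "eye_color", "height_cm")
rows = [
    ("Alice", "brown", 165),
    ("Bob", "blue", 180),
]

# Pair the column names with each row (tuple) to get one record per item.
records = [dict(zip(columns, row)) for row in rows]

# Reading down a column gives every value of a single attribute.
eye_colors = [record["eye_color"] for record in records]

print(records[0])
print(eye_colors)
```

Note that the same table already mixes a qualitative attribute (eye_color) with a quantitative one (height_cm).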
When speaking about data, it is important to understand how data differs from similar terms such as information and knowledge. A set of data can be used together to derive information directly, whereas knowledge or wisdom is usually derived indirectly. In our previous article on learning data mining, we gave an example to illustrate the differences between data, information and knowledge. Using the same example, consider a store in a local market that sells hundreds of candles to its customers every Sunday. Which customer bought candles on which date: those are the data stored in the store's database. These data give information such as how many candles the store sells per week, which may be valuable for inventory management. This information can in turn be used to infer, indirectly, that people who buy candles every Sunday go to church to offer a prayer. Now that's knowledge - it's a new learning based on the available information.
Another way to look at the distinction is to consider the level of abstraction involved. Data are objective and thus have the lowest level of abstraction, whereas information and knowledge are increasingly subjective and involve higher levels of abstraction.
In more scientific terms, one may say that data have a higher level of entropy than information or knowledge.
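The step from data to information in the candle example can be sketched in a few lines of Python. The sales records below are invented for illustration; only the idea of aggregating raw rows into weekly totals comes from the text.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw data: (customer, purchase date, candles bought).
sales = [
    ("cust_01", date(2023, 6, 4), 2),   # a Sunday
    ("cust_02", date(2023, 6, 4), 1),   # a Sunday
    ("cust_01", date(2023, 6, 11), 2),  # a Sunday
    ("cust_03", date(2023, 6, 14), 4),  # a Wednesday
]

# Information: candles sold per (ISO year, ISO week), derived directly from the data.
weekly_totals = defaultdict(int)
for _customer, day, quantity in sales:
    iso_year, iso_week = day.isocalendar()[:2]
    weekly_totals[(iso_year, iso_week)] += quantity

# More information: how many candles were sold on Sundays (weekday() == 6).
sunday_total = sum(qty for _c, day, qty in sales if day.weekday() == 6)

for week, total in sorted(weekly_totals.items()):
    print(week, total)
print("sold on Sundays:", sunday_total)
```

The knowledge step (Sunday buyers are probably churchgoers) is not something this code can produce; it is an interpretation a person makes from the aggregated figures.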
Types of Data
One of the fundamental things you must learn before attempting any kind of data modeling is that how we model the data depends completely on the nature, or type, of the data. Data can be either qualitative or quantitative, and it is important to understand the distinction between the two.
Qualitative data are also called categorical data, as they represent distinct categories rather than numbers. In dimensional modeling, they are often termed "dimensions". Mathematical operations such as addition or subtraction do not make sense on such data.
Examples of qualitative data are eye color, ZIP code, phone number etc.
Qualitative data can be further classified into the following two classes, nominal and ordinal:
Nominal data are data whose order does not carry any meaningful information. Consider your passport number: it tells us nothing if your passport number is greater or smaller than someone else's. Similarly, consider people's eye colors: it does not matter in which order we list them.
ID, ZIP code, phone number, eye color etc. are examples of the nominal class of qualitative data.
For ordinal data, the order of the values is important. Consider the height of people expressed as tall, medium or short. Although these values are qualitative, their order does matter, in the sense that they carry comparative information. Similarly, letter grades, ratings on a scale of 1-10 etc. are examples of ordinal data.
In the field of dimensional modeling, this kind of data is sometimes referred to as non-additive facts.
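The practical difference between the two qualitative classes can be sketched in Python as follows. The sample values and the HEIGHT_RANK mapping are assumptions made for this illustration; the rank numbers themselves carry no meaning beyond their order.

```python
from collections import Counter

# Nominal: only equality checks and frequency counts are meaningful.
eye_colors = ["brown", "blue", "brown", "green"]
most_common_color = Counter(eye_colors).most_common(1)[0][0]

# Ordinal: categories carry a rank, so sorting and comparison make sense.
# HEIGHT_RANK is a hypothetical mapping; only the relative order matters.
HEIGHT_RANK = {"short": 0, "medium": 1, "tall": 2}
heights = ["tall", "short", "medium", "short"]
sorted_heights = sorted(heights, key=HEIGHT_RANK.__getitem__)
tallest = max(heights, key=HEIGHT_RANK.__getitem__)

print(most_common_color)
print(sorted_heights)
print(tallest)
```

Sorting the eye colors would also run without error, but the result would only be alphabetical order, which says nothing about the eyes themselves.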
Quantitative data are also called numeric data, as they represent numbers. In the dimensional data modeling approach, these data are termed "measures".
Examples of quantitative data are the height of a person, the amount of goods sold, revenue etc.
Quantitative attributes can be further classified as follows.
The interval class is used where there is no true zero point in the data and the division operation does not make sense. Bank balance, temperature on the Celsius scale, GRE score etc. are examples of interval class data. Dividing one GRE score by another GRE score does not make any sense.
In dimensional modeling this is synonymous with semi-additive facts.
The ratio class applies to data that have a true "zero" and where division does make sense. Consider revenue, length of time etc. These measures are generally additive.
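A short Python sketch shows why division is misleading on an interval scale but safe on a ratio scale. The temperatures and revenue figures are arbitrary illustrative values.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin, which has a true zero."""
    return celsius + 273.15

# Interval data: differences are meaningful, ratios are not.
warm_c, cool_c = 20.0, 10.0
difference = warm_c - cool_c      # a 10-degree gap, valid on either scale
naive_ratio = warm_c / cool_c     # 2.0, an artifact of where Celsius puts zero
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Ratio data: revenue has a true zero, so the ratio survives a change of unit.
revenue_a, revenue_b = 200.0, 100.0
revenue_ratio = revenue_a / revenue_b
revenue_ratio_in_cents = (revenue_a * 100) / (revenue_b * 100)

print(naive_ratio, round(kelvin_ratio, 4))
print(revenue_ratio, revenue_ratio_in_cents)
```

The same two temperatures give a ratio of 2.0 in Celsius and roughly 1.035 in Kelvin, so "twice as hot" is not a meaningful statement about Celsius readings, whereas "twice the revenue" holds in any currency unit.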
The table below illustrates the operations that are possible on the various data types.
It is essential to understand the above differences in the nature of data and to choose an appropriate model to store them. Many of our analytical tools (e.g. MS Excel) and data mining tools (e.g. R) do not automatically understand the nature of the data, so we need to model the data explicitly for those tools. For example, R provides two test functions, "is.numeric()" and "is.factor()", to determine whether the data are numeric or categorical (dimensional) respectively, and if the default attribution is wrong we can use functions like "as.factor()" or "as.numeric()" to re-attribute the nature of the data.
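The same idea can be sketched outside R. The Python snippet below is only a loose analogue: the helper names is_numeric and as_categorical are hypothetical, written for this illustration, and are not part of R or of any Python library. It shows ZIP codes that a tool has loaded as numbers being re-attributed as category labels.

```python
from collections import Counter
from numbers import Number

def is_numeric(values):
    """Hypothetical check: True if every value is a number (booleans excluded)."""
    return all(isinstance(v, Number) and not isinstance(v, bool) for v in values)

def as_categorical(values):
    """Hypothetical conversion: treat each value as a label by making it a string."""
    return [str(v) for v in values]

# ZIP codes often arrive as integers, although they are nominal data.
zip_codes = [10001, 94105, 10001, 60601]
print(is_numeric(zip_codes))       # the tool's default attribution: numeric

# Re-attribute them so that only counting and equality remain natural.
zip_labels = as_categorical(zip_codes)
print(is_numeric(zip_labels))
zip_counts = Counter(zip_labels)
print(zip_counts.most_common(1))
```

Once the values are labels, an accidental sum or average of ZIP codes is no longer possible, which is the point of declaring the data type explicitly.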