Sympathy for the learner: Abuse #1

Abuse #1: Throwing data at the learner

As data mining becomes more popular in our risky times, the profession is invariably becoming sloppy. I see this in research papers, interactions with consultants, and vendor presentations. It is not technical knowledge that I see lacking but sympathy for the learner. Many in the data mining field, for lack of a better word, abuse their learners. For those of you who are not data miners, let me give you a brief overview of what I mean by a learner. Suppose you have a collection of data and a problem (or concept) that you hope can be better understood via that data. The learner is whatever method or tool you use to learn (estimate) the concept that you are trying to describe. The learner can be a linear regression, a neural network, a boosted tree, or even a human.

One way we abuse our learners is the growing tendency to throw data at the learner with little consideration for the data’s presentation, in hopes that amidst the cloud of information the concept will magically become clear. Remember, a boosted tree knows nothing more than what is in the data. A boosted tree was not provided an education or even given the ability to read a book. Most learners have no common-sense knowledge and even forget what they learned in the previous model. Because of this, any common-sense knowledge about how the data works can provide a tremendous amount of information to the learner, sometimes even exceeding the initial information content of the data alone.

Example: Say you are trying to model the optimal coverage for an automobile insurance policy. In the data, you have the number of drivers and the number of vehicles. Common sense tells you it matters if there is a disparity between drivers and vehicles: an extra vehicle can go unused, and an extra driver can’t drive. How can a learner ‘see’ this pattern? If it is a tree, it creates numerous splits (if 1 driver and 2 vehicles do this, if 2 drivers and 1 vehicle do this, …). Essentially the learner is forced to construct a proxy for whether there are more cars than drivers. There are several problems with this: there is no guarantee the proxy will be correctly created, it makes the model needlessly complex, and it crowds out other patterns from being included in the tree. A better solution is to introduce a flag indicating more cars than drivers. Although this is a mere one-bit field, behind it is the complex reasoning as to why the disparity between drivers and vehicles matters, and therefore it contains far more information than one bit. A simple one-bit field like this can make or break a model.
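To make the flag concrete, here is a minimal sketch in Python of how it might be derived before the data ever reaches the learner. The field names (num_drivers, num_vehicles, more_vehicles_than_drivers) are made up for illustration, and the second flag for the opposite disparity is my own addition rather than part of the example above.

```python
from typing import Dict, List


def add_disparity_flags(policies: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Return copies of the policy records with derived one-bit fields.

    The field names here are hypothetical; substitute whatever your data uses.
    """
    enriched = []
    for policy in policies:
        drivers = policy["num_drivers"]
        vehicles = policy["num_vehicles"]
        row = dict(policy)  # copy so the original record is left untouched
        # The flag described in the example: an extra vehicle can go unused.
        row["more_vehicles_than_drivers"] = int(vehicles > drivers)
        # Assumed companion flag: an extra driver has nothing to drive.
        row["more_drivers_than_vehicles"] = int(drivers > vehicles)
        enriched.append(row)
    return enriched


policies = [
    {"num_drivers": 1, "num_vehicles": 2},
    {"num_drivers": 2, "num_vehicles": 1},
    {"num_drivers": 3, "num_vehicles": 3},
]
enriched = add_disparity_flags(policies)
for row in enriched:
    print(row)
```

With the flag in place, a tree can capture the disparity with a single split on one field instead of piecing it together from many splits on the raw counts.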

The presentation of the data to the learner is just as important as the data itself. What can be obvious (more cars than drivers, international versus domestic transactions) can be pivotal in uncovering complex concepts. As a data miner, put yourself in the learner’s shoes and you will find yourself giving more sympathy to the learner.