Artificial Intelligence

Inference Theory of Learning (ITL) models are concerned with how knowledge is interpreted; the validity of knowledge obtained; the use of prior knowledge; what knowledge can be derived at a given point given prior knowledge; and how learning goals and their structure influence the learning process. The theory assumes learning is a process in which agents are guided by a set of goals and use past experience and knowledge to help reach those goals. Some alternative theories are Computational Learning Theory (also known as Statistical Learning Theory, a simpler theory not concerned with multi-strategic learning goals) and Multi-strategy Task-adaptive Learning (models that focus on the cognitive powers of the agents). ITL algorithms are at the core of many data mining techniques. However, remember that we do not know for certain that this is how humans learn, so be careful when drawing parallels between a model's behavior and human behavior. For data miners, the goal is not to model humans but to come up with powerful tools for analyzing patterns in data, and the last thing you want when presenting results is a debate about how humans think; that is the third rail in any data mining discussion. While a data miner does not need to be an expert in theories of learning, it is very useful to be at least aware of them when reading papers and evaluating new techniques.

2. Learning Goal

The learning goals are the key to any ITL process.  Without a set of goals (which may contain only one goal), agents (or learners) would just be compiling information with no selection or inference over that information.  How the agent uses the knowledge space, which is the totality of its beliefs, to benefit itself is defined in the set of learning goals. Four questions can define a learning goal.

a) What part of prior knowledge is relevant?

b) What knowledge to acquire and of what form?

c) How to evaluate the knowledge?

d) When to stop learning?

3. Learning Methodology

New knowledge can enter in three forms:

a) Derived knowledge (deductive transformation): knowledge generated by deduction from past knowledge inputs.

b) Intrinsically new knowledge: knowledge from an external source, such as a teacher, or from analysis of past knowledge.

c) Pragmatically new knowledge (managed knowledge): knowledge that cannot be obtained within the space and time defined by the knowledge space.

4. ITL algorithms

Examples of ITL algorithms:

Neural Networks

Genetic Algorithms

Some Monte Carlo techniques with a feedback loop.

Support Vector Machines (Statistical Learning Theory)


Genetic Algorithms


Genetic Algorithms (GA) have been around for a while but have not seen widespread adoption in data mining circles as a primary modeling technique.  They are widely used in academic research to simulate evolutionary environments and human behavior.  In economics, many papers have shown GA are effective in simulating how learning affects the economy.  In these papers GA are used to model how people learn and adapt to changes in an economic system.

In data mining, GA are mainly used in combination with other techniques, such as ANN, or as a voting mechanism.  Here, they can be quite powerful in choosing the best parameters or model.  As a primary modeling technique, such as for predicting income, they have been less successful.  If you remember genetics from your biology class, GA will seem familiar to you.  Genetic algorithms were developed following the Inference Theory of Learning (ITL).

Key features of a GA

1. Heterogeneous beliefs across agents.

2. Information requirement for agents is minimal.

3. Natural model for experimentation by agents with alternative rules.

4. Heterogeneity allows parallel processing of information.

1. Overview

The basic principles of heredity are at the core of GA. In GA, genes represent knowledge or a belief, and the evolutionary process picks the best belief(s). The fitness of a belief is determined by its forecast error.  A select percentage of beliefs are eliminated based on their fitness. Key to GA is the fact that knowledge, good or bad, is quickly shared throughout the knowledge space. An agent's beliefs are typically represented by bit strings. Below is an example of a simple belief bit string, X = (X1, X2, …, Xn).

Position 4 indicates 1000

Position 3 indicates 100

Position 2 indicates 10

Position 1 indicates 1

So,  the bit string  0010  indicates a belief of 10.
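Under the positional weights listed above, decoding a belief bit string can be sketched in Python (decode_belief is an illustrative name, not from the source):

```python
def decode_belief(bits):
    """Decode a belief bit string: position k (counting from the right,
    starting at 1) carries weight 10**(k-1)."""
    n = len(bits)
    return sum(int(b) * 10 ** (n - 1 - i) for i, b in enumerate(bits))

print(decode_belief("0010"))  # a 1 in position 2 -> prints 10
```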

2. Learning Methodology

a) Cross Over

Cross Over is when two agents share beliefs (genes); it can be viewed as simulated mating.  First, two newborn agents are randomly chosen to mate.  With some probability agents will perform crossover of their beliefs.  Crossover is achieved by dividing each agent’s beliefs at a random point and then swapping pieces.  This creates two new bit strings that represents new agents with new beliefs.

X = (X1, X2, …, Xn) → X′ = (X1, …, Xi, Yi+1, …, Yn)

Y = (Y1, Y2, …, Yn) → Y′ = (Y1, …, Yi, Xi+1, …, Xn)


Before Mating:

Agent 1’s  belief: 0010  = 10

Agent 2’s belief: 0001  = 1


Exchange segments 2 and 3.

After Mating:

Agent 3’s belief: 0000  = 0

Agent 4’s belief: 0011  = 11
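The tail swap written in the formulas above can be sketched as a single-point crossover (the worked example swaps an interior segment instead; the helper name and the fixed cut point are illustrative):

```python
import random

def crossover(x, y, point=None):
    """Single-point crossover: cut both bit strings at the same point
    and swap the tails, producing two new beliefs."""
    if point is None:
        point = random.randint(1, len(x) - 1)  # cut strictly inside the string
    return x[:point] + y[point:], y[:point] + x[point:]

child1, child2 = crossover("1100", "0011", point=2)
print(child1, child2)  # prints 1111 0000
```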

b)  Mutations

To achieve an even more random new belief, you can take the crossed-over belief and mutate it.  Mutation involves randomly changing a bit value to a new value, in this case from zero to one or from one to zero.

e.g.  X′ = (X1, …, Xi−1, Z, Xi+1, …, Xn), where Z is the mutation.

Agent 3’s belief: 0000  = 0 -> 1000  = 1000
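A minimal sketch of this bit-flip mutation (the helper name and the fixed index are illustrative):

```python
import random

def mutate(bits, i=None):
    """Flip one bit, chosen at random, from zero to one or one to zero."""
    if i is None:
        i = random.randrange(len(bits))
    flipped = "1" if bits[i] == "0" else "0"
    return bits[:i] + flipped + bits[i + 1:]

print(mutate("0000", i=0))  # flipping the leftmost bit prints 1000
```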

c) Inversion

Another approach to creating new information is to invert existing strings. Inversion takes a randomly chosen segment of an agent's bit string and reverses it.  This method, like mutation, does not require the newborns to interact with one another while still allowing new information to be generated from a newborn's belief.

e.g.  X′ = (X1, …, Xi−1, Xj, Xj−1, …, Xi, Xj+1, …, Xn), i.e., the segment from position i to j is reversed.
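A minimal sketch of segment inversion (indices are illustrative):

```python
def invert_segment(bits, i, j):
    """Reverse the segment of the bit string from index i to j inclusive,
    leaving the rest untouched."""
    return bits[:i] + bits[i:j + 1][::-1] + bits[j + 1:]

print(invert_segment("0110", 1, 3))  # reversing "110" prints 0011
```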

3. Selection

After the crossover, mutation, and inversion processes are complete, each newborn has a new belief. Once the new information has entered the system, each newborn holds two beliefs: the one received from the fitter parent and the one it has created itself.  The newborn then applies its learning goal: the belief that would have yielded the highest utility in the previous period becomes the newborn's forecast rule. One cycle of the learning process is now complete.  The cycle is repeated until a stopping criterion (such as a minimum forecast error) is met or the maximum number of cycles is reached.
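One full learning cycle, repeated until the stop criterion, can be sketched under illustrative assumptions (a hypothetical target forecast of 10, fitness measured by absolute forecast error, and a population of eight four-bit beliefs; none of these specifics come from the source):

```python
import random

random.seed(0)
TARGET = 10  # hypothetical value the agents are trying to forecast

def decode(bits):
    # position k (from the right, starting at 1) carries weight 10**(k-1)
    return sum(int(b) * 10 ** (len(bits) - 1 - i) for i, b in enumerate(bits))

def error(bits):
    return abs(decode(bits) - TARGET)  # forecast error: smaller is fitter

def evolve(pop, max_cycles=100):
    for _ in range(max_cycles):
        pop.sort(key=error)
        pop = pop[: len(pop) // 2]              # eliminate the least fit half
        while len(pop) < 8:                     # newborns fill the gap
            x, y = random.sample(pop[:4], 2)
            point = random.randint(1, len(x) - 1)
            child = x[:point] + y[point:]       # crossover
            if random.random() < 0.2:           # occasional mutation
                i = random.randrange(len(child))
                child = child[:i] + ("1" if child[i] == "0" else "0") + child[i + 1:]
            pop.append(child)
        if min(error(b) for b in pop) == 0:     # stop criterion: perfect forecast
            break
    return min(pop, key=error)

best = evolve(["0000", "1000", "0001", "0100", "1100", "0111", "1010", "0011"])
print(decode(best))
```

Because the fittest belief always survives the cull, the best forecast error never worsens from cycle to cycle.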

Artificial Neural Networks

1. Intro
Artificial Neural Networks (ANN) are an attempt to simulate how the mind works. They were developed using the connectionist approach to computer learning.   They gained popularity among forecasters due to their ability to model non-linearities and interactions across variables.  ANN are used for a variety of purposes, from forecasting the stock market to pattern recognition to compression algorithms.  Critics of ANN often decry them as a black box.  In reality, the workings of an ANN model are viewable, but their foundation is not in statistics but in artificial intelligence, which has a more limited audience.  ANN have a tendency to overfit, so use of a holdout sample and intelligent forecasting practices is required.

2. Overview
ANNs simulate how neurons operate in the brain by using a network of artificial neurons organized in layers, with one input layer and one output layer. Artificial neurons (nodes) are simple processing elements that typically have one response value and one to many input values. A neuron is trained to minimize prediction error by modifying how it responds to its data elements; a data element may be the response of other neurons from a higher layer.  The simplest ANN has two layers, an input and an output layer.  The two layers are connected via weights, which are adjusted to minimize forecast error.  A model with one input layer and no hidden layer is very similar to a simple regression model.  The hidden layers are all layers between the input and output layers.  The name is misleading: they are not truly hidden, and their weights can be viewed.

[Figure: a Multi-Layer Perceptron with one hidden layer]

3. Classification
There are many ways to classify neural networks and no consensus among researchers as to the best method. Below, I briefly cover three types of classification for ANN.  These classifications overlap.

a) Learning features

1) Supervised: these ANN fit a model to predict outputs from inputs.  This is analogous to a regression model, where the researcher chooses a dependent variable. The output of the model is a fitted or predicted value.

2) Unsupervised: the ANN has no desired output.  They can be viewed as a form of data reduction, like cluster analysis, that finds patterns among variables across observations.

b) Layer Structure

1) Single-layer Perceptron

Single-layer networks are an early attempt at ANN with no hidden layer.  The inputs are fed directly to the output using a series of weights.
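Feeding inputs to the output through a series of weights can be sketched as follows (the step activation and the AND-style weights are illustrative, not from the source):

```python
def forward(inputs, weights, bias=0.0):
    """Single-layer network: the output is a weighted sum of the inputs
    passed through a step activation."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0

# A hypothetical unit wired to compute logical AND of two binary inputs
print(forward([1, 1], [0.5, 0.5], bias=-0.7))  # prints 1
print(forward([1, 0], [0.5, 0.5], bias=-0.7))  # prints 0
```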



2) Multi-layer Perceptron

A Multi-layer Perceptron consists of at least three layers: input, hidden, and output. This allows the system to model interactions across variables as well as non-linearities.


c) Network Structures

1) Feed Forward 

In feed forward models the data flows directly from the input to the output.

2) Feed Back

Feedback models allow the output of the model to influence itself, feeding back into the system.  This is a powerful means of dealing with issues such as serial correlation.

3) Kohonen Self-Organizing Network

This is an unsupervised learning algorithm.

4. Common ANN Training Functions

a) Error Backward Propagation

The most widely used training algorithm is backward propagation (BP).  Backward propagation works by repeatedly looping through a learning cycle (in which the neuron weights are recalculated), readjusting the neuron weights and importance.  In each iteration you calculate a scaling factor (an adjustment to the weights to better match the desired output) and assign an error to each neuron (the error is also called the blame and is used to adjust the neuron's importance).  It is called backward propagation because information obtained at the output node, the final node, is propagated backward through the structure.
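For a single sigmoid neuron, one such learning cycle reduces to a forward pass, a blame assignment, and a weight adjustment. A minimal sketch, with made-up data and learning rate:

```python
from math import exp

def sigmoid(s):
    return 1.0 / (1.0 + exp(-s))

def train_step(weights, inputs, target, lr=0.5):
    """One BP update for a single sigmoid neuron: compute the output,
    assign the blame (error scaled by the activation gradient), and
    scale the weights toward the desired output."""
    out = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    blame = (target - out) * out * (1 - out)
    return [w + lr * blame * x for w, x in zip(weights, inputs)]

w = [0.0, 0.0]
for _ in range(200):                  # repeated learning cycles
    w = train_step(w, [1.0, 1.0], 1.0)
print(sigmoid(w[0] + w[1]))           # output has moved toward the target of 1
```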

Common activation functions:

Signum function
Step function
Sigmoid function
Hyperbolic tangent (tanh) function
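These activation functions can be written directly (a minimal sketch; tanh comes from Python's math module):

```python
from math import exp, tanh

def step(s):    return 1.0 if s >= 0 else 0.0
def signum(s):  return (s > 0) - (s < 0)        # -1, 0, or 1
def sigmoid(s): return 1.0 / (1.0 + exp(-s))    # squashes output into (0, 1)
# tanh squashes output into (-1, 1)

print(step(0.3), signum(-2.0), sigmoid(0.0), tanh(0.0))  # prints 1.0 -1 0.5 0.0
```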

b) Radial Basis Function

Radial Basis Function (RBF) networks are used primarily for image recognition. They are similar to BP but with more restrictive assumptions on learning, resulting in faster computation.  They can have only one hidden layer.

c) Probabilistic

Probabilistic neural networks (PNN) are also similar to BP but make only one pass through the data and therefore compute much faster. They are also mainly used for images, and in cases where rapid response is necessary, such as robotics.

d) Recurrent Neural Network

Recurrent Neural Networks (RNN) are a type of feedback ANN. They are useful where serial correlation exists or the data is noisy.

e) Self-Organizing Feature Maps

SOFM are a type of Kohonen Self-Organizing Network ANN.

5. Building a model

a. Preparing the data
For many neural network packages, it is required to do the following:

1. Input variables must be bounded between [0, 1] if using a sigmoid function.
2. Binary inputs are transformed to [0.1, 0.9]. This is because 0 and 1 are at the extremes of the activation function and convergence may not be possible.
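Both preparation steps can be sketched as simple scaling helpers (function names are illustrative):

```python
def scale_input(x, lo, hi):
    """Scale a raw input with known range [lo, hi] into [0, 1] for use
    with a sigmoid activation."""
    return (x - lo) / (hi - lo)

def scale_binary(b):
    """Map a 0/1 input to 0.1/0.9, keeping it off the extremes of the
    activation function where convergence can stall."""
    return 0.9 if b else 0.1

print(scale_input(50, 0, 100), scale_binary(1), scale_binary(0))  # prints 0.5 0.9 0.1
```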

b. Settings
1. Number of hidden layers

The hidden layers sit between the input and output layers.  Typically, for modeling linear processes, one hidden layer is sufficient.  Image recognition typically requires many hidden layers.  Too many hidden layers can result in overfitting.  In practice, unless the model warrants overfitting, like data compression, stick with one hidden layer.

2. Number of neurons for each hidden layer

One critical choice is the number of hidden layers and the number of neurons in each layer. A triangle formation is a good place to start: the first hidden layer has the same number of neurons as there are input nodes; the next hidden layer has half as many neurons, and so forth.
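The triangle formation can be sketched as a helper that halves the neuron count layer by layer (the name and defaults are illustrative):

```python
def triangle_layers(n_inputs, n_hidden_layers=2):
    """Triangle formation: the first hidden layer matches the input
    count, and each subsequent hidden layer has half as many neurons."""
    sizes, n = [], n_inputs
    for _ in range(n_hidden_layers):
        sizes.append(max(1, n))  # never shrink a layer below one neuron
        n //= 2
    return sizes

print(triangle_layers(8, 3))  # prints [8, 4, 2]
```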

6. Example Code (R)
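The heading calls for R; as a stand-in, here is a rough Python sketch of a Multi-layer Perceptron trained with backward propagation on a toy XOR problem. All data, layer sizes, and the learning rate are made up, and the binary values use the [0.1, 0.9] mapping described in the data-preparation step.

```python
import random
from math import exp

random.seed(1)

def sigmoid(s):
    return 1.0 / (1.0 + exp(-s))

# Toy XOR data with binary values mapped into [0.1, 0.9]
DATA = [([0.1, 0.1], 0.1), ([0.1, 0.9], 0.9), ([0.9, 0.1], 0.9), ([0.9, 0.9], 0.1)]

N_HIDDEN = 3
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(N_HIDDEN)]  # 2 inputs + bias
w_out = [random.uniform(-1, 1) for _ in range(N_HIDDEN + 1)]                  # hidden + bias

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hid]
    out = sigmoid(sum(w_out[i] * h[i] for i in range(N_HIDDEN)) + w_out[-1])
    return h, out

def total_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in DATA)

before = total_error()
LR = 0.5
for _ in range(5000):
    for x, target in DATA:
        h, out = forward(x)
        blame_out = (target - out) * out * (1 - out)  # blame at the output node
        # blame propagated backward to each hidden neuron
        blame_h = [blame_out * w_out[i] * h[i] * (1 - h[i]) for i in range(N_HIDDEN)]
        for i in range(N_HIDDEN):
            w_out[i] += LR * blame_out * h[i]
            w_hid[i][0] += LR * blame_h[i] * x[0]
            w_hid[i][1] += LR * blame_h[i] * x[1]
            w_hid[i][2] += LR * blame_h[i]
        w_out[-1] += LR * blame_out

print(total_error() < before)  # training reduced the forecast error
```

With no hidden layer this network could not fit XOR at all, which is why the sketch uses a hidden layer of three neurons.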