Project AGI: September 2016

Typical results from our experiments: Some active cells in layer 3 of a 3 layer network, transformed back into the input pixels they represent. The red pixels are positive weights and the blue pixels are negative weights; absence of colour indicates neutral weighting (ambiguity). The white pixels are the input stimulus that produced the selected set of active cells in layer 3. It appears these layer 3 cells collectively represent a generic '5' digit. The input was a specific '5' digit. Note that the weights of the hidden layer cells differ from the input pixels, but are recognizably the same digit.

We are running a series of experiments to test the capabilities of the Region-Layer component. The objective is to understand to what extent these ideas work, and to expose limitations both in implementation and theory.

Results will be posted to the blog and written up for publication if, or when, we reach an acceptable level of novelty and rigor.

We are not trying to beat benchmarks here. We’re trying to show whether certain ideas have useful qualities - the best way to tackle specific AI problems is almost certainly not an AGI way. But what we’d love to see is that AGI-inspired methods can perform close to state-of-the-art (e.g. deep convolutional networks) on a wide range of problems. Now that would be general intelligence!

Dataset Choice

We are going to start with the MNIST digit classification dataset, and perform a number of experiments based on that. In future we will look at some more sophisticated image / object classification datasets such as LabelMe or Caltech_101.

The good thing about MNIST is that it’s simple and has been extremely widely studied. It’s easy to work with the data and the images are a practical size - big enough to be interesting, but not so big as to require lots of preprocessing or too much memory. Although the images are only 28x28 pixels, variations in digit appearance give considerable depth to the data (see the example digit '5' above).

The bad thing about MNIST is that it’s largely “solved” by supervised learning algorithms. A range of different supervised techniques have reached human performance and it’s debatable whether any further improvements are genuine.

So what’s the point of trying new approaches? Well, supervised algorithms have some odd qualities, perhaps due to the narrowness of training samples or the specificity of the cost function. One example is the discovery of “adversarial examples” - images that look easily classifiable to the naked eye but that a trained network misclassifies, because they exploit weaknesses in trained neural networks.

But the biggest drawback of supervised learning is the need to tell it the “correct” answer for every input. This has led to a range of techniques - such as transfer learning - to make the most of what training data is available, even if not directly relevant. But fundamentally, supervised learning is unlike the experience of an agent learning as it explores its world. Animals can learn without a tutor.

However, unsupervised results with MNIST are less widely reported. Partly this is because you need to come up with a way to measure the performance of an unsupervised method. The most common approach is to use unsupervised networks to boost the performance of a final supervised network layer - but in MNIST the supervised layer is so powerful it’s hard to distinguish the contribution of the unsupervised layers. Nevertheless, these experiments are encouraging because having a few unsupervised layers seems to improve overall performance, compared to all-supervised networks. So, beyond easing the limited data problem of supervised learning, unsupervised learning actually seems to add something.

One possible method of capturing the contribution of unsupervised layers alone is the Rand Index, which measures the similarity between two clusterings of the same data (for example, cluster assignments versus digit labels). However, the Rand Index assumes each sample belongs to exactly one cluster, whereas we are intending to use a distributed representation where there will be overlap between similar representations - that’s one of the features of the algorithm!
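As a small worked illustration of the Rand Index mentioned above, here is a minimal sketch in plain Python. It counts, over all pairs of samples, how often two hard labelings agree on whether the pair belongs together. The function name rand_index and the toy labelings are hypothetical and are not part of our experiments.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two hard labelings agree.

    A pair counts as an agreement when both labelings put the two samples
    in the same group, or both put them in different groups.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same samples")
    if len(labels_a) < 2:
        raise ValueError("need at least two samples to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total

# Group names do not matter, only which samples are grouped together.
identical_up_to_renaming = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partly_different = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```

Note that each sample carries exactly one label in each labeling, which is why this measure does not map directly onto overlapping sets of active cells.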

So, for now we’re going to go for the simplest approach we can think of, and measure the correlation between the active cells in selected hidden layers and each digit label, and see if the correlation alone is enough to pick the right label given a set of active cells. If the concepts defined by the digits exist somewhere in the hierarchy, they should be detectable as features uniquely correlated with specific labels...

Note also that we’re not doing any preprocessing of the MNIST images except binarization at threshold 0.5. Since the MNIST dataset is very high contrast, hopefully the threshold doesn’t matter much: It’s almost binary already.
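The binarization step can be sketched as below. This is a minimal illustration assuming pixel values already scaled to the range 0 to 1 and assuming that pixels strictly above the threshold become 1 (the treatment of values exactly at 0.5 is our assumption here). The helper fraction_near_binary is hypothetical, included only to show one way of checking how close an image already is to binary.

```python
import numpy as np

def binarize(image, threshold=0.5):
    """Map grey levels in [0, 1] to {0, 1}; strictly above threshold -> 1."""
    image = np.asarray(image, dtype=np.float64)
    return (image > threshold).astype(np.uint8)

def fraction_near_binary(image, margin=0.1):
    """Fraction of pixels within `margin` of 0 or 1 (hypothetical helper)."""
    image = np.asarray(image, dtype=np.float64)
    near = (image <= margin) | (image >= 1.0 - margin)
    return float(near.mean())

# Toy 2x3 array standing in for a 28x28 MNIST digit scaled to [0, 1].
toy = np.array([[0.0, 0.2, 0.95],
                [1.0, 0.5, 0.6]])
binary = binarize(toy)
near_binary = fraction_near_binary(toy)
```

If most pixels are already near 0 or 1, moving the threshold should change few of them, which is the intuition behind hoping the exact threshold doesn't matter much.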

Sequence Learning Tests

Before starting the experiments proper, we conducted some ad-hoc tests to verify that the features of the Region-Layer are implemented as intended. Remember, the Region-Layer has two key capabilities:

Classification … of the feedforward input, and
Prediction … of future classification results (i.e. future internal states)

See here and here to understand the classification role, and here for more information about prediction. Taken together, the ability to classify and predict future classifications allows sequences of input to be learned. This is a topic we have looked at in detail in earlier blog posts and we have some fairly effective techniques at our disposal.

We completed the following tests:

Cycle 0,1,2: We verified that the algorithm could predict the set of active cells in a short cycle of images. This ensures the sequence learning feature is working. The same image was used for each instance of a particular digit (i.e. there was no variation in digit appearance).
Cycle 0,1,...,9: We tested a longer cycle. Again, the Region-Layer was able to predict the sequence perfectly.
Cycle 0,1,2,3, 0,2,3,1: We tested an ambiguous cycle. At 0, it appears that the next state can be 1 or 2, and similarly, at 3, the next state can be 0 or 1. However, due to the variable order modelling behaviour of the Region-Layer, a single Region-Layer is able to predict this cycle perfectly. Note that first-order prediction cannot predict this sequence correctly.
Cycle 0,1,2,3,1,2,4,0,2,3,1,2,1,5,0,3,2,1,4,5: We tested a complex graph of state sequences and again a single Region-Layer was able to predict the sequence perfectly. We also were able to predict this using only first order modelling and a deep hierarchy.
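The ambiguity in these cycles can be checked with a simple lookup table. The sketch below is not the Region-Layer mechanism: it is a plain fixed-order context table, with hypothetical function names, that records which states follow each context in a repeating cycle. A cycle is predictable at a given order if every context of that length has exactly one successor.

```python
from collections import defaultdict

def successors(cycle, order):
    """Map each context (the last `order` states) to the set of next states.

    The cycle is treated as repeating, so indices wrap around.
    """
    n = len(cycle)
    table = defaultdict(set)
    for i in range(n):
        context = tuple(cycle[(i - k) % n] for k in range(order - 1, -1, -1))
        table[context].add(cycle[(i + 1) % n])
    return dict(table)

def is_predictable(cycle, order):
    """True if every context of length `order` has a single successor."""
    return all(len(nxt) == 1 for nxt in successors(cycle, order).values())

simple = list(range(10))                  # cycle 0,1,...,9
ambiguous = [0, 1, 2, 3, 0, 2, 3, 1]      # the ambiguous cycle from the tests

first_order_table = successors(ambiguous, 1)   # e.g. state 0 is followed by 1 or 2
simple_ok = is_predictable(simple, 1)
ambiguous_by_order = {k: is_predictable(ambiguous, k) for k in (1, 2, 3)}
```

In this tabular sense the ambiguous cycle needs more than first-order context, which is the behaviour the variable order modelling is meant to provide.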

After completion of the unit tests we were satisfied that our Region-Layer component has the ability to efficiently produce variable order models of observed sequences using unsupervised learning, assuming that the states can reliably be detected.

Experiments

Now we come to the harder part. What if each digit exemplar image is ambiguous? In other words, what if each ‘0’ is represented by a randomly selected ‘0’ image from the MNIST dataset? The ambiguity of appearance means that the observed sequences will appear to be non-deterministic.

We decided to run the following experiments:

Experiment 1: Random image classification

In this experiment there will be no predictable sequence; each digit must be recognized solely based on its appearance. The classic experiment is used: Up to N training passes over the entire MNIST dataset, followed by fixing the internal weights and a single pass to calculate the correlation between each active cell in selected hidden layer[s] and the digit labels. Then, a single pass over the test set recording, for each test input image, the most highly correlated digit label for each set of active hidden cells. The algorithm gets a “correct” result if the most correlated label is the correct label.

Passes 1-N: Train networks

Present each digit in the training set once, in a random order. Train the internal weights of the algorithm. Repeated several times if necessary.

Pass N+1: Measure correlation of hidden layer features with training images.

Present each digit in the training set once, in a random order. Accumulate the frequency with which each active cell is associated with each digit label. After all images have been seen, convert the observed frequencies to correlations.

Pass N+2: Predict label of test images.

Present each digit in the testing set once, in a random order. Use the correlations between cell activity and training labels to predict the most likely digit label given the set of active cells in selected Region-Layer components (they are arranged into a hierarchy).
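Passes N+1 and N+2 can be sketched as follows. This is a minimal illustration under stated assumptions: we assume Pearson correlation between each cell's binary activity and each label's binary indicator, and we assume the predicted label is the one with the largest summed correlation over the active cells. The text above does not fix either choice, and the function names are hypothetical.

```python
import numpy as np

def label_correlations(activity, labels, num_labels=10):
    """Correlate each cell with each label over a pass of the training set.

    activity: (n_samples, n_cells) binary array of active cells.
    labels:   (n_samples,) integer digit labels.
    Returns an (n_cells, num_labels) array of Pearson correlations.
    """
    activity = np.asarray(activity, dtype=np.float64)
    one_hot = np.eye(num_labels)[np.asarray(labels)]
    a = activity - activity.mean(axis=0)
    b = one_hot - one_hot.mean(axis=0)
    cov = a.T @ b / activity.shape[0]
    denom = np.outer(activity.std(axis=0), one_hot.std(axis=0))
    # Cells (or labels) that never vary carry no information: report 0.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

def predict_labels(activity, correlations):
    """Pick, per sample, the label most correlated with its active cells."""
    scores = np.asarray(activity, dtype=np.float64) @ correlations
    return scores.argmax(axis=1)

# Toy data: cell 0 fires for label 0, cell 1 for label 1, cell 2 always fires.
train_activity = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
train_labels = [0, 0, 1, 1]
corr = label_correlations(train_activity, train_labels, num_labels=2)
predicted = predict_labels([[1, 0, 1], [0, 1, 1]], corr)
```

The always-active cell ends up with zero correlation, so it does not influence the predicted label.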

Experiment 2: Image classification & sequence prediction

What if the digit images are not in a random order? We can use the English language to generate a training set of digit sequences. For example, we can get a book, convert each character to a two-digit number and select random appropriate digit images to represent each digit of that number.
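One way the text-to-digits conversion could look is sketched below. The character coding is an assumption made purely for illustration (lowercase letters map to 00 through 25, space to 26, and all other characters are skipped); the post does not specify the actual mapping. The names ALPHABET, text_to_digits and sample_images are hypothetical, and images are represented by placeholder identifiers.

```python
import random
import string

# Assumed coding: 'a'..'z' -> 00..25, ' ' -> 26.
ALPHABET = string.ascii_lowercase + " "

def text_to_digits(text):
    """Turn text into a flat list of digits, two digits per character."""
    digits = []
    for ch in text.lower():
        code = ALPHABET.find(ch)
        if code < 0:
            continue  # assumed: characters outside the alphabet are skipped
        digits.extend(divmod(code, 10))  # (tens, units)
    return digits

def sample_images(digits, images_by_label, rng):
    """Pick a random exemplar image for each digit in the sequence."""
    return [rng.choice(images_by_label[d]) for d in digits]

# Placeholder image pool: three fake image identifiers per digit label.
pool = {d: [f"img{d}_{k}" for k in range(3)] for d in range(10)}
digit_sequence = text_to_digits("ab c")
image_sequence = sample_images(digit_sequence, pool, random.Random(0))
```

The digit order stays deterministic for a given text, while the image standing in for each digit varies from one occurrence to the next.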

The motivation for this experiment is to see how the sequence learning can boost image recognition: Our Region-Layer component is supposed to be able to integrate both sequential and spatial information. This experiment actually has a lot of depth because English isn’t entirely predictable - if we use a different book for testing, there’ll be lots of sub-sequences the algorithm has never observed before. There’ll be uncertainty in image appearance and uncertainty in sequence, and we’d like to see how a hierarchy of Region-Layer components responds to both. Our expectation is that it will improve digit classification performance beyond the random image case.

In the next article, we will describe the specifics of the algorithms we implemented and tested on these problems.

A final article will present some results.

Figure 1: The Region-Layer component. The upper surface in the figure is the Region-Layer, which consists of Cells (small rectangles) grouped into Columns. Within each Column, only a few cells are active at any time. The output of the Region-Layer is the activity of the Cells. Columns in the Region-Layer have similar - overlapping - but unique Receptive Fields - illustrated here by lines joining two Columns in the Region-Layer to the input matrix at the bottom. All the Cells in a Column have the same inputs, but respond to different combinations of active input in particular sequential contexts. Overall, the Region-Layer demonstrates self-organization at two scales: into Columns with unique receptive fields, and into Cells responding to unique (input, context) combinations of the Column's input.

Introducing the Region-Layer

From our background reading (see here, here, or here) we believe that the key component of a general intelligence can be described as a structure of “Region-Layer” components. As the name suggests, these are finite 2-dimensional areas of cells on a surface. They are surrounded by other Region-Layers, which may be connected in a hierarchical manner; and can be sandwiched by other Region-Layers, on parallel surfaces, by which additional functionality can be achieved. For example, one Region-Layer could implement our concept of the Objective system, and another Region-Layer the Subjective system. Each Region-Layer approximates a single Layer within a Region of Cortex, part of one vertex or level in a hierarchy. For more explanation of this terminology, see earlier articles on Layers and Levels.

The Region-Layer has a biological analogue - it is intended to approximate the collective function of two cell populations within a single layer of a cortical macrocolumn. The first population is a set of pyramidal cells, which we believe perform a sparse classifier function of the input; the second population is a set of inhibitory interneuron cells, which we believe cause the pyramidal cells to become active only in particular sequential contexts, or only when selectively dis-inhibited for other purposes (e.g. attention). Neocortex layers 2/3 and 5 are specifically and individually the inspirations for this model: Each Region-Layer object is supposed to approximate the collective cellular behaviour of a patch of just one of these cortical layers.

We assume the Region-Layer is trained by unsupervised learning only - it finds structure in its input without caring about associated utility or rewards. Learning should be continuous and online, as an agent learns from experience. It should adapt to non-stationary input statistics at any time.

The Region-Layer should be self-organizing: Given a surface of Region-Layer components, they should arrange themselves into a hierarchy automatically. [We may defer implementation of this feature and initially implement a manually-defined hierarchy]. Within each Region-Layer component, the cell populations should exhibit a form of competitive learning such that all cells are used efficiently to model the variety of input observed.

We believe the function of the Region-Layer is best described by Jeff Hawkins: To find spatial features and predictable sequences in the input, and replace them with patterns of cell activity that are increasingly abstract and stable over time. Cumulative discovery of these features over many Region-Layers amounts to an incremental transformation from raw data to fully grounded but abstract symbols.

Within a Region-Layer, Cells are organized into Columns (see figure 1). Columns are organized within the Region-Layer to optimally cover the distribution of active input observed. Each Column and each Cell responds to only a fraction of the input. Via these two levels of self-organization, the set of active cells becomes a robust, distributed representation of the input.
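The idea that only a few cells are active within each Column can be sketched as a per-column winner selection. This is an illustration only: we assume each cell already has a response score from some competitive learning algorithm (which one is left open here), and the function name sparse_column_activity and the parameter k are hypothetical.

```python
import numpy as np

def sparse_column_activity(scores, k=1):
    """Activate the k best-responding cells in every column.

    scores: (n_columns, cells_per_column) array of cell responses.
    Returns a binary array of the same shape with k ones per row.
    """
    scores = np.asarray(scores, dtype=np.float64)
    winners = np.argsort(-scores, axis=1)[:, :k]
    activity = np.zeros(scores.shape, dtype=np.uint8)
    np.put_along_axis(activity, winners, 1, axis=1)
    return activity

# Two columns with three cells each.
toy_scores = [[0.1, 0.9, 0.3],
              [0.7, 0.2, 0.4]]
one_winner = sparse_column_activity(toy_scores, k=1)
two_winners = sparse_column_activity(toy_scores, k=2)
```

Because every column contributes the same small number of active cells, the overall pattern stays sparse while still being spread across the whole Region-Layer.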

Given these properties, a surface of Region-Layer components should have nice scaling characteristics, both in response to changing the size of individual Region-Layer column / cell populations and the number of Region-Layer components in the hierarchy. Adding more Region-Layer components should improve input modelling capabilities without any other changes to the system.

So let's put our cards on the table and test these ideas.

Region-Layer Implementation

Parameters

For the algorithm outlined below very few parameters are required. The few that are mentioned are needed merely to describe the resources available to the Region-Layer. In theory, they are not affected by the qualities of the input data. This is a key characteristic of a general intelligence.

RW: Width of region layer in Columns
RH: Height of region layer in Columns
CW: Width of column in Cells
CH: Height of column in Cells

Inputs and Outputs

Feed-Forward Input (FFI): Must be sparse and binary. Size: A matrix of any dimension*.
Feed-Back Input (FBI): Sparse, binary. Size: A vector of any dimension.
Prediction Disinhibition Input (PDI): Sparse, rare. Size: Region Area+.
Feed-Forward Output (FFO): Sparse, binary and distributed. Size: Region Area+.

* the 2D shape of input[s] may be important for learning receptive fields of columns and cells, depending on implementation.

+ Region Area = CW * CH * RW * RH
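To make the sizes concrete, the four parameters and the Region Area formula can be captured in a small structure. This is a sketch with a hypothetical class name, and the example dimensions are arbitrary values chosen for illustration rather than settings we used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegionLayerConfig:
    rw: int  # RW: width of region layer in Columns
    rh: int  # RH: height of region layer in Columns
    cw: int  # CW: width of column in Cells
    ch: int  # CH: height of column in Cells

    @property
    def columns(self):
        """Number of Columns in the Region-Layer."""
        return self.rw * self.rh

    @property
    def cells_per_column(self):
        """Number of Cells in each Column."""
        return self.cw * self.ch

    @property
    def region_area(self):
        """Region Area = CW * CH * RW * RH (size of the PDI and FFO)."""
        return self.cw * self.ch * self.rw * self.rh

# Arbitrary example dimensions, for illustration only.
config = RegionLayerConfig(rw=20, rh=20, cw=4, ch=4)
area = config.region_area
```

With these example values the output and disinhibition vectors would each have 6400 elements, one per Cell.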

Pseudocode

Here is some pseudocode for iterative update and training of a Region-Layer. Both occur simultaneously.

We also have fully working code. In the next few blog posts we will describe some of our concrete implementations of this algorithm, and the tests we have performed on it. Watch this space!

function: UpdateAndTrain(
    feed_forward_input,
    feed_back_input,
    prediction_disinhibition
)

    // if no active feed-forward input, then do nothing
    if( sum( feed_forward_input ) == 0 ) {
        return
    }

    // Sparse activation
    // Note: Can be implemented via a Quilt[1] of any competitive learning algorithm,
    // e.g. Growing Neural Gas [2], Self-Organizing Maps [3], K-Sparse Autoencoder [4].
    activity(t) = 0
    for-each( column c ) {
        // find the cell[s] x that most respond to FFI
        // in current sequential context given:
        //   a) prior active cells in region
        //   b) feedback input.
        x = findBestCellsInColumn( feed_forward_input, feed_back_input, c )
        activity(t)[ x ] = 1
    }

    // Change detection
    // if active cells in region unchanged, then do nothing
    if( activity(t) == activity(t-1) ) {
        return
    }

    // Update receptive fields to organize columns
    trainReceptiveFields( feed_forward_input, columns )

    // Update cell weights given column receptive fields
    // and selected active cells
    trainCells( feed_forward_input, feed_back_input, activity(t) )

    // Predictive coding: output false-negative errors only [5]
    for-each( cell x in region-layer ) {
        coding = 0
        // a cell that is active now but was not predicted is a false negative
        if( ( activity(t)[x] == 1 ) and ( prediction(t-1)[x] == 0 ) ) {
            coding = 1
        }

        // optional: mute output from region, for attentional gating of hierarchy
        if( prediction_disinhibition(t)[x] == 0 ) {
            coding = 0
        }

        output(t)[x] = coding
    }

    // Update prediction
    // Note: Predictor can be as simple as first-order Hebbian learning.
    // The prediction model is variable order due to the inclusion of sequential
    // context in the active cell selection step.
    trainPredictor( activity(t), activity(t-1) )
    prediction(t) = predict( activity(t) )
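The last two steps of the pseudocode (predictive coding output and the predictor update) can be sketched in Python as below. This is a minimal illustration, not our working code: the class HebbianPredictor, its transition-count weights and its relative threshold are assumptions chosen to show the simplest first-order Hebbian predictor that the note in the pseudocode allows.

```python
import numpy as np

def predictive_coding_output(activity, prior_prediction, disinhibition):
    """Output 1 only for active cells that were not predicted and not muted.

    All arguments are binary vectors with one element per cell.
    """
    activity = np.asarray(activity, dtype=bool)
    false_negative = activity & ~np.asarray(prior_prediction, dtype=bool)
    gated = false_negative & np.asarray(disinhibition, dtype=bool)
    return gated.astype(np.uint8)

class HebbianPredictor:
    """First-order predictor that counts cell-to-cell transitions (assumed)."""

    def __init__(self, n_cells, threshold=0.5):
        self.weights = np.zeros((n_cells, n_cells))
        self.threshold = threshold  # fraction of the peak drive (assumed)

    def train(self, activity_now, activity_before):
        # Strengthen links from cells active at t-1 to cells active at t.
        self.weights += np.outer(activity_before, activity_now)

    def predict(self, activity_now):
        drive = np.asarray(activity_now, dtype=np.float64) @ self.weights
        peak = drive.max()
        if peak <= 0:
            return np.zeros(drive.shape, dtype=np.uint8)
        return (drive >= self.threshold * peak).astype(np.uint8)

# Toy run: three cells firing in the repeating order 0, 1, 2.
states = np.eye(3, dtype=np.uint8)
predictor = HebbianPredictor(n_cells=3)
for t in range(1, 7):
    predictor.train(states[t % 3], states[(t - 1) % 3])

prediction = predictor.predict(states[0])   # expect cell 1 next
all_open = np.ones(3, dtype=np.uint8)
surprise = predictive_coding_output(states[1], np.zeros(3), all_open)
no_surprise = predictive_coding_output(states[1], prediction, all_open)
```

Once the toy sequence has been learned, a correctly predicted cell produces no output, so only unexpected activity is passed up the hierarchy.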

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1401&rep=rep1&type=pdf
[2] https://papers.nips.cc/paper/893-a-growing-neural-gas-network-learns-topologies.pdf
[3] http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
[4] https://arxiv.org/pdf/1312.5663
[5] http://www.ncbi.nlm.nih.gov/pubmed/10195184

Project AGI

Building an Artificial General Intelligence

Tuesday 20 September 2016

Region-Layer Experiments