In 2010 my best friend, knowing that I worked with neural networks and was fascinated by the brain’s miracles, gave me the best gift: the book “On Intelligence” by Jeff Hawkins. The author presented a new way of looking at the brain’s extraordinary capacity to think: the neocortex makes predictions about the future based on past input data (data being everything we perceive through our sense organs), and that, in essence, is what thinking is.
This concept, so simple, especially when illustrated with real-life examples and experiments, and yet so convincing, really impressed me. It sounded very similar to the principles of neural networks, which have been known since the WW2 era. However, Hawkins claimed that if we tried to imitate not only the logic of the concept but also the biological structure of the neocortex, it could lead us to a new kind of artificial intelligence, faster and more efficient.
That theory seemed to me more philosophical than mathematical. After one of our philosophy classes on David Hume’s doctrines about the mind, I discussed “On Intelligence” with our professor as a kind of more logical and advanced version of Hume’s empiricism.
Last April I stumbled across the news that IBM had started a collaboration with a start-up called Numenta on an R&D project on brain-like software. I had a look to find out more about that start-up, and was very surprised to discover that its co-founder was Jeff Hawkins. Another surprise was that their Hierarchical Temporal Memory innovation was based on the theory described in “On Intelligence”. Alongside commercial products such as Grok, Numenta released an open-source framework, Nupic, to bring together people interested in this technology and let them collaborate and propose ideas for new applications. Since that discovery I have dived into Nupic, one of the most complex machine learning frameworks, which, contrary to my expectations about complex algorithms, still works well on real data.
Since the initial goal of Numenta was to imitate the neocortex, the terminology is similar to that used in neuroscience, with some deviations due to technical needs.
The core of Nupic is Hierarchical Temporal Memory, a mostly theoretical framework that is supposed to encompass all the algorithms and methods of Nupic.
Hierarchical Temporal Memory (HTM) is a machine learning technology that aims to capture the structural and algorithmic properties of the neocortex.
HTM White Paper, Numenta
Baby is learning to speak
To quickly understand how HTM works, imagine a baby who is learning to speak.
First of all, he learns to separate the sounds “m”, “a”, “p”, “a”: this is the first and lowest level of HTM. On the next level, the baby learns to combine these sounds into the syllables “ma” and “pa”; at this stage he has to learn only the combinations, not the separate sounds, as those have already been learnt at the lowest level. On a higher level the baby combines the syllables into words, and on the next level, words into phrases. Every level uses the output of the previous one, which is saved in memory, so the baby doesn’t need to relearn the basic sounds in order to compose phrases; he just manipulates the outputs of the previous level to get the right words.
So the principle is that if one already knows the more basic components, one can save time and resources and learn only the more complex sequences of these components.
It should be noted that traditional algorithms often face the following problem: suppose the baby learns “mama” the way his mother pronounces it, and one day a family friend from abroad comes to visit and pronounces “mama” with a different accent. The baby will not understand him because of that distortion.
Thanks to its special data processing, HTM copes with this in the following three steps:
Sparse Distributed Representation
Traditional binary methods store information in dense representations. For example, take two numbers that are close to each other, say 15 and 16: their binary representations (01111 and 10000) are nevertheless completely different, because in a dense code the individual bits carry no meaning of their own. The human neocortex has tens of billions of neurons, but at any instant only a small fraction of them is active. Nupic, by analogy, uses sparse representations in which every bit carries semantic meaning. For instance, the “ones” in the Nupic binary representation are shifted in proportion to the encoded value, so that close values share most of their active bits.
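To make the difference concrete, here is a minimal sketch in plain Python. It is not Nupic's own encoder: the function names and the parameters n (total bits) and w (active bits) are my own assumptions, chosen only to illustrate the idea of a block of ones that slides with the value.

```python
def dense_bits(value, width=8):
    """Ordinary fixed-width binary code, most significant bit first."""
    return [(value >> i) & 1 for i in reversed(range(width))]


def sparse_bits(value, min_value=0, max_value=100, n=120, w=21):
    """Toy scalar encoder (hypothetical, not Nupic's API).

    Produces n bits with a contiguous run of w ones whose start
    position is proportional to the encoded value.
    """
    if not min_value <= value <= max_value:
        raise ValueError("value outside the encoder range")
    fraction = (value - min_value) / float(max_value - min_value)
    start = int(round(fraction * (n - w)))
    return [1 if start <= i < start + w else 0 for i in range(n)]


def hamming(a, b):
    """Number of positions where the two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def overlap(a, b):
    """Number of positions where both codes have a one."""
    return sum(x & y for x, y in zip(a, b))


# Dense code: 15 and 16 differ in all five of their significant bits.
print(hamming(dense_bits(15), dense_bits(16)))

# Sparse code: neighbours share almost all active bits, distant values none.
print(overlap(sparse_bits(15), sparse_bits(16)))
print(overlap(sparse_bits(15), sparse_bits(80)))
```

In the dense code, closeness of the values tells us nothing about closeness of the bit patterns; in the sparse code, the number of shared ones is itself a similarity measure.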
The output of the encoder, a sparse distributed representation, is the input for the spatial pooler, and here the miracles start to happen. Every input contains only a small fraction of active bits, scattered (“sparsed”) across the input space. When the inputs of two columns are compared, matching active bits signify semantic similarity, which helps to recognize a pattern even in the presence of noise (the different accent of the family friend’s “mama”).
The framework is always exposed to time-changing inputs, and the Temporal Pooler takes this time dimension into account. It can be viewed as a particular and somewhat more detailed kind of Spatial Pooler, where we refer not to active columns but to their components: cells. When a cell becomes active, it forms connections to other cells that were active just before, and in this way the sequence is stored in memory, distributed among the individual cells. Every cell participates in many different sequences and patterns. An active cell can predict what will happen next just by looking at its connections in the sequence.
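The noise tolerance can be illustrated with a small sketch. This is not the spatial pooler itself, only the overlap comparison it relies on; the sizes (2048 bits, 40 active), the helper names, and the way noise is injected are all assumptions made for the illustration.

```python
import random


def random_sdr(rng, size=2048, active=40):
    """A random sparse pattern, stored as the set of its active bit indices."""
    return frozenset(rng.sample(range(size), active))


def add_noise(sdr, rng, size=2048, flips=10):
    """Move `flips` of the active bits to random, previously inactive positions."""
    kept = set(rng.sample(sorted(sdr), len(sdr) - flips))
    free = [i for i in range(size) if i not in sdr]
    moved = set(rng.sample(free, flips))
    return frozenset(kept | moved)


def overlap(a, b):
    """Number of active bits the two patterns have in common."""
    return len(a & b)


rng = random.Random(42)
mama = random_sdr(rng)
papa = random_sdr(rng)
accented_mama = add_noise(mama, rng)  # "mama" with a foreign accent

# The distorted pattern still shares 30 of its 40 bits with "mama",
# while two unrelated random patterns share almost none.
print(overlap(mama, accented_mama))
print(overlap(papa, accented_mama))
```

Because unrelated sparse patterns almost never share many bits by chance, a simple overlap threshold is enough to say that the accented word is still “mama”.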
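The idea of "connecting to what was active just before" can be shown with a deliberately simplified toy. It is a first-order sequence memory over symbols, not the Temporal Pooler: real HTM cells within columns keep track of much longer context, and the class and method names here are hypothetical.

```python
from collections import defaultdict


class ToySequenceMemory(object):
    """Remembers, for each input, which inputs have followed it so far."""

    def __init__(self):
        # previous symbol -> set of symbols that were seen right after it
        self.connections = defaultdict(set)
        self.previous = None

    def step(self, symbol):
        """Learn from one input and return the set of predicted next inputs."""
        if self.previous is not None:
            self.connections[self.previous].add(symbol)
        self.previous = symbol
        return set(self.connections.get(symbol, ()))


memory = ToySequenceMemory()
for symbol in "mamapapa":
    predicted = memory.step(symbol)
    print(symbol, sorted(predicted))
```

After seeing “mamapapa”, the memory predicts “a” after “m” or “p”, and either “m” or “p” after “a”. Resolving that last ambiguity is exactly what the extra context held by individual cells is meant to do in the real algorithm.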
How to use it
Let us take a bit of source code from Numenta’s official GitHub repository for the simplest example: sine wave prediction.
The overall learning scheme is pretty similar to that of traditional algorithms.
The class ModelFactory creates the model that will predict ‘predictedField’ over the test dataset, based on the description dictionary MODEL_PARAMS.
By choosing “TemporalAnomaly” as the inference type, we also ask for the anomaly score, mostly to observe more easily how the model improves over time. A high anomaly score flags a value that was not expected. For instance, if the anomaly score is normalized between zero and one, zero represents a completely predicted value, whereas one represents a completely anomalous value.
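As a rough sketch of how such a normalized score can be computed, assume (this is my reading, not code taken from the repository) that the score is the fraction of currently active columns that the model did not predict at the previous step. The function name and the example column indices are hypothetical.

```python
def raw_anomaly_score(active_columns, predicted_columns):
    """Fraction of active columns that were not predicted (0.0 to 1.0)."""
    active = set(active_columns)
    if not active:
        # Nothing is active, so nothing can be unexpected.
        return 0.0
    predicted = set(predicted_columns)
    unexpected = len(active) - len(active & predicted)
    return unexpected / float(len(active))


# Everything that became active was predicted: no anomaly.
print(raw_anomaly_score([3, 7, 9, 12], [3, 7, 9, 12]))

# Half of the active columns came as a surprise.
print(raw_anomaly_score([3, 7, 9, 12], [3, 7, 20, 21]))

# Nothing was predicted: a completely anomalous input.
print(raw_anomaly_score([3, 7, 9, 12], []))
```

Early in learning the model predicts almost nothing, so the score stays near one; as the sequence memory fills in, more of the active columns are anticipated and the score drops.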
The MODEL_PARAMS dictionary is the output of swarming, the method that determines the components of the model and learns the best parameters for each component over the training dataset. The best prediction is chosen by maximizing the likelihood function.
The plot shows the evolution of learning and prediction over time. It becomes clear how, in the beginning, every sine wave looks like an anomaly to the algorithm and its score is very high, whereas by the end of learning the algorithm understands that the wave is a pattern and the score decreases to zero.
At second glance?
So that was the simplest example, just to show the main functionalities of Nupic: prediction and anomaly detection. The theoretical basis, the code examples, and the implementations of the algorithms are documented and, with a growing community, much better supported nowadays. The next article will cover the technical aspects of applying Nupic to one of the Kaggle datasets.
Challenges organized by Numenta make it possible to discover new ways of applying this technology. To conclude, I would like to quote Jeff Hawkins, who at the beginning of his career tried to join the MIT AI Lab and proposed investigating biology and neuroscience in order to bring new ideas into AI.
We’re just going to program computers; that’s all we need to do.
And I said, no, you really ought to study brains.
They said, oh, you know, you’re wrong.
And I said, no, you’re wrong, and I didn’t get in.
about his job interview at AI lab, MIT