
Detecting relationships between objects remains a tremendous challenge in modern computer vision.

Relational reasoning is a key component of fluid intelligence. From infancy we learn to detect relationships between objects in space and time. Relational reasoning is so ubiquitous in our thinking that we barely notice it. Every time we decide which road to take, or piece together different episodes of a movie thriller or a detective novel to figure out the ending, we are effectively using this cognitive skill. To date, we know very little about the brain mechanisms that allow us to detect spatial and temporal relationships between objects and track how they evolve over time. From the artificial intelligence (AI) standpoint, relational reasoning has proven to be an elusive goal for most deep neural network models. A couple of years ago, DeepMind published one of the most important papers in this area.
As humans, we infer relationships between objects from spatial and temporal signals all the time. However, very little is known about the cognitive mechanisms behind those inferences. A relation can have a cardinality associated with it. For instance, "Joe is taller than Peter" is a relationship of cardinality 2, while "Joe swapped his bicycle for Peter's scooter" is a relationship of cardinality 4. Over the years, cognitive psychologists have developed different theories of relational reasoning. One of my favorite theories explains relational reasoning in five key principles:
a) The structure of mental models is iconic as far as possible.
b) The logical consequences of relations emerge from models constructed from the meanings of the relations and from knowledge.
c) Individuals tend to construct only a single, typical model.
d) They spontaneously develop their own strategies for relational reasoning.
e) Regardless of strategy, the difficulty of an inference depends on the process of integrating the information from separate premises, the number of entities that have to be integrated to form a model, and the depth of the relation.
Although we know little about relational reasoning, we can measure it very effectively. Raven's Progressive Matrices (RPM) are a classic test used to quantify relational reasoning. The test is composed of a series of visuospatial tasks that require participants to identify relevant stimulus features based on the spatial organization of an array of stimuli, and then select the choice stimulus that matches one or more of these identified features.
AI is no stranger to relational reasoning. Symbolist models based on logic are inherently relational. Those models define the relations between symbols using the language of logic and mathematics, and then reason about these relations using a multitude of powerful methods, including deduction, arithmetic, and algebra. However, symbolic models are very vulnerable to what is known as the symbol grounding problem, which produces large variations in the outputs based on small changes in the inputs. Neural networks have proven to be incredibly robust to input variations, and combining neural networks and logic has therefore become a very active area of research in AI.
The DeepMind research paper proposes a method based on relational networks (RNs) augmented with other artifacts in order to infer relationships from unstructured inputs such as images or text datasets. For instance, for a given image input, the RN model should be able to derive relational knowledge such as the following:
In its simplest form, an RN model is expressed as a mathematical composite function:

RN(O) = fɸ(Σi,j gθ(oi, oj))

Don't feel intimidated by the math nomenclature. The formula is relatively easy to explain:
a) The relational network for O (where O is the set of objects whose relations you want to learn) is a function fɸ.
b) gθ is another function that takes two objects, oi and oj. The output of gθ is the 'relation' that we are concerned about. In simpler terms, gθ calculates the relationship between two objects.
c) Σi,j means: calculate gθ for all possible pairs of objects, and then sum the results.
In its simplest form, both gθ and fɸ are multi-layer perceptrons.
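The composite function above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: gθ and fɸ are tiny untrained MLPs with random weights, and the dimensions are arbitrary, just to show how relations over all object pairs are summed and then passed through fɸ.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden, out_dim):
    """Build a tiny two-layer perceptron with ReLU; returns a callable."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W1, 0) @ W2

obj_dim, rel_dim = 4, 8                    # illustrative sizes
g_theta = mlp(2 * obj_dim, 16, rel_dim)    # relation between a pair of objects
f_phi = mlp(rel_dim, 16, 2)                # maps the summed relations to an output

def relational_network(objects):
    # Sum g_theta over all ordered pairs (o_i, o_j), then apply f_phi.
    pair_sum = sum(
        g_theta(np.concatenate([o_i, o_j]))
        for o_i in objects for o_j in objects
    )
    return f_phi(pair_sum)

objects = [rng.standard_normal(obj_dim) for _ in range(3)]
print(relational_network(objects).shape)  # (2,)
```

Because the sum runs over every pair, the pair-scoring function gθ sees all object combinations, which is what lets the network consider every potential relation rather than a hand-picked subset.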
Relational networks are not only great at detecting relationships between objects but they are also easily composable into other models. The DeepMind team took advantage of this characteristic to create an architecture that can detect relationships in both images and text datasets.
In the previous architecture, textual questions are processed with an LSTM to produce a question embedding, and images are processed with a CNN to produce a set of objects for the RN. Objects (three examples illustrated here in yellow, red, and blue) are constructed using feature-map vectors from the convolved image. The RN considers relations across all pairs of objects, conditioned on the question embedding, and integrates all these relations to answer the question.
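The wiring of that composite architecture can be sketched as follows. This is a hedged stand-in, not DeepMind's code: the CNN and LSTM are replaced with random feature vectors, all dimensions are illustrative, and the key point is only that the question embedding is concatenated into every pair before gθ is applied.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(in_dim, hidden, out_dim):
    """Tiny untrained two-layer perceptron with ReLU."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W1, 0) @ W2

obj_dim, q_dim, rel_dim, n_answers = 24, 32, 64, 10   # illustrative sizes
g_theta = mlp(2 * obj_dim + q_dim, 128, rel_dim)      # pair relation, conditioned on question
f_phi = mlp(rel_dim, 128, n_answers)                  # summed relations -> answer scores

# Stand-ins for the real feature extractors:
cnn_objects = [rng.standard_normal(obj_dim) for _ in range(4)]  # feature-map vectors from a CNN
question_embedding = rng.standard_normal(q_dim)                 # final LSTM state

# Every object pair is scored together with the question embedding, then summed.
pair_sum = sum(
    g_theta(np.concatenate([o_i, o_j, question_embedding]))
    for o_i in cnn_objects for o_j in cnn_objects
)
answer_logits = f_phi(pair_sum)
print(answer_logits.shape)  # (10,)
```

Conditioning gθ on the question embedding is what makes the same RN reusable across questions: the pair-scoring function can emphasize different relations depending on what is being asked.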
The DeepMind team benchmarked their relational network architecture using a series of datasets optimized for relational reasoning. For image analysis, the researchers used a dataset called CLEVR, which contains images of 3D-rendered objects, such as spheres and cylinders. Each image is associated with a number of questions that fall into different categories. For text-based questions, DeepMind used a dataset called bAbI, which includes 20 tasks, each corresponding to a particular type of reasoning, such as deduction, induction, or counting. Each question is associated with a set of supporting facts. For example, the facts "Sandra picked up the football" and "Sandra went to the office" support the question "Where is the football?"
In the image tests, the DeepMind relational network achieved close to superhuman performance with an astonishing 96% accuracy, about 20% higher than other relational models. In the text tests, the DeepMind model succeeded at 18 of the 20 tasks, exhibiting a 2.1% margin of error.
The DeepMind research clearly showed that relational networks can become an important building block for deep learning models that require relational reasoning capabilities. The tests demonstrated that relational networks are not only capable of developing broad reasoning abilities but can do so at significant scale.