When Turing Award laureate Geoffrey Hinton speaks, the AI community listens. Last week, Hinton tweeted: “Finding the natural parts of an object and their intrinsic coordinate frames without supervision is a crucial step in learning to parse images into part-whole hierarchies. If we start with point clouds, we can do it!”
The comment accompanied the publication of Canonical Capsules: Unsupervised Capsules in Canonical Pose, a new paper in which Hinton and researchers from the University of British Columbia, the University of Toronto, Google Research, and the University of Victoria propose an architecture for unsupervised learning on 3D point clouds based on capsules.
But what exactly is a “capsule”? Hinton and fellow University of Toronto researchers Alex Krizhevsky and Sida D. Wang introduced the concept in their 2011 paper Transforming Auto-encoders. The work suggested convolutional neural networks might be “misguided” in what they were trying to achieve in computer vision, and proposed the use of local “capsules” that perform complex internal computations on their inputs and encapsulate the results into highly informative outputs. As the authors put it: “Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of ‘instantiation parameters’ that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity.”
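The quoted definition can be made concrete with a toy sketch. The code below is only illustrative, not the paper's implementation: the weights are random placeholders, and the names (`capsule_output`, the 6-parameter pose) are hypothetical. It shows the shape of a capsule's output: a presence probability plus a vector of instantiation parameters.

```python
import numpy as np

def capsule_output(features: np.ndarray) -> dict:
    """Toy capsule: maps an input feature vector to a presence probability
    and a set of instantiation parameters. Weights are random placeholders
    standing in for learned parameters."""
    rng = np.random.default_rng(0)
    w_presence = rng.normal(size=features.shape[0])
    w_pose = rng.normal(size=(6, features.shape[0]))  # e.g., a 2D affine pose

    presence = 1.0 / (1.0 + np.exp(-(w_presence @ features)))  # sigmoid, in (0, 1)
    pose = w_pose @ features  # unconstrained instantiation parameters
    return {"presence": presence, "pose": pose}

out = capsule_output(np.ones(8))
print(out["presence"])     # probability the entity is present
print(out["pose"].shape)   # (6,) instantiation parameters
```

A real transforming auto-encoder trains these weights so that the pose outputs change predictably when the input is transformed; the sketch only fixes the interface.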
Google Brain research scientist Sara Sabour elaborated on the idea in the CVPR 2019 tutorial Capsule Networks for Computer Vision: in the context of capsule networks, any group of semantically meaningful neurons in an artificial neural network can serve as a capsule. Sabour has been working with Hinton on capsule networks that use geometric relationships between parts to reason about objects.
As Synced previously reported, capsules are exceptionally good at understanding and encoding nuances, akin in this regard to modules in human brains. A capsule system understands an object by geometrically interpreting the organized set of its interrelated parts. Because these geometric relationships remain intact, the system can rely on them to identify objects even when the viewpoint changes, a property known as viewpoint invariance.
With the continued introduction of techniques such as dynamic routing and EM routing, capsule networks have been applied to medical imaging, language understanding, and even tasks involving 3D input data. In another capsule paper Hinton co-authored, Stacked Capsule Auto-Encoders, the researchers showed that “capsule-style reasoning is effective as far as primary capsules can be trained in an unsupervised fashion.” That study used an unsupervised version of a capsule network, in which a neural encoder trained through backpropagation looks at all image parts to infer the presence and poses of object capsules.
The researchers explain that when training 3D deep representations with capsule networks, the scene is perceived through its decomposition into a part hierarchy, with each part represented by two components:
- The capsule pose specifies the frame of reference of a part, and hence should be transformation equivariant
- The capsule descriptor specifies the appearance of a part, and hence should be transformation invariant
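The invariance/equivariance distinction above can be checked numerically. The sketch below is illustrative, not the paper's method: it uses a part's centroid as a stand-in “pose” and the sorted point-to-centroid distances as a stand-in “descriptor”. Under a rotation, the pose rotates with the part (equivariant) while the descriptor is unchanged (invariant).

```python
import numpy as np

def pose_and_descriptor(points: np.ndarray):
    """Toy pose/descriptor for a 3D part (stand-ins, not learned capsules)."""
    pose = points.mean(axis=0)  # frame of reference: the centroid
    desc = np.sort(np.linalg.norm(points - pose, axis=1))  # rotation-invariant
    return pose, desc

rng = np.random.default_rng(0)
part = rng.normal(size=(50, 3))  # a random 3D point cloud "part"

# Build a random proper rotation via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1  # flip one axis so det(Q) = +1 (rotation, not reflection)
rotated = part @ Q.T

p0, d0 = pose_and_descriptor(part)
p1, d1 = pose_and_descriptor(rotated)

print(np.allclose(p1, Q @ p0))  # True: pose is transformation equivariant
print(np.allclose(d0, d1))      # True: descriptor is transformation invariant
```

A trained capsule network must produce learned poses and descriptors with these same properties; the hand-crafted centroid and distance features here merely make the definitions tangible.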
In the new paper, the researchers propose a capsule architecture trained in an unsupervised fashion — by only observing pairs of randomly rotated 3D point clouds of the same object. To satisfy capsule invariance/equivariance properties in their Canonical Capsules method, the team used a K-fold decomposition that estimates primary capsules whose descriptors are invariant to rigid transformations and whose poses are transformation equivariant.
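Because each training pair is generated by applying known random rotations to the same object, the relative transform between the two views is available as a free supervisory signal. The loss sketch below is a hedged illustration of that idea, not the paper's actual objective: `pair_losses` and its arguments are hypothetical names, and a real encoder would produce the K capsule poses and descriptors.

```python
import numpy as np

def pair_losses(poses_a, descs_a, poses_b, descs_b, R_ab):
    """Penalties for K capsules extracted from two rotated views of one object.
    poses_*: (K, 3) capsule frames; descs_*: (K, D) capsule descriptors;
    R_ab: the known rotation taking view a to view b."""
    invariance = np.mean((descs_a - descs_b) ** 2)          # descriptors must match
    equivariance = np.mean((poses_a @ R_ab.T - poses_b) ** 2)  # poses must co-rotate
    return invariance + equivariance

# Toy check: ideal invariant/equivariant capsule outputs give zero loss.
rng = np.random.default_rng(1)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal transform
poses_a = rng.normal(size=(4, 3))
descs = rng.normal(size=(4, 16))
loss = pair_losses(poses_a, descs, poses_a @ R.T, descs, R)
print(loss)  # 0.0: descriptors identical, poses related exactly by R
```

Minimizing such terms over many randomly rotated pairs is what pushes the learned descriptors toward invariance and the learned poses toward equivariance without any labels.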
Google Brain staff research scientist Andrea Tagliasacchi, one of the paper’s authors, says the work realizes the concept of a ‘mental picture’ for unsupervised 3D deep learning, tweeting, “What can it do? This representation enables state-of-the-art results across a number of applications, such as canonicalization (i.e. registration), reconstruction (i.e. auto-encoding), as well as unsupervised classification!!!”
The paper Canonical Capsules: Unsupervised Capsules in Canonical Pose is on arXiv, and the researchers say they will release the code and dataset soon.