The goal of Natural Language Processing (NLP) is two-fold: to derive meaning from natural languages, and to generate well-formed and sensible expressions given some semantics. However, the understanding of semantics or meaning relies on the understanding of syntax, which is the set of rules that dictates how to build up complex sentences and ideas from basic parts. The NLP task of unsupervised learning of syntax mirrors the process by which an infant learns a language. In both cases, the agent primarily observes examples of well-formed language, called positive examples in NLP, in order to derive the rules underlying the production of proper expressions. Much research has been conducted on this task since the 1970s, and a few papers published very recently (see Jin 2019, Kim 2019, and Drozdov 2019 in related work) significantly improved syntax learning results by incorporating neural network architectures into their models.
However, the aforementioned papers considered syntax acquisition to be a text-only learning task. In infants, the learning of language is accompanied by experiences in other modalities too, such as visual, auditory, and even olfactory experiences. As a result, there is no reason not to include these other modalities as part of the learning task. As we will see later, the paper of interest in this blog post, Visually Grounded Compound PCFG (Zhao 2020), explores the incorporation of image semantics into the learning of syntax structure. The paper introduces the titular model, Visually Grounded Compound PCFG (VC-PCFG), which achieves state-of-the-art performance on the syntax acquisition task, outperforming previous models that were trained only on text or trained with visual grounding but without Compound PCFG.
Before we dive into the paper, some context is needed to understand what Compound PCFG, or Compound Probabilistic Context-Free Grammar, is. A plain Context-Free Grammar (CFG) is a type of formal grammar whose goal is to “produce” strings of symbols that satisfy certain requirements. Specifically, it generates strings from a “start symbol” by repeatedly applying a set of rules in arbitrary order; the set of all strings a CFG can generate is its “language”. Formally, we define an instance of a CFG to be a 4-tuple:
G = (V, Σ, R, S)
V is a finite set of nonterminal symbols, which are transitional variables that appear in the process of producing strings but are eventually replaced with terminal symbols.
Σ is a finite set of terminal symbols, which are the only symbols allowed in the final form of the produced strings.
R is a finite relation from V to (V ∪ Σ)*, which defines how to replace an existing symbol with new symbols. Specifically, each rule maps a nonterminal symbol to some string of nonterminal and terminal symbols. R is often called the set of “production rules”.
S is the start symbol, the symbol present at the very beginning, before any production rules have been applied.
This 4-tuple G uniquely defines a CFG, and a given CFG corresponds to the set of strings it can produce. CFG is a fitting choice for modeling languages because its recursive nature closely follows that of natural languages. As a result, the objective of modeling a natural language can be reduced to the task of finding the correct CFG that generates exactly the set of all sensible sentences in that language, meaning that it neither over-generates (produces sentences that are not legal) nor under-generates (fails to produce some legal sentences).
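To make the 4-tuple concrete, here is a minimal sketch of a toy CFG in Python. The grammar, symbol names, and vocabulary are illustrative inventions of this post, not anything from the paper:

```python
import random

# A toy CFG as a 4-tuple (V, SIGMA, R, S); everything here is made up
# purely for illustration.
V = {"S", "NP", "VP", "Det", "N", "V"}
SIGMA = {"the", "a", "dog", "cat", "sees", "chases"}
R = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["sees"], ["chases"]],
}
S = "S"

def generate(symbol=S):
    """Expand `symbol` recursively until only terminal symbols remain."""
    if symbol in SIGMA:                # terminal symbol: emit as-is
        return [symbol]
    rhs = random.choice(R[symbol])     # pick a production rule arbitrarily
    return [tok for sym in rhs for tok in generate(sym)]

print(" ".join(generate()))            # e.g. "the dog chases a cat"
```

Each call to generate() walks the rules recursively from S, which is exactly the recursive structure that makes CFGs a good fit for natural language.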
Probabilistic CFG (PCFG) extends CFG by associating a non-negative scalar with each production rule in the set R; the set of such scalars is denoted 𝝿. To make each 𝝿_i a proper probability value, we normalize the values so that the 𝝿_i for rules with the same left-hand-side symbol sum to 1. At production time, a nonterminal symbol applies production rule i to itself with probability 𝝿_i. The addition of probability-based rule application further increases the flexibility of CFG in modeling languages.
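Continuing the toy sketch above, a PCFG attaches a weight to each rule and normalizes per left-hand-side symbol. The weights below are made-up illustrative numbers:

```python
import random

# Toy PCFG: each rule carries a weight; normalizing per left-hand side
# turns the weights into the probabilities pi_i (numbers are invented).
WEIGHTED_R = {
    "S":   [(["NP", "VP"], 1.0)],
    "NP":  [(["Det", "N"], 1.0)],
    "VP":  [(["V", "NP"], 1.0)],
    "Det": [(["the"], 3.0), (["a"], 1.0)],
    "N":   [(["dog"], 1.0), (["cat"], 1.0)],
    "V":   [(["sees"], 1.0), (["chases"], 2.0)],
}

def sample(symbol="S"):
    """Sample a string: each nonterminal picks rule i with probability pi_i."""
    if symbol not in WEIGHTED_R:                       # terminal symbol
        return [symbol]
    rules, weights = zip(*WEIGHTED_R[symbol])
    total = sum(weights)                               # normalize per LHS
    rhs = random.choices(rules, [w / total for w in weights])[0]
    return [tok for sym in rhs for tok in sample(sym)]

print(" ".join(sample()))   # "the" is now three times likelier than "a"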
However, PCFG’s strong context-free assumption hinders its ability to effectively induce grammar, and as a result, Kim 2019 introduces Compound PCFG (C-PCFG) as a further extension of PCFG. A compound probability distribution is a distribution whose parameters are themselves random variables, and in the case of C-PCFG, the probability distribution 𝝿 is itself a random variable conditioned on a latent variable z, as such:

z ∼ p(z),   𝝿_r = g_r(z; θ)
where p(z) is a predetermined prior distribution of the latent variable z, and g_r(z; θ) is a neural-network-based distribution parameterized by θ that produces a proper set of production rule probabilities 𝝿_r from z. Depending on what kind of rule r is, g_r(z; θ) takes one of three forms:

𝝿_{S→A} ∝ exp(u_Aᵀ f₁([w_S; z]))
𝝿_{A→BC} ∝ exp(u_{BC}ᵀ [w_A; z])
𝝿_{T→w} ∝ exp(u_wᵀ f₂([w_T; z]))
where u are parameter vectors, w are symbol embeddings, [ ; ] indicates vector concatenation, and f are neural networks. In plain words, for each kind of production rule, an encoding is built from the symbol embedding w concatenated with the randomly sampled latent variable z (passed through a neural network f for start and terminal rules). This encoding is then multiplied with a rule-specific parameter vector u, and the result is normalized in a softmax fashion into a proper probability value.
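Here is a rough numerical sketch of this parameterization. The dimensions are toy values and the randomly initialized arrays stand in for trained parameters; only the shape of the computation matches the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 8                      # toy embedding size / number of candidate rules
W = rng.normal(size=(D, 2 * D))   # weights of a one-layer stand-in for f

def f(x):
    """Stand-in for the neural network f in g_r(z; theta)."""
    return np.tanh(W @ x)

z = rng.normal(size=D)            # latent variable sampled from the prior p(z)
w_A = rng.normal(size=D)          # embedding of the left-hand-side symbol
U = rng.normal(size=(K, D))       # one parameter vector u_r per candidate rule

scores = U @ f(np.concatenate([w_A, z]))   # u_r^T f([w_A; z]) for every rule r
pi = np.exp(scores - scores.max())
pi /= pi.sum()                             # softmax -> proper rule probabilities
print(pi.round(3), pi.sum())               # K probabilities summing to 1
```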
There are two parts to the loss function optimized in this paper. The first is an alignment loss between a set of image representations and a corresponding set of sentences in a contrastive learning framework, which is a paradigm of self-supervised learning where the objective is to contrast two examples and tell whether they are similar or dissimilar. Specifically, this part of the loss sums the alignment loss over all image-sentence pairs (v_i, w_i):

L_align = Σ_i s(v_i, w_i)
where s(v_i, w_i) is the loss of aligning v_i and w_i, defined as:

s(v, w) = E_{t ∼ p_θ(t|w)} [ Σ_{c ∈ t} h(c, v) ]
which is the expected image-sentence matching loss under the tree distribution p_θ(t|w). Here, h(c, v) is the hinge loss of aligning the unlabeled constituent c with the image v, defined as:

h(c, v) = E_{c′} [m(c′, v) − m(c, v) + ε]_+ + E_{v′} [m(c, v′) − m(c, v) + ε]_+
where [·]_+ = max(0, ·), ε is a positive margin, and m(c, v) = cos(c, v) is the matching function measuring the similarity between the constituent representation c and the image representation v. Note that the expectations are taken with respect to negative examples c′ and v′.
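A small sketch of this hinge loss, approximating each expectation over negatives with a single sampled negative (a common mini-batch practice); the margin value is illustrative:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity, the matching function m(c, v)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def hinge(c, v, c_neg, v_neg, eps=0.2):
    """h(c, v) with one negative constituent c_neg and one negative image v_neg.

    eps is the positive margin; 0.2 is an arbitrary illustrative choice.
    """
    pos = cos(c, v)
    return (max(0.0, cos(c_neg, v) - pos + eps)    # contrast against c'
            + max(0.0, cos(c, v_neg) - pos + eps)) # contrast against v'

rng = np.random.default_rng(0)
c, v, c_neg, v_neg = (rng.normal(size=64) for _ in range(4))
print(hinge(c, v, c_neg, v_neg))
```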
The right side of the equation for s(v_i, w_i) can be further broken down into:

s(v, w) = Σ_c p_θ(c|w) h(c, v)
where the term p(c|w), the conditional probability of the span c given w, emerges. This term can be computed efficiently with the inside algorithm and automatic differentiation, as discussed in Eisner’s 2016 paper.
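The following is a compact sketch of that computation for a toy PCFG in Chomsky normal form: run the inside algorithm to get the log partition function, then read off the span marginals p(c|w) as gradients of log Z with respect to dummy per-span scores, which is the automatic-differentiation trick from Eisner 2016. All shapes and parameters are toy stand-ins, not the paper's model:

```python
import torch

def span_marginals(emit, rules, root, words):
    """Inside algorithm for a toy PCFG in Chomsky normal form, plus p(c|w).

    emit:  (K, V) log pi(A -> word)    rules: (K, K, K) log pi(A -> B C)
    root:  (K,)   log pi(S -> A)       words: the sentence as word ids
    """
    n, K = len(words), emit.size(0)
    # Dummy per-span scores: d log Z / d s[i, j] equals the posterior
    # probability that span (i, j) is a constituent (Eisner 2016).
    s = torch.zeros(n, n, requires_grad=True)
    beta = {}                                      # inside log-scores per span
    for i, w in enumerate(words):
        beta[i, i] = emit[:, w] + s[i, i]
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            parts = [rules + beta[i, k].view(1, K, 1)       # B covers (i, k)
                           + beta[k + 1, j].view(1, 1, K)   # C covers (k+1, j)
                     for k in range(i, j)]                  # all split points
            beta[i, j] = torch.logsumexp(torch.stack(parts),
                                         dim=(0, 2, 3)) + s[i, j]
    log_z = torch.logsumexp(root + beta[0, n - 1], dim=0)
    log_z.backward()
    return s.grad                                  # p(c|w) for each span (i, j)

torch.manual_seed(0)
K, V = 4, 10
print(span_marginals(torch.randn(K, V), torch.randn(K, K, K),
                     torch.randn(K), words=[1, 4, 2, 7]))
```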
The paper then goes into great detail about how to construct representations for spans largely independently of the parser. For the sake of brevity, this will not be covered in this blog post; interested readers can refer to the VC-PCFG paper itself. We will go straight to the next important innovation of this paper.
The second part of the loss is a classic log-likelihood objective on the text data. In other words, we optimize the C-PCFG parameters so that the training data is maximally likely. Specifically, we optimize the Evidence Lower Bound (ELBO):

ELBO(θ, φ; w) = E_{q_φ(z|w)} [log p_θ(w|z)] − KL(q_φ(z|w) ‖ p(z))

where q_φ(z|w) is a variational posterior over the latent variable z produced by an inference network.
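As a generic VAE-style sketch of this objective, assume a diagonal Gaussian q_φ(z|w) and a standard normal prior (the standard setup; the details of the paper's inference network are omitted here):

```python
import torch

def elbo(log_likelihood, mu, log_var):
    """Single-sample ELBO with diagonal Gaussian q(z|w) and N(0, I) prior.

    log_likelihood: log p_theta(w | z) for z sampled from q (it could be
    computed with the inside algorithm above); mu, log_var parameterize q.
    """
    # KL(N(mu, sigma^2) || N(0, I)) in closed form
    kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - 1.0 - log_var)
    return log_likelihood - kl

mu, log_var = torch.zeros(16), torch.zeros(16)
z = mu + (0.5 * log_var).exp() * torch.randn(16)   # reparameterization trick
print(elbo(torch.tensor(-42.0), mu, log_var))      # = -42.0 when q == prior
```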
Previously, with the parser optimized only by matching images and constituents, it would focus on learning simple and local constituents such as noun phrases. In addition, many longer phrases, such as verb phrases, have no precise grounding in the image and as a result do not yield clear learning signals. As shown in the previous paper on visual grounding (Shi 2019), a parser trained with only the contrastive loss has extremely skewed performance across different constituency categories. Adding this language modeling objective to the loss function helps the parser learn the parts not covered by image-text matching.
To summarize, the overall loss function is the sum of the contrastive loss and the language modeling loss.
After optimizing the model parameters θ with the above loss, we can seek the most probable parse tree t* of an input sentence w by taking the argmax over the entire distribution of the latent variable z. Specifically:

t* = argmax_t ∫ p_θ(t|w, z) p_θ(z|w) dz

Since this integral is intractable, in practice it is approximated by conditioning on a point estimate of z, such as the mean of the variational posterior.
As shown in the equation, this process does not need access to visual samples v paired with the caption. However, as discussed below, jointly training the C-PCFG with images results in a better model.
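As a sketch, the score of the most probable parse under a fixed z can be found with a Viterbi variant of the inside pass from earlier, replacing logsumexp with max; backpointers, omitted here for brevity, would recover the tree t* itself:

```python
import torch

def best_parse_score(emit, rules, root, words):
    """Viterbi variant of the inside pass above: max replaces logsumexp.

    In practice z is first fixed to a point estimate (e.g., the variational
    posterior mean), which determines emit, rules, and root.
    """
    n, K = len(words), emit.size(0)
    beta = {(i, i): emit[:, w] for i, w in enumerate(words)}
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            parts = [rules + beta[i, k].view(1, K, 1)
                           + beta[k + 1, j].view(1, 1, K)
                     for k in range(i, j)]
            beta[i, j] = torch.stack(parts).amax(dim=(0, 2, 3))
    return (root + beta[0, n - 1]).max()           # log-score of the best tree
```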
The VC-PCFG model outperforms all baselines according to both corpus-level F1 score and sentence-level F1 score. In particular, VC-PCFG achieves a much higher F1 score (+5.7%) compared to the baseline of a C-PCFG trained only on captions without images.
The novel model proposed in this paper, Visually Grounded Compound PCFG, achieves a significant improvement over previous models trained with only text, or trained jointly with text and images but not based on C-PCFG. This highlights the importance of incorporating data from other modalities in the training of NLP models, even if the end product does not depend on those modalities at all. This paper opens up further research into multi-modal learning of natural language structures.
Visually Grounded Neural Syntax Acquisition (Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019)
This paper can be seen as a spiritual precursor of the VC-PCFG paper discussed above; one main focus of VC-PCFG is on how its model solves the problems and limitations of this previous iteration. The paper introduces the Visually Grounded Neural Syntax Learner (VG-NSL), a neural network-based model that learns constituency parsing by processing both sentences and associated images. The model consists of two modules: a textual module that generates a constituency parse tree, and a visual-semantic module for embedding images along with the parsed constituents. The optimization is done by alternately fixing one module and updating the other using the REINFORCE algorithm (Williams, 1992).
Compound Probabilistic Context-Free Grammars for Grammar Induction (Yoon Kim, Chris Dyer, and Alexander Rush. 2019)
This paper introduces the Compound PCFG model, which serves as the modeling basis of the VC-PCFG paper. The authors present C-PCFG as an improvement over PCFG, as the latter has been shown to be difficult to optimize with direct methods like the EM algorithm. Broadly speaking, C-PCFG is a continuous mixture of PCFGs marginalizing over a latent variable z as well as the embedding of the input sentence. This flexible distribution over probability distributions of production rule application makes C-PCFG significantly easier to optimize.
Unsupervised Learning of PCFGs with Normalizing Flow (Lifeng Jin, Finale Doshi-Velez, Timothy Miller, Lane Schwartz, and William Schuler. 2019)
This paper proposes a neural PCFG inducer that takes advantage of contextual information. It is in the same line of thinking as VC-PCFG, except that it tries to extract semantic information from the text alone. The proposed model employs context embeddings in a normalizing flow model in order to extend PCFG induction to use semantic and morphological information. In addition, the paper incorporates a linguistically motivated similarity penalty and categorical distance constraints as regularization.
Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Auto-Encoders (Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019)
The authors present the Deep Inside-Outside Recursive Autoencoder (DIORA), an unsupervised method for inducing syntactic trees and representations of constituent spans. This paper is another text-only approach to unsupervised syntax learning and constituency parsing. Besides not having visual grounding, this method differs from VC-PCFG in that it seeks constituency representations in the form of a latent tree chart instead of a PCFG.
Incorporating Visual Semantics into Sentence Representations within a Grounded Space (Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari. 2019)
This paper also seeks to ground language understanding in the visual world. However, it differs from VC-PCFG in that it attempts to learn sentence embeddings rather than a structural representation of sentences like a C-PCFG. Specifically, the authors propose to incorporate visual semantics through an intermediate space in which the objectives are learned. Several transfer tasks are studied to show the advantage of this new approach over previous grounding methods.