The goal of Natural Language Processing (NLP) is two-fold: to derive meaning from natural languages, and to generate well-formed and sensible expressions given some semantics. However, the understanding of semantics, or meaning, relies on an understanding of syntax, the set of rules that dictates how to build up complex sentences and ideas from basic parts. The NLP task of unsupervised learning of syntax mirrors the process by which an infant learns a language. In both cases, the agent primarily observes examples of well-formed language, called positive examples in NLP, in order to derive the rules underlying the production of proper expressions. Much research has been conducted on this task since the 1970s, and a few very recently published papers (see Jin 2019, Kim 2019, and Drozdov 2019 in related work) significantly improved syntax learning results by incorporating neural network architectures into their models.

However, the aforementioned papers only considered syntax acquisition to be a text-only learning task. In infants, the learning of language is accompanied by experiences in other modalities too, like visual, auditory, and even olfactory experiences. As a result, there is no reason not to include these other modalities as part of the learning task. As we will see later, the paper of interest in this blog post, **Visually Grounded Compound PCFG (Zhao 2020)**, explores the incorporation of image semantics into the learning of syntax structure. The paper introduces the titular model, Visually Grounded Compound PCFG (VC-PCFG), which achieves state-of-the-art performance on the syntax acquisition task, outperforming previous models that were trained only on text or trained with visually grounded learning but without Compound PCFG.

Before we dive into the paper, some **context** is needed to understand what Compound PCFG, or Compound Probabilistic Context-Free Grammar, is. A plain Context-Free Grammar (CFG) is a type of formal grammar, the goal of which is to “produce” strings of symbols that satisfy certain requirements. Specifically, it generates strings of a “language” from a “start symbol” by applying a set of rules to it in arbitrary order. Formally, we define an instance of CFG to be a 4-tuple:

*G* = (*V*, Σ, *R*, *S*)

**V** is a finite set of nonterminal symbols, which are transitional variables that appear in the process of producing strings but are eventually replaced with terminal symbols.

**Σ** is a finite set of terminal symbols, which are the only symbols allowed in the final form of the produced strings.

**R** is a finite relation in *V* × (*V* ∪ Σ)\*, which defines how to replace existing symbols with new symbols. Specifically, it defines how to map a nonterminal symbol to some combination of nonterminal symbols and terminal symbols. **R** is often called the set of “production rules”.

**S** is the start symbol, the symbol present at the very beginning before we apply any production rules.

This 4-tuple G uniquely defines a CFG, and a given CFG corresponds to the set of strings it can possibly produce. CFG is a fitting choice for modeling languages because its recursive nature closely follows that of natural languages. As a result, the objective of modeling a natural language can be reduced to the task of finding the correct CFG that generates exactly the set of all sensible sentences in that language, which implies that it neither over-generates (produces sentences that are not legal) nor under-generates (lacks the ability to produce some legal sentences).
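To make the 4-tuple concrete, here is a minimal sketch of a CFG in Python. The toy grammar below (the symbols, rules, and vocabulary) is a hypothetical illustration invented for this post, not a grammar from the paper:

```python
import random

# A toy CFG as a 4-tuple (V, Sigma, R, S); the grammar itself is hypothetical.
V = {"S", "NP", "VP"}                        # nonterminal symbols
Sigma = {"the", "dog", "cat", "sees", "sleeps"}  # terminal symbols
R = {                                        # production rules: nonterminal -> list of right-hand sides
    "S":  [["NP", "VP"]],
    "NP": [["the", "dog"], ["the", "cat"]],
    "VP": [["sees", "NP"], ["sleeps"]],
}
S = "S"                                      # start symbol

def generate(symbol=S):
    """Expand a symbol by applying rules until only terminals remain."""
    if symbol in Sigma:
        return [symbol]
    rhs = random.choice(R[symbol])           # pick a production rule arbitrarily
    return [tok for part in rhs for tok in generate(part)]

print(" ".join(generate()))                  # e.g. "the dog sees the cat"
```

Note how the recursion in `generate` mirrors the recursive structure of the grammar: a nonterminal keeps expanding until every symbol in the string is a terminal.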

Probabilistic CFG (PCFG) extends CFG by associating a non-negative scalar 𝝿_i with each production rule in **R**; the set of such scalars is denoted 𝝿. To make these 𝝿_i proper probability values, we normalize them so that the 𝝿_i for rules with the same left-hand-side symbol sum to 1. At production time, a nonterminal symbol applies a given production rule to itself with probability 𝝿_i. The addition of probability-based rule application further increases the flexibility of CFG in modeling languages.
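The normalization and sampling described above can be sketched as follows. The rule weights and the toy grammar are invented for illustration; only the mechanism (normalize per left-hand side, then sample rules by probability) is what matters:

```python
import random

# Hypothetical PCFG: each right-hand side carries a raw weight pi_i.
terminals = {"the", "dog", "cat", "sees", "sleeps"}
rules = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "dog"], 0.7), (["the", "cat"], 0.3)],
    "VP": [(["sees", "NP"], 0.4), (["sleeps"], 0.6)],
}

def normalize(rule_list):
    """Scale the weights for one left-hand side so they sum to 1."""
    total = sum(w for _, w in rule_list)
    return [(rhs, w / total) for rhs, w in rule_list]

def sample(symbol="S"):
    """Expand a symbol, choosing each rule with its normalized probability pi_i."""
    if symbol in terminals:
        return [symbol]
    rhss, probs = zip(*normalize(rules[symbol]))
    rhs = random.choices(rhss, weights=probs)[0]
    return [tok for part in rhs for tok in sample(part)]

print(" ".join(sample()))
```

Compared with the plain CFG, the only change is that rule choice is weighted rather than arbitrary, which lets the grammar assign different probabilities to different parses of the same language.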

However, PCFG’s strong context-free assumption hinders its ability to effectively induce grammar, and as a result, Kim 2019 introduces Compound PCFG (C-PCFG) as a further extension of PCFG. A compound probability distribution is a distribution whose parameters are themselves random variables, and in the case of C-PCFG, the probability distribution 𝝿 is itself a random variable conditioned on a latent variable **z** as such: