## Zhemin Zhu defends PhD thesis on Co-occurrence Rate Networks

**Co-occurrence Rate Networks: Towards separate training for undirected graphical models**

by Zhemin Zhu

Dependence is a universal phenomenon which can be observed everywhere.
In machine learning, probabilistic graphical models (PGMs) represent dependence
relations with graphs. PGMs find wide applications in natural language
processing (NLP), speech processing, computer vision, biomedicine, information
retrieval, etc. Many traditional models, such as hidden Markov models
(HMMs), Kalman filters, can be put under the umbrella of PGMs. The central
idea of PGMs is to decompose (factorize) a joint probability into a product of
local factors. Learning, inference and storage can be conducted efficiently over
the factorization representation.

Two major types of PGMs can be distinguished: (i) Bayesian networks
(directed graphs), and (ii) Markov networks (undirected graphs). Bayesian networks
represent *directed dependence* with *directed edges*. Local factors of Bayesian
networks are *conditional probabilities*. Directed dependence, directed edges
and conditional probabilities are all *asymmetric* notions. In contrast, Markov
networks represent *mutual dependence* with *undirected edges*. Both of mutual
dependence and undirected edges are *symmetric* notions. For general Markov
networks, based on the Hammersley–Clifford theorem, local factors are *positive
functions* over maximum cliques. These local factors are explained using
intuitive notions like ‘compatibility’ or ‘affinity’. Specially, if a graph forms
a clique tree, the joint probability can be reparameterized into a junction tree
factorization.

In this thesis, we propose a novel framework motivated by the Minimum
Shared Information Principle (MSIP):
*We try to find a factorization in which the information shared between factors is
minimum. In other words, we try to make factors as independent as possible.*

The benefit of doing this is that we can train factors separately without paying
a lot of efforts to guarantee consistency between them. To achieve this goal,
we develop a theoretical framework called co-occurrence rate networks (CRNs)
to obtain such a factorization. Briefly, given a joint probability, the CRN
factorization is obtained as follows. We first strip off singleton probabilities from
the joint probability. The quantity left is called co-occurrence rate (CR). CR is
a symmetric quantity which measures mutual dependence among variables
involved. Then we further decompose the joint CR into smaller and indepen dent
CRs. Finally, we obtain a CRN factorization whose factors consist of all
singleton probabilities and CR factors. There exist two kinds of independencies
between these factors: (i) a singleton probability is independent (Here
independent means two factors do not share information.) of other singleton
probabilities; (ii) a CR factor is independent of other CR factors conditioned by
singleton probabilities. Based on a CRN factorization, we propose an efficient
two-step separate training method: (i) in the first step, we train a separate
model for each singleton probability; (ii) given singleton probabilities, we train
a separate model for each CR factor. Experimental results on three important
natural language processing tasks show that our separate training method is
two orders of magnitude faster than conditional random fields, while achieving
competitive quality (often better on the overall quality metric F1).

The second contribution of this thesis is applying PGMs to a real-world NLP
application: open relation extraction (ORE). In open relation extraction, two
entities in a sentence are given, and the goal is to automatically extract their
relation expression. ORE is a core technique, especially in the age of big data,
for transforming unstructured information into structured data. We propose
our model SimpleIE for this task. The basic idea is to decompose an extraction
pattern into a sequence of simplification operations (components). The benefit
by doing this is that these components can be re-combined in a new way to
generate new extraction patterns. Hence SimpleIE can represent and capture
diverse extraction patterns. This model is essentially a sequence labeling model.
Experimental results on three benchmark data sets show that SimpleIE boosts
recall and F1 by at least 15% comparing with seven ORE systems.

As tangible outputs of this thesis, we contribute open source implementations
of our research results as well as an annotated data set.