% \usepackage{tipa}
% \newcommand{\bibcorporate}[1]{#1} % stolen from Erik Meijer's apacite.tex
\documentclass[11pt]{article}
\usepackage{setspace}
\usepackage{graphicx}
\usepackage[all]{xy}
% \usepackage{acl07}
\usepackage{robbib}
\title{Comparison of Phonological and Syntactic Distance Measures}
\author{Nathan Sanders}
\begin{document}
\maketitle
\doublespacing
\section{Introduction}
This paper compares phonological and syntactic distances in dialectometry
using computational methods. Computational methods have
come to dominate dialectometry, but they are narrower in focus than
previous work; most have explored phonological distance only, while
earlier methods integrated phonological, lexical, and syntactic data.
However, the isogloss bundles that were the primary product of
pre-computational methods do not adequately capture gradient
generalizations; irregular boundaries are not useful for this purpose
and have to be discarded. Recent methods are mathematically more
sophisticated and capable of identifying dialect gradients. The purpose
of this work is to integrate phonological and syntactic data in the
way that early dialectology did, but with computational methods
that have previously been used to analyze only one area of linguistics
at a time.
To do this, I measured linguistic distance using the Survey of
English Dialects (SED) \cite{orton63} interview data for phonology
and the International Corpus of English (ICE) \cite{nelson02} speech
data for syntax. I used Levenshtein distance \cite{lev65} with
phonological features \cite{nerbonne97} for phonological distance, and
a permutation test based on Kessler's $R$ \cite{kessler01} for
syntactic distance \cite{nerbonne06}. I then compared the correlation
and clustering of the two results over the nine Government Office
Regions of England.
The results show how phonology and syntax contribute to dialect
boundaries, and whether these boundaries reinforce each other. If the
two distances do not contribute to the same boundaries, new dialect
areas may be apparent that were not visible in previous phonology-only
analyses.
\section{Traditional Approaches}
\subsection{S\'eguy}
Measurement of linguistic similarity has always been a part of
linguistics. However, until \namecite{seguy73} dubbed a new set of
approaches `dialectometry', these methods lagged behind the rest of
linguistics in formalization. S\'eguy's quantitative analysis
of Gascogne French, while not aided by computer, was the predecessor
of more powerful statistical methods that essentially require the use
of a computer. It also established the field's general dependence on
well-crafted dialect surveys that divide the incoming data along
traditional linguistic boundaries: phonology, morphology, syntax, and so on.
This division makes both collection and analysis easier, although it requires
much more work to produce a complete picture of dialect
variation.
The project to build the Atlas Linguistique et Ethnographique de la
Gascogne, which S\'eguy directed, collected data in a dialect survey
that asked speakers questions informed by different areas
of linguistics. For example, the pronunciation of `dog' (``chien'')
was collected to measure phonological variation. It had two common
variants, [k\~an] and [k\~a], as well as many rarer ones such as [ka],
[ko], and [kano]. These variants were, for the most part,
% or hat "chapeau": SapEu, kapEt, kapEu (SapE, SapEl, kapEl
known by linguists ahead of time, but their exact geographical
distribution was not.
The atlases, as eventually published, contained not only annotated
maps, but some analysis as well. This analysis was what S\'eguy named
dialectometry. Dialectometry differs from previous attempts to find
dialect boundaries in the way it combines the information from the
dialect survey. Previously, dialectologists looked for isogloss
boundaries for individual items and posited a dialect boundary where
enough individual isogloss boundaries coincided. However, there is so
much individual variation that only major dialect boundaries can
be captured this way.
S\'eguy reversed the process. He first combined the survey data to get
a numeric score between each pair of sites, then posited dialect boundaries
where large distances separated sites. The difference is
important, because a single numeric score is dramatically easier to
analyze than hundreds of individual boundaries. The outcome is that
much more subtle dialect boundaries are visible this way; where before
one saw only a jumble of conflicting boundary lines, now one sees
smaller but consistent numerical differences separating regions. Dialectometry
thus enables classification of gradient dialect boundaries, since weak
boundaries can now be distinguished from strong ones; previously, weak
boundaries were too uncertain to use.
However, S\'eguy's method of combination was simple both
linguistically and mathematically. When comparing two sites, any
difference in a response would be counted as 1. Only identical
responses counted as a distance of 0. Words were not analyzed
phonologically, nor were responses weighted by their relative amount
of variation. Finally, only geographically adjacent sites were
compared. This is a reasonable restriction, but later studies were
able to lift it because of the availability of greater computational
power. Work following S\'eguy's would improve on both aspects:
Ger\v{s}i\'c's linguistically and Goebl's mathematically.
\subsection{Ger\v{s}i\'c}
Just before S\'eguy's Atlas was finally published in 1973,
\namecite{gersic71} proposed a computational method for measuring
distance between two phonological segments. Segments are specified as
a vector of numeric features---consonants have a set of five features
while vowels have a different set of five features. For example, the
third consonantal feature, voice, assigns $1$ to a [+voice] segment and
$0$ to a [$-$voice] segment. Ger\v{s}i\'c then gives a function $d$ that sums
the feature differences to find the distance between two segments:
\[ d(i,j) = \sum_{k=1}^{5} |a_{ik} - a_{jk}|\] where $a_{ik}$ and $a_{jk}$
are the values of feature $k$ for the two segments being
compared. However, the equation is easier to understand as ``count the
number of features whose values differ between the segments.'' For
example, [k] compared to [g] differs in one feature, voice, so
$d(\textrm{[g]},\textrm{[k]}) = 1$. Unfortunately, $d(\textrm{[a]},
\textrm{[k]})$ is not defined, because vowels and consonants do not
share a feature set.
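For concreteness, the calculation can be sketched in Python; this is an
illustration only, and the five-element feature vectors are placeholders
rather than Ger\v{s}i\'c's actual feature assignments:
\begin{verbatim}
# Sketch of Gersic's d: sum of absolute feature differences, which for
# binary features is just the number of differing features.
def d(a_i, a_j):
    # Vowels and consonants have different feature sets, so only
    # segments of the same class are comparable.
    assert len(a_i) == len(a_j), "segments of different classes"
    return sum(abs(x - y) for x, y in zip(a_i, a_j))

k = (0, 1, 0, 1, 0)   # hypothetical vector for [k]; third feature is voice
g = (0, 1, 1, 1, 0)   # hypothetical vector for [g]; differs only in voice
print(d(k, g))        # -> 1
\end{verbatim}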
Although S\'eguy's phonological analysis of the Linguistic Atlas
of France did not use Ger\v{s}i\'c's proposal, the more complex
analysis it required became feasible as more work shifted to the computer.
\subsection{Goebl}
Later, Hans Goebl emerged as a leader in the field, formalizing
the aims and methods of dialectometry. His primary
contribution was the development of various methods to combine individual
distances into global distances and, from there, global clusters. These
methods were more sophisticated mathematically than previous
dialectometry and could operate on any features extracted from the data. His
analyses have mostly used the Atlas Linguistique de la France.
\namecite{goebl06} provides a summary of his work. Most relevant for
this paper are the measures Relative Identity Value and Weighted
Identity Value. They are general methods that are the basis for nearly
all subsequent fine-grained dialectometrical analyses. They have three
important properties. First, they are independent of the source
data. They can operate over any linguistic data for which they are
given a feature set, such as the one proposed by Ger\v{s}i\'c for
phonology. Second, they can compare data even for items that do not
have identical feature sets. This improves on Ger\v{s}i\'c's $d$,
which cannot compare consonants and vowels. Third, they can compare
data sets that are missing some entries. This improves on S\'eguy's
analysis by providing a principled way to handle missing survey
responses.
Relative Identity Value, when comparing any two items, counts the
number of features which share the same value and then discounts
(lowers) the importance of the result by the number of unshared
features. The result is a single percentage that indicates
relative similarity. Calculating this distance between all pairs
of items in two regions produces a matrix which can be used for
clustering or other purposes. Note that the presentation below splits
Goebl's original equations into more manageable pieces; the high-level
equation for Relative Identity Value is:
\begin{equation}
\frac{\textrm{identical}_{jk}} {\textrm{identical}_{jk} + \textrm{unidentical}_{jk}}
\label{riv}
\end{equation}
\noindent{}for two items $j$ and $k$ being compared. Here, \textit{identical} is
\begin{equation}
\textrm{identical}_{jk} = |f \in \textrm{\~N}_{jk} : f_j = f_k|
\end{equation}
where $f$ ranges over the features shared by $j$ and $k$ (called
$\textrm{\~N}_{jk}$). \textit{unidentical} is defined similarly, except
that it counts all features $\textrm{N}$, not just the shared features
$\textrm{\~N}_{jk}$:
\begin{equation}
\textrm{unidentical}_{jk} = |f \in \textrm{N} : f_j \neq f_k|
\end{equation}
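As an illustration only (not Goebl's own implementation), Relative
Identity Value can be sketched in Python. Items are represented as
dictionaries from feature names to values, and features missing from
either item are treated as non-identical, which is one way of reading
the definition of \textit{unidentical} above; the feature names and
values are invented:
\begin{verbatim}
def relative_identity_value(j, k):
    shared = set(j) & set(k)          # features present in both items
    every = set(j) | set(k)           # all features, N
    identical = sum(1 for f in shared if j[f] == k[f])
    unidentical = sum(1 for f in every
                      if f not in shared or j[f] != k[f])
    return identical / (identical + unidentical)

item_j = {"voice": 1, "place": "velar", "manner": "stop"}
item_k = {"voice": 0, "place": "velar", "nasal": 0}
print(relative_identity_value(item_j, item_k))   # 1 / (1 + 3) = 0.25
\end{verbatim}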
Weighted Identity Value is a refinement of Relative Identity
Value. It defines some differences as more
important than others: more information arises from
feature values that occur only a few times than from values
that characterize a large number of the items being studied. This
idea shows up later in the normalization of syntax distance given by
\namecite{nerbonne06}.
The mathematical implementation of this idea is fairly simple. Goebl
is interested in feature values that occur only a few times. If a
feature has some value that is shared by all of the items, then all
items belong to the same group. This feature value provides {\it no}
useful information for distinguishing the items. The situation
improves if all but one item share the same value for a feature; at
least there are now two groups, although the larger group is still not
very informative. The most information is available if each item
being studied has a different value for a feature; the items fall
trivially into singleton groups, one per item.
Equation \ref{wiv-ident} works by discounting
the \textit{identical} count from equation \ref{riv} by
the amount of information that feature value conveys. The
amount of information, as discussed above, is based on the number of
items that share a particular value for a feature. If all items share
the same value for some feature, then \textit{identical} will be discounted all the
way to zero--the feature conveys no useful information.
Weighted Identical Value's equation for \textit{identical} is
therefore
\begin{equation}
\textrm{identical} = \sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_{j}}{(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.
\label{wiv-ident}
\end{equation}
\noindent{}The complete definition of Weighted Identity Value is
\begin{equation} \sum_i \frac{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.}
{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right. + |f \in \textrm{N} : f_j \neq f_k|}
\label{wiv-full}
\end{equation}
\noindent{}where $\textrm{agree}f_{j}$ is the number of candidates that agree
with item $j$ on feature $f$ and $Ni$ is the total number of
candidates ($w$ is the weight, discussed below). Because of the
piecewise definition of \textit{identical}, $\textrm{agree}f_{j}$ is always at
least $1$, since $f_k$ already agrees with $f_j$ whenever the second
case applies. The effect of
this equation is to take the count of shared features and weight
each one by the size of its sharing group: features that are shared
with a large number of other items have a larger fraction of the normal
count subtracted.
For example, let $j$ and $k$ be sets of productions for the
underlying English segment /s/. The allophones of /s/ vary mostly on the feature
\textit{voice}. Seeing an unvoiced [s] is less ``surprising'' than
seeing a voiced [z] for /s/, so the discounting process should
reflect this. For example, assume that an English corpus contains 2000
underlying /s/ segments. If 500 of them are realized as [z], the
discounting for \textit{voice} will be as follows:
\begin{equation}
\begin{array}{c}
identical_{/s/\to[z]} = 1 - 500/2000 = 1 - 0.25 = 0.75 \\
identical_{/s/\to[s]} = 1 - 1500/2000 = 1 - 0.75 = 0.25
\end{array}
\label{wiv-voice}
\end{equation}
Each time /s/ surfaces as [s], it only receives 1/4 of a point
toward the agreement score when it matches another [s]. When /s/
surfaces as [z], it receives three times as much for matching
another [z]: 3/4 points towards the agreement score. If the
alternation is even more weighted toward faithfulness, the ratio
changes even more; if /s/ surfaces as [z] only 1/10 of the time,
then [z] receives 9 times more value for matching than [s] does.
The final value, $w$, which is what gives the name ``weighted
identity value'' to this measure, provides a way to control how much
is discounted. A high $w$ will subtract more from uninteresting
groups, so that \textit{voice} might be worth less than
\textit{place} for /t/ because /t/'s allophones vary more over
\textit{place}. In equation \ref{wiv-voice}, $w$ is left at 1 to
facilitate the presentation.
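The discounting step can be sketched as follows; the counts mirror the
/s/ example in equation \ref{wiv-voice}, and the weight $w$ is again
left at 1:
\begin{verbatim}
def match_credit(agree_count, total, w=1):
    # Credit earned by one matching feature value, discounted by how
    # common that value is among the candidates.
    return 1 - agree_count / (total * w)

total = 2000                       # underlying /s/ tokens in the corpus
print(match_credit(1500, total))   # matching [s]: 1 - 0.75 = 0.25
print(match_credit(500, total))    # matching [z]: 1 - 0.25 = 0.75
\end{verbatim}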
% NOTE: This is a lot like avoiding excessive neutralization in a
% paradigm---you don't want to overdo neutralization because otherwise
% the underlying form becomes quite distant from the surface form.
\section{Computational Approaches}
It is at this point that the two types of analysis, phonological and
syntactic, diverge. Although Goebl's techniques are general enough to
operate over any set of features that can be extracted, better results
can be obtained by specializing the general measures above to take
advantage of properties of the input. Specifically, the application
of computational linguistics to dialectometry beginning in the 1990s
introduced methods from other fields. These methods, while generally
giving more accurate results quickly, are tied to the type of data on
which they operate.
\subsection{Phonological Distance}
Phonological distance is a good example; algorithms that can take
advantage of the ordered, regular properties of phonology will produce
better results in less time. \namecite{kessler95} introduced
Levenshtein distance \cite{lev65} to dialectometry in order to
determine distance between individual words. Then he summed the
individual word distances to find distances between dialect areas.
Levenshtein distance is a simple idea---count the
number of differences between two strings. The intent is the same as
Goebl's Relative Identity Value, using single characters of a word as
features. However, Levenshtein distance uses the property that strings
are an ordered sequence of characters. Levenshtein's algorithm solves
the non-trivial problem of determining the best set of correspondences
between two strings. It effectively considers all possible alignments,
using dynamic programming to avoid recomputing shared subproblems. The
alignment the algorithm generates is guaranteed to have the most
correspondences possible between the two input strings.
Levenshtein distance is in principle applicable to any
ordered sequence and as such has been used in many fields
\cite{sankoff83}. In particular, its discovery of correspondences makes it
well suited to phonology.
The Levenshtein distance algorithm models alignment as the series of
changes necessary to convert the first string to the second. This
means that input to the algorithm includes the cost of changes in
addition to the two strings. Three operations are typically
specified: insertion, deletion, and substitution. Others have been
proposed for use in phonology; see \cite{kondrak02} for examples and
an analysis of them in terms of efficiency and context-sensitivity. Given two
strings and three operations, the algorithm calculates the number of
each operation necessary to convert the first string to the
second. Finding the lowest total operation cost produces the highest
number of correspondences.
In a character-based model, insertion and deletion are the primitive operations,
with a cost of one each. Substitution is one insertion and one
deletion, giving it a cost of two. However, substitution of a character
for itself changes nothing and thus has zero cost.
%Given these functions, the
%Levenshtein algorithm will return the minimum number of insertions and
%deletions necessary for transforming the source to the target.
The formal definition of the functions $ins$, $del$, and $sub$
for characters is
\begin{equation}
\begin{array}{l}
ins(t_j) = 1 \\
del(s_i) = 1 \\
sub(s_i,t_j) = \left\{
\begin{array}{ll}
0 & \textrm{if $s_i=t_j$} \\
2 & \textrm{otherwise}
\end{array} \right.
\end{array}
\end{equation}
Now, for each character $s_i \in S$ and $t_j \in T$ of any strings $S$ and $T$,
\begin{equation}
levenshtein(s_i,t_j) = \min \left(
\begin{array}{l}
del(s_i)+levenshtein(s_{i-1},t_j), \\
ins(t_j)+levenshtein(s_i,t_{j-1}), \\
sub(s_i,t_j)+levenshtein(s_{i-1},t_{j-1})
\end{array} \right)
\label{levequation}
\end{equation}
The total distance between $S$ and $T$ is $levenshtein(S_{|S|},T_{|T|})$.
This recursive algorithm finds the distance between two strings in
terms of the cheapest of three operations on their substrings.
For example, with this algorithm, the distance between \textit{sick} and
\textit{tick} is two---a substitution of t for s. The distance between
\textit{dog} and \textit{dog} is zero. The distance between
\textit{dog} and \textit{cat} is six because none of the characters
correspond, so three insertions and three deletions are needed to
convert \textit{dog} to \textit{cat}. Not all calculations are so
easy. The distance between \textit{realty} and \textit{reality} is a
single insertion. The distance between \textit{nonrelated} and
\textit{words} is 9, not $15 =10 + 5$, because the characters `o-r-d'
are in correspondence.
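For concreteness, a minimal character-based sketch of equation
\ref{levequation} in Python, using the usual dynamic-programming table
rather than naive recursion, reproduces these examples:
\begin{verbatim}
# Insertion and deletion cost 1; substitution costs 2 unless the
# characters are identical, in which case it costs 0.
def levenshtein(s, t):
    m, n = len(s), len(t)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i                              # delete all of s[:i]
    for j in range(1, n + 1):
        dist[0][j] = j                              # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + sub)   # substitution
    return dist[m][n]

print(levenshtein("sick", "tick"))         # 2
print(levenshtein("dog", "cat"))           # 6
print(levenshtein("realty", "reality"))    # 1
print(levenshtein("nonrelated", "words"))  # 9
\end{verbatim}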
\subsubsection{Heeringa}
\label{levmethod}
Variations on this theme have been explored and best described by
\namecite{heeringa04}. Starting with \cite{nerbonne97}, he augmented
the distance definition so that substitution cost was based on the
number of phonological features that differ between segments. This provides more
precision and is based on phonological theory. He also tried
weighting features by information gain. In his dissertation
\cite{heeringa04}, he tried a more phonetic method, defining the
substitution cost between segments as the total distance between the
first two formants, as measured in barks. However, because he did not
have access to the original speech of the dialect data he was
measuring, this substitution cost was uniform across all instances of
a particular segment.
The most direct way to refine Levenshtein distance to take advantage
of linguistic knowledge is by changing the definitions of $ins$,
$del$, and $sub$ to take into account phonetic and phonological
properties of segments. When segments are treated as feature bundles
instead of merely being atomic, \namecite{nerbonne97} propose that the
substitution distance between two segments simply be the number of
features that are different. Two identical segments will therefore
have a substitution distance of zero, while phonetically similar segments
will have a small distance. For example, [k] and [g] would have a
distance of 1 in this system:
\[ (\textrm{velar}=\textrm{velar} \land{} \ldots \land{} \textrm{+voice}\neq \textrm{$-$voice}) \to (0
+ \ldots + 1) = 1\]
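A minimal sketch of this feature-bundle substitution cost, with
illustrative binary feature bundles rather than a full phonological
feature system:
\begin{verbatim}
def feature_sub(seg1, seg2):
    # Number of features whose values differ between the two bundles.
    return sum(1 for f in seg1 if seg1[f] != seg2[f])

k = {"velar": 1, "stop": 1, "voice": 0}
g = {"velar": 1, "stop": 1, "voice": 1}
s = {"velar": 0, "stop": 0, "voice": 0}
print(feature_sub(k, g))   # 1: only voice differs
print(feature_sub(k, s))   # 2: velar and stop differ
\end{verbatim}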
\subsubsection{Shackleton}
For his analysis of English dialects, \namecite{shackleton07} uses
numeric rather than binary features. This allows for relative
weighting of features, so that [k] and [t] are more similar than [k]
and [p]. Shackleton's feature set is very similar to the one proposed by
\namecite{gersic71}, although he includes some features specifically
designed for variation known to exist in English dialects. Most
importantly, he defines different features for vowels, consonants and
vowel-following rhotics. The features are given in table
\ref{featureset}.
\begin{table}
\begin{tabular}{c|lr}
 & Feature & Range \\ \hline
Vowel & Height & 1.0 -- 7.0 \\
& Backing & 1.0 -- 3.0 \\
& Rounding & 1.0 -- 2.0 \\
& Length & 0.5 -- 2.0 \\ \hline
Consonant & Fricative & 0.0 -- 1.0 \\
& h/wh & 0.0 -- 1.0 \\
& Glottal Stop &0.0 -- 1.0 \\
& Velar & 0.0 -- 2.0 \\
& Other & 0.0 -- 1.0 \\ \hline
Rhotic & Place & 1.0 -- 3.0 \\
& Manner & 0.0 -- 4.0 \\
\end{tabular}
\caption{Feature Set used by Shackleton (2007)}
\label{featureset}
\end{table}
Although it increases precision, feature-based substitution causes a
number of complications. The first is that substitution distance
($sub$) must distinguish between vowels, consonants and rhotics, since
they do not share features. When the segments are of different types,
substitution distance is instead the total
number of unshared features, so a vowel and a consonant always have a
distance of 9, because the feature set in table \ref{featureset} gives
consonants 5 features and vowels 4. As before, if both segments
are of the same type, the individual feature differences are summed. For
example, [a] and [e] have the following difference: \[ |1-5| + |2-1| +
|1-1| + |1-1| = 5 \]
The second complication is obtaining definitions for $ins$ and $del$
once $sub$ is defined. It would be best to retain the original
proportions---substitution should cost twice as much as insertion and
deletion. To deal with substitution's variable cost, then, insertion and
deletion should be averages. To find the average substitution cost, one can
take the average cost of substituting every character
for every other character. Then $ins$ and $del$ return half of this
average. With these three functions defined as in equation
\ref{indelsub}, the table-based algorithm
given above can combine feature distances to find the minimum word distance.
\begin{equation}\begin{array}{l}
ins(t_j) = \overline{sub} / 2 \\
del(s_i) = \overline{sub} / 2 \\
sub(s_i,t_j) = \left\{
\begin{array}{ll}
|s_i|+|t_j| & \textrm{if $C(s_i) \ne C(t_j)$} \\
\sum_{f_s,f_t \in s_i,t_j} |f_s - f_t| & \textrm{otherwise}
\end{array} \right.
\end{array}
\label{indelsub}
\end{equation}
Here, $f_s$ and $f_t$ are the features of the segments $s_i$ and $t_j$
being compared, $C$ is the consonantal property of a segment, and
$\overline{sub}$ is the average substitution cost for all segments,
defined in equation \ref{avgsub}.
\begin{equation}
\overline{sub} = \sum_{w_i,w_j \in W_1\times W_2}\Big(\sum_{s_i,s_j \in w_i
\times w_j}sub(s_i,s_j) / |w_i\times w_j|\Big) / |W_1\times W_2|
\label{avgsub}
\end{equation}
In this equation, $W_1$ and $W_2$ are the sets of words in the two data
sets being compared. All the different substitutions of segments for all the
different combinations of words in the data are tried and then averaged.
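A rough sketch of this scheme, with the vowel feature values for [a]
and [e] taken from the example above and an invented consonant vector;
for brevity the average is taken over all segment pairs in a small
inventory, a simplification of equation \ref{avgsub}:
\begin{verbatim}
from itertools import combinations

def sub(s, t):
    # Cross-class substitutions cost the total number of features;
    # otherwise sum the absolute feature differences.
    if s["class"] != t["class"]:
        return len(s["features"]) + len(t["features"])
    return sum(abs(a - b) for a, b in zip(s["features"], t["features"]))

def average_sub(segments):
    pairs = list(combinations(segments, 2))
    return sum(sub(s, t) for s, t in pairs) / len(pairs)

a = {"class": "V", "features": (1.0, 2.0, 1.0, 1.0)}
e = {"class": "V", "features": (5.0, 1.0, 1.0, 1.0)}
k = {"class": "C", "features": (0.0, 0.0, 0.0, 2.0, 0.0)}
print(sub(a, e))                   # 5, as in the example above
print(sub(a, k))                   # 9: vowel versus consonant
print(average_sub([a, e, k]) / 2)  # cost charged for ins and del
\end{verbatim}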
\subsection{Syntactic Distance}
% 1st sentence is awkward (and 2nd too a little)
Recently, computational dialectometry has expanded to analysis of
syntax as well. The first work in this area was \quotecite{nerbonne06}
analysis of Finnish L2 learners of English, followed by
\quotecite{sanders07} analysis of British dialect areas. Syntax
distance must be approached quite differently from phonological
distance. Syntactic data is extractable from raw text, so it is much
easier to build a syntactic corpus. But this implies an associated
drop in manual linguistic processing of the data. As a result, the
principal difference between present phonological and syntactic
corpora is that phonological data is word-aligned, while syntactic data
is not aligned, even at the sentence level.
\subsubsection{Nerbonne and Wiersma}
\label{nerbonne06}
Because the larger corpora available for syntactic analysis are
unaligned, a statistical comparison of differences is more appropriate
than the simple symbolic approach possible with the word-aligned
corpora used in phonology; a syntactic distance measure has to use
counting as its basis by default.
\namecite{nerbonne06} proposed an early method for syntactic
distance. It models syntax by part-of-speech (POS) trigrams and uses
the trigram types in a permutation test of significance. This method was
extended by \namecite{sanders07}, who used \quotecite{sampson00}
leaf-ancestor paths as the basis for building the model instead.
The heart of the measure is simple: the sum of differences in type
counts over the combined types of two corpora. \namecite{kessler01}
originally proposed this measure, the {\sc Recurrence}
metric ($R$):
\begin{equation}
R = \sum_i |c_{ai} - c_{bi}|
\label{rmeasure}
\end{equation}
\noindent{}Given two corpora $a$ and $b$, $c_a$ and $c_b$ are the type
counts. $i$ ranges over all types, so $c_{ai}$ and $c_{bi}$ are the
type counts for type $i$. $R$ is designed to represent the amount of
variation exhibited by the two corpora while allowing the contribution
of individual types to be extracted simply.
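For concreteness, $R$ can be sketched in Python with counters whose
keys stand in for types; the counts are invented:
\begin{verbatim}
from collections import Counter

def recurrence(counts_a, counts_b):
    # Sum of absolute differences in type counts over all types.
    types = set(counts_a) | set(counts_b)
    return sum(abs(counts_a[t] - counts_b[t]) for t in types)

a = Counter({"S-NP-N": 12, "S-VP-V": 9, "S-NP-Det": 4})
b = Counter({"S-NP-N": 10, "S-VP-V": 14})
print(recurrence(a, b))   # |12-10| + |9-14| + |4-0| = 11
\end{verbatim}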
To account for differences in corpus size, repeated sampling is
used. In addition, the samples are normalized to account for
differences in sentence length. Unfortunately, even normalized, the
measure does not indicate whether its results are significant; a
permutation test is needed for that.
% Other ideas include training a
% model on one area and comparing the entropy (compression) of other
% areas. At this point it's unclear whether this would provide a
% comparable measure, however.
\subsubsection{Language models}
Part-of-speech (POS) trigrams are quite easy to obtain from a syntactically
annotated corpus. \namecite{nerbonne06} argue that POS trigrams
can accurately represent at least the important parts of syntax,
similar to the way chunk parsing can capture the most important
information about a sentence. POS trigrams can either be generated by
a tagger as Nerbonne and Wiersma did, or taken from the leaves of
the trees of a parsed corpus.
On the other hand, it might be better to directly represent the upper
structure of trees. \quotecite{sampson00} leaf-ancestor paths provide
one way to do this: leaf-ancestor paths produce for each leaf in the
tree the path from that leaf back to the root. Generation is
simple as long as every sibling is unique. For example, the parse tree
\[\xymatrix{
&&\textrm{S} \ar@{-}[dl] \ar@{-}[dr] &&\\
&\textrm{NP} \ar@{-}[d] \ar@{-}[dl] &&\textrm{VP} \ar@{-}[d]\\
\textrm{Det} \ar@{-}[d] & \textrm{N} \ar@{-}[d] && \textrm{V} \ar@{-}[d] \\
\textrm{the}& \textrm{dog} && \textrm{barks}\\}
\]
creates the following leaf-ancestor paths:
\begin{itemize}
\item S-NP-Det-the
\item S-NP-N-dog
\item S-VP-V-barks
\end{itemize}
For identical siblings, brackets must be inserted in the path to
disambiguate the first sibling from the second. The process is not
described here because the details are incidental to the main idea and
in any case identical siblings are somewhat rare.
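A minimal sketch of path extraction for trees with unique siblings,
using nested tuples as a stand-in tree representation and omitting the
bracketing scheme:
\begin{verbatim}
def leaf_ancestor_paths(tree, ancestors=()):
    label, *children = tree
    if not children:                  # a leaf: emit its path to the root
        return ["-".join(ancestors + (label,))]
    paths = []
    for child in children:
        paths.extend(leaf_ancestor_paths(child, ancestors + (label,)))
    return paths

tree = ("S",
        ("NP", ("Det", ("the",)), ("N", ("dog",))),
        ("VP", ("V", ("barks",))))
print(leaf_ancestor_paths(tree))
# ['S-NP-Det-the', 'S-NP-N-dog', 'S-VP-V-barks']
\end{verbatim}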
Sampson originally developed leaf-ancestor paths as an improved
measure of similarity between gold-standard and machine-parsed trees,
to be used in evaluating parsers. The basic idea of a collection of
features that capture distance between trees transfers quite nicely to
this application. \namecite{sanders07} replaced POS trigrams with
leaf-ancestor paths for the ICE corpus and found improved results on
larger corpora. However, with leaf-ancestor paths, smaller corpora are
less likely to attain significance than with POS trigram features.
% Another idea is supertags rather than leaf-ancestor paths. This is
% quite similar but might work better.
\subsubsection{Permutation test}
\label{permutationtest}
A permutation test detects whether two corpora are significantly
different. It does this on the basis of the $R$ measure described in
section \ref{nerbonne06}. The test first calculates $R$ between
samples of the two corpora. Then the corpora are mixed together and
$R$ is calculated between two samples drawn from the mixed
corpus. If the two corpora are different, $R$ should be larger between
the samples of the original corpora than between samples of the mixed
corpus, because any real differences are randomly redistributed by the
mixing process, lowering the $R$ of the mixed samples. Repeating this
comparison enough times shows whether the difference is significant.
Twenty repetitions is the minimum needed to detect significance at
$p < 0.05$, although the test is repeated one thousand times in the
experiments.
For example, assume that $R$ detects real differences between London
and Scotland such that $R(\textrm{London},\textrm{Scotland}) =
100$. The permutation test then mixes the London and Scotland corpora to
create LSMixed and splits it into two pieces. Since the real
differences are now mixed between the two shuffled corpora, we
would expect $R(\textrm{LSMixed}_1, \textrm{LSMixed}_2) < 100$, something
like 90 or 95. This should be true at least 95\% of the time if the
differences are significant.
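A minimal sketch of this test, treating each corpus as a flat list of
feature types (POS trigrams or leaf-ancestor paths) and ignoring the
sampling and normalization steps; the London and Scotland counts are
invented stand-ins:
\begin{verbatim}
import random
from collections import Counter

def recurrence(a, b):                # R, as in the earlier sketch
    return sum(abs(a[t] - b[t]) for t in set(a) | set(b))

def permutation_test(corpus_a, corpus_b, trials=1000):
    # Fraction of shuffles whose R reaches the observed R (a p-value).
    observed = recurrence(Counter(corpus_a), Counter(corpus_b))
    pooled = list(corpus_a) + list(corpus_b)
    exceed = 0
    for _ in range(trials):
        random.shuffle(pooled)
        half_a = Counter(pooled[:len(corpus_a)])
        half_b = Counter(pooled[len(corpus_a):])
        if recurrence(half_a, half_b) >= observed:
            exceed += 1
    return exceed / trials

london = ["S-NP-N"] * 40 + ["S-VP-V"] * 60
scotland = ["S-NP-N"] * 70 + ["S-VP-V"] * 30
print(permutation_test(london, scotland, trials=200))
\end{verbatim}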
\subsubsection{Normalization}
Afterward, the distance must be normalized to account for two things:
the length of sentences in the corpus and the amount of variety in the
corpus. If sentence length differs too much between corpora, there
will be consistently lower token counts in one corpus, which would
cause a spuriously large $R$. In addition, if one corpus has less
variety than the other, it will have inflated type counts, because
more tokens will be allocated to fewer types. To avoid
this, all token counts are scaled by the average number of types per token
across both corpora: $2n/N$ where $n$ is the type count and $N$ is
the token count. The additional factor $2$ is necessary because we are
recombining the tokens from the two corpora.
\subsection{Analysis}
The results of the two distance methods can be compared using
binary hierarchical clustering. The resulting dendrogram allows the two
methods to be compared visually.
Correlation is also useful to find out how similar the two methods'
predictions are. Because the inter-region distances are not independent
of one another, Mantel's test is necessary to establish that the correlation
is significant. Mantel's test is a permutation test, similar to the
permutation test described for $R$. One distance result set is
permuted repeatedly and at each step correlated with the other
set. The correlation is significant if it is larger than the permuted
correlations more than 95\% of the time.
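A minimal sketch of a generic Mantel test over two symmetric distance
matrices, given here as an illustration rather than the exact procedure
used for the results below; the toy matrices are invented:
\begin{verbatim}
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def upper(m):
    return [m[i][j] for i in range(len(m)) for j in range(i + 1, len(m))]

def mantel(dist_a, dist_b, trials=1000):
    # Permute rows and columns of one matrix in tandem and see how often
    # the permuted correlation reaches the observed correlation.
    observed = pearson(upper(dist_a), upper(dist_b))
    n = len(dist_a)
    exceed = 0
    for _ in range(trials):
        p = list(range(n))
        random.shuffle(p)
        shuffled = [[dist_b[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if pearson(upper(dist_a), upper(shuffled)) >= observed:
            exceed += 1
    return observed, exceed / trials

phon = [[0, 2, 5], [2, 0, 4], [5, 4, 0]]   # toy phonological distances
synt = [[0, 1, 6], [1, 0, 5], [6, 5, 0]]   # toy syntactic distances
print(mantel(phon, synt, trials=500))
\end{verbatim}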
\section{Experiment}
The two methods described in the previous section, Levenshtein
distance for phonology and a permutation test based on {\sc
Recurrence} for syntax, were used to analyze British dialect data.
These data come from two sources. The definitive phonological data
source is the Survey of English Dialects (SED) \cite{orton63}, dialect
data collected in the 1950s. This corpus consists mostly of
phonological features of older rural speakers, aiming to capture
dialect boundaries that would erode and mutate rapidly in the second
half of the 20th century. Recently, \namecite{shackleton07} carried
out a comprehensive computational study of the SED data. He focused
especially on vowels because of the amount of variation known to exist
in English vowels. I used his
data set organized by Government Office Region rather than county in
order to match the organization of the syntactic results below.
For syntax, I used the International Corpus of English (ICE)
\cite{nelson02}. The ICE is a corpus of syntactically annotated
conversations and speeches collected in the 1990s. I divided the
corpus into 11 sub-corpora of British Government Office Regions (GORs)
based on the recorded place of birth for each speaker. The division
into Government Office Regions was necessary in order to have enough
data in each sub-corpus, instead of the traditional, but much smaller,
counties. Nine GORs form England; Scotland and Wales each count as a
single GOR in this grouping. Only English data are available in the
SED, so the final comparison does not include Scotland and
Wales. However, they are shown in the intermediate syntactic results.
\subsection{Levenshtein Distance Experiment}
\namecite{shackleton07} chose a 55-word subset of the words elicited
by the SED survey. The subset focuses on words containing vowels and
consonants known to vary in England. Using this data set, I grouped
the speakers in the survey into their respective Government Office
Regions and compared all speakers of a region to all speakers of every
other region. The distances between speakers were then averaged for each pair of regions.
This produced a fully connected graph of the nine English GORs, with
the edges weighted by phonological distance.
\subsection{Syntax Distance Experiment}
Trigrams and leaf-ancestor paths were produced from the ICE corpus
grouped into GORs by speaker birthplace. Parameters for the experiment
varied sample size between 500 and 1000, $R$ measure between $r$ and
$r^2$, and input data type between trigram and leaf-ancestor path. To
test that the results from a particular parameter setting were valid,
I ran a preliminary test on London and Scotland. First, I
compared London and Scotland, which have several syntactic
differences already cataloged by linguists \cite{aitken79}. Then I
mixed the two together and split the mixed corpus into two corpora the
size of the originals. If the two locations are properly identified as
different, London and Scotland should be significantly different with
$p < 0.05$, while the shuffled impostors should not be.
This was true for only one parameter setting: a distance measure of
$r$, a sample size of 1000 and leaf-ancestor paths as input. Other
parameter settings did not find London and Scotland to be
significantly different, indicating that the corpora were too small. A
few parameter settings were almost significant, with $p < 0.10$,
however. As a result, I ran the rest of the experiment only for the
parameter setting that produced significant results. Here, all but 7
of the 55 connections between the 11 GORs were significant.
\section{Results}
\begin{figure}
\includegraphics{sed_dendrogram}
\caption{Phonological distance cluster}
\label{phonology-dendrogram}
\end{figure}
For the phonological results, Levenshtein distance nicely clusters the
GORs into two main branches, North and South, shown in figure
\ref{phonology-dendrogram}. In the North, Yorkshire and Northwest
clustered fairly tightly, followed by the East Midlands and the
Northeast. In the South, East England, London and the Southeast formed
a cluster that likely reflects the greater London area, followed by
the Southwest and West Midlands.
For the syntactic results, the permutation test over $R$ showed that
distances between all regions were significant except between the
region pairs listed in table \ref{syntax-nonsig}.
\begin{table}
\begin{tabular}{cc}
Yorkshire & East Midlands \\ \hline
East Midlands & London \\\hline
London & Southeast \\\hline
Southeast & Southwest \\\hline
Northwest & London\\\hline
Northwest & Southeast \\\hline
\end{tabular}
\caption{Region pairs with no significant $R$ difference}
\label{syntax-nonsig}
\end{table}
\begin{figure}
\includegraphics{ice_dendrogram}
\caption{Syntactic distance cluster}
\label{syntax-dendrogram}
\end{figure}
The dendrogram produced by clustering $R$ distance (figure
\ref{syntax-dendrogram}) is harder to interpret because some of the
distances are not significant. There are two clear groups, but the
cluster on the left is not necessarily as tight as it appears because
many of the connections are based on distances that are not
significant, such as the East Midlands/London cluster. However, the
two major clusters still hold because of the significant inter-cluster
distances found between individual members.
There is generally a North/South distinction in the syntactic
clustering as well, although it is muddled a bit because the East of
England groups with the northern regions, and the Northwest groups
with southern regions. It is not clear why the Northwest should be so
close to the Southwest, or the East of England so far from London and
the Southeast.
Two correlations were carried out using Pearson's $r$ and Mantel's test of
significance. The first was a correlation of $R$ with Levenshtein
distance, and the second was a correlation of $R$ with corpus
size. However, neither correlation was significant.
\section{Discussion}
On the whole, the two methods of comparing distance, $R$ and
Levenshtein distance, produce differing results---there is no
significant correlation between the two methods. Although both show
something like a North/South distinction, that distinction is much
clearer in the phonological results. This lack of agreement is a
preliminary answer to the question of this paper: whether multiple
ways of measuring linguistic distance give the same results.
There are at least five reasonable explanations for the difference
between the two distance measures.
% First, and
% least satisfying, is the possibility that one of the distances is not
% measuring what it is supposed to. Second, the corpora may not agree
% because of the 40 year difference in age and differing collection
% methodologies. Third, syntactic and phonological dialect markers may
% not share the same boundaries.
\begin{enumerate}
\item One or both of the distances does not measure what it is supposed to.
\item The two corpora may not agree on dialect boundaries because of
their 40-year difference in age.
\item Place of birth, as recorded in the ICE, may not correlate well
with spoken dialect, especially given variations in speaker
education level and place of residence.
\item Dialect boundaries may appear from systematic variation in
annotation practices rather than the speech.
\item Syntactic and phonological dialect boundaries may be different.
\end{enumerate}
Of these, the last is the most interesting, because previous work would
not have exposed this difference. Traditional dialectology focuses
on strong agreement among a few features from each collection
site. Because syntactic features are fewer in number than phonological
ones, they are under-represented in this type of
analysis. Unfortunately, this means that the syntactic contribution to
isogloss bundles is correspondingly reduced. In addition, because
isogloss bundles are insensitive to rare variations that disagree with
the bulk of the data, syntactic features rarely contribute to
the isogloss bundles that define successful dialect boundaries.
% I really need to CITE this.
In contrast, computational analysis, such as \cite{shackleton07},
captures feature variation precisely using statistical analysis and
sophisticated algorithms. The resulting analysis displays dialects as
gradient phenomena, revealing much more complexity than the
corresponding isogloss analysis. But current specialized computational
methods only apply to phonology. Syntactic data cannot be analyzed
without a syntax-specific method.
This paper attempts to address that lack, and provide some first steps
to show whether syntax and phonology assist each other in establishing
dialects, or whether their dialect regions are unrelated. If they are not
related, and syntactic gradients can be as weak as phonological ones,
then some new dialect regions may become apparent that were not visible in
previous phonology-only analyses.
\subsection{Future Work}
This paper is, however, only a beginning. There are a number of ways
to improve upon the work presented here. The obvious criticism of the
syntax distance used here is that it requires so much data that the
results are no more detailed than those available from traditional
methods. Its precision lags badly behind that of phonological distance methods. The
first priority should be a syntax distance method that works with
smaller corpora. Such a method would allow the current experiment to
be re-run on counties rather than Government Office Regions, resulting
in more precise induction of dialect areas.
Another problem with the current study is the 40-year difference in
collection dates between the phonological corpus and the syntactic
corpus. A recent phonological corpus would likely show the same sort
of changes in the North/South divide that show up in the syntactic
corpus. The British population became more mobile during the second
half of the 20th century, and the SED survey explicitly attempted to capture
the dialects that existed before this happened \cite{orton78}.
It would also be useful to have data from the rest of the United Kingdom for
comparison, or at least Scotland and Wales as with the ICE.
% TODO: integrate this.
% Alternatively, I could just look at the region pairs that fail to
% achieve significance in the syntactic permutation test and check to
% see if their phonological distance is lower than the other pairs. I
% don't do this (yet).
One interesting question is
how phonological and syntactic distances correlate with geographic
distance---\namecite{gooskens04a} shows that often the correlation is
very good. This would also allow better visualization of dialect areas
than a hierarchical dendrogram.
% \item Some ideas include using supertags instead of leaf-ancestor
% paths, perhaps automatically extracted from the manual parse.
% \item Another idea is comparing entropy by compression. This seems
% pretty promising to me.
\bibliographystyle{robbib} % ACL is \bibliographystyle{acl}
\bibliography{central}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: