\chapter{Methods}
\label{methods-chapter}
This chapter contains four sections. The first discusses related work
in dialectometry since its development in the middle of the 20th century,
starting with \namecite{seguy73} and continuing with
\namecite{goebl06}. The second section discusses the previous work on
statistical methods for syntactic dialectometry, as well as the
feature sets and distance measures developed in this dissertation. The
third and fourth sections cover the application of the work to Swedish
dialects. Specifically, the third section deals with input analysis:
which Swedish corpora were used and which annotators served as a basis for
extracting features. The fourth section covers output analysis,
detailing the methods used to process the distances between interview
sites so that they can be compared to the dialectology literature.
\section{Related Work}
\subsection{S\'eguy}
Measurement of linguistic similarity has always been a part of
linguistics. However, until \namecite{seguy73} dubbed a new set of
approaches `dialectometry', these methods lagged behind the rest of
linguistics in formality. S\'eguy's quantitative analysis of Gascogne
French, while not aided by computer, was the predecessor of more
powerful statistical methods that essentially require computers. It
also established the field's general dependence on well-crafted
dialect surveys that divide incoming data along traditional linguistic
boundaries: phonology, morphology, syntax, etc.
This makes both collection and analysis easier, although it requires
more work to combine separate analyses to produce a complete picture of dialect
variation.
The project to build the Atlas Linguistique et Ethnographique de la
Gascogne, which S\'eguy directed, collected data in a dialect survey
of Gascogne which asked speakers questions informed by different areas
of linguistics. For example, the pronunciation of `dog' ({\it chien})
was collected to measure phonological variation. It had two common
variants and many other rare ones: [k\~an], [k\~a], as well as [ka],
[ko], [kano], among others. These variants were, for the most part,
% or hat "chapeau": SapEu, kapEt, kapEu (SapE, SapEl, kapEl
known by linguists ahead of time, but their exact geographical
distribution was not.
The atlases, as eventually published, contained not only annotated
maps, but some analyses as well. These analyses were what S\'eguy named
dialectometry. Dialectometry differs from previous attempts to find
dialect boundaries in the way it combines information from the
dialect survey. Previously, dialectologists found isogloss
boundaries for individual items. A dialect boundary was generated when
enough individual isogloss boundaries coincided. However, for any real
corpus, there is so
much individual variation that only major dialect boundaries can
be captured this way.
S\'eguy reversed the process. He first combined survey data to get
a numeric score between each site. Then he posited dialect boundaries
where large distances resulted between sites. The difference is
important, because a single numeric score is easier to
analyze than hundreds of individual boundaries.
Much more subtle dialect boundaries are visible this way; where before
one saw only a jumble of conflicting boundary lines, now one sees
smaller, but consistent, numerical differences separating regions.
Dialectometry enables classification of gradient dialect boundaries,
since weak and strong boundaries can now be distinguished; previously,
weak boundaries were too uncertain to identify reliably.
However, S\'eguy's method of combination is simple both
linguistically and mathematically. When comparing two sites, any
difference in a response is counted as 1. Only identical
responses count as a distance of 0. Words are not analyzed
phonologically, nor are responses weighted by their relative amount
of variation. Finally, only geographically adjacent sites are
compared. This is a reasonable restriction, but later studies were
able to lift it because of the availability of greater computational
power. Work following S\'eguy's improves on both of these aspects. In
particular, Hans Goebl developed dialectometry models that are more
mathematically sophisticated while retaining the survey-style small
feature set.
\subsection{Goebl}
Hans Goebl emerged as a leader in the field of dialectometry,
formalizing its aims and methods. His primary contribution was the
development of various methods to combine individual distances into
global distances and global distances into global clusters. These
methods were more sophisticated mathematically than previous
dialectometry and operated on any features extracted from the data. His
analyses have primarily used the Atlas Linguistique de la France.
\namecite{goebl06} provides a summary of his work. Most relevant
are the measures Relative Identity Value and Weighted
Identity Value. They are general methods that are the basis for nearly
all subsequent fine-grained dialectometrical analyses. They have three
important properties. First, they are independent of the source
data. They can operate over any linguistic data for which they are
given a feature set, such as the one proposed by \namecite{gersic71} for
phonology. Second, they can compare data even for items that do not
have identical feature sets, unlike Ger\v{s}i\'c's measure $d$, for example,
which cannot compare consonants and vowels. Third, they can compare
data sets that are missing some entries. This improves on S\'eguy's
analysis by providing a principled way to handle missing survey
responses.
Relative Identity Value, when comparing any two items, counts the
number of features which share the same value and then discounts
(lowers) the importance of the result by the number of unshared
features. The result is a single percentage that indicates relative
similarity. This percentage, when measured between all pairs of sites
in a corpus, can be scaled to produce a dissimilarity. Note that the
presentation below splits Goebl's original equations into more
manageable pieces; the high-level equation for Relative Identity Value
is:
\begin{equation}
\frac{\textrm{identical}_{jk}} {\textrm{identical}_{jk} + \textrm{unidentical}_{jk}}
\label{riv}
\end{equation}
\noindent{}for two items $j$ and $k$ being compared. In this case,
\textit{identical} is
\begin{equation}
\textrm{identical}_{jk} = |f \in \textrm{\~N}_{jk} : f_j = f_k|
\end{equation}
where $\textrm{\~N}_{jk}$ is the set of features shared by $j$ and
$k$. In other words, of the total universe of features $N$, both $j$
and $k$ must contain the feature for it to be included in
$\textrm{\~N}_{jk}$. So if a feature occurs only in $j$ but not in
$k$, it will be included in $N$, but not in $\textrm{\~N}_{jk}$. This
ensures that the comparison $f_j = f_k$ is always valid, where $f_j$
and $f_k$ are the value of some feature $f$ for $j$ and $k$
respectively. \textit{unidentical} is defined similarly, except that
it counts over all features $N$, not just the shared features
$\textrm{\~N}_{jk}$. Here, features that occur in only $j$ or only $k$
contribute toward \textit{unidentical}'s total.
\begin{equation}
\textrm{unidentical}_{jk} = |f \in \textrm{N} : f_j \neq f_k|
\end{equation}
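
As a concrete illustration, the following Python sketch computes
Relative Identity Value for two items represented as dictionaries
mapping feature names to feature values. This is only an illustration
of equation \ref{riv}; the items and feature names are invented for
the example, and nothing here is taken from Goebl's implementation.

\begin{verbatim}
def relative_identity_value(j, k):
    """Relative Identity Value for two items given as dicts
    mapping feature names to feature values."""
    shared = set(j) & set(k)         # N~_jk: features present in both
    universe = set(j) | set(k)       # N: every feature seen in either item
    identical = sum(1 for f in shared if j[f] == k[f])
    # Features present in only one item count as unidentical.
    unidentical = sum(1 for f in universe
                      if f not in shared or j[f] != k[f])
    return identical / (identical + unidentical)

# Invented phonological items: value of each feature per item.
item_j = {"voice": "-", "place": "alveolar", "nasal": "-"}
item_k = {"voice": "+", "place": "alveolar"}
print(relative_identity_value(item_j, item_k))  # 1 / (1 + 2) = 0.33...
\end{verbatim}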
Weighted Identity Value (WIV) is a refinement of Relative Identity
Value. This measure defines some differences as more
important than others. In particular, feature values that only occur
in a few items give more information than feature values that appear
in a large number of items. \quotecite{wiersma09} normalization,
covered at the end of this chapter, reuses this idea for feature
ranking.
The reasoning behind this idea is fairly simple. Goebl
is interested in feature values that occur in only a few items. If a
feature has some value that is shared by all of the items, then all
items belong to the same group. This feature value provides {\it no}
useful information for distinguishing the items. The situation
improves if all but one item share the same value for a feature; at
least there are now two groups, although the larger group is still not
very informative. The most information is available if each item
being studied has a different value for a feature; the items fall
trivially into singleton groups, one per item.
Equation \ref{wiv-ident} implements this idea by discounting
the \textit{identical} count from equation \ref{riv} by
the amount of information that feature value conveys. The
amount of information, as discussed above, is based on the number of
items that share a particular value for a feature. If all items share
the same value for some feature, then \textit{identical} will be discounted all the
way to zero--the feature conveys no useful information.
Weighted Identity Value's equation for \textit{identical} is
therefore
\begin{equation}
\textrm{identical} = \sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_{j}}{(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.
\label{wiv-ident}
\end{equation}
\noindent{}The complete definition of Weighted Identity Value is
\begin{equation} \sum_i \frac{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.}
{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right. + |f \in \textrm{N} : f_j \neq f_k|}
\label{wiv-full}
\end{equation}
\noindent{}where $\textrm{agree}f_{j}$ is the number of items that
agree with item $j$ on feature $f$ and $Ni$ is the total number of
items ($w$ is the weight, discussed below). Because of the piecewise
definition of \textit{identical}, this number is always at least $1$
because $f_k$ already agrees with $f_j$. This equation takes the
count of shared features and weights them by the size of the sharing
group. The features that are shared with a large number of other
items get a larger fraction of the normal count subtracted. WIV is
similar to entropy from information theory, which forms the basis of
the Kullback-Leibler and Jensen-Shannon divergences described later
in this chapter \cite{lin91}. The difference is that WIV subtracts
values from 1 to make common features less important, while entropy
takes the logarithm. The result is similar, but the two divergences
are more theoretically principled in that they refer directly to information theory.
% TODO: CITE Shannon. Seriously, how did I avoid it?
For example, let $j$ and $k$ be sets of productions for the
underlying English segment /s/. The allophones of /s/ vary mostly on the feature
\textit{voice}. Seeing an unvoiced [s] for /s/ is less ``surprising'' than
seeing a voiced [z], so the discounting process should
reflect this. For example, assume that an English corpus contains 2000
underlying /s/ segments. If 500 of them are realized as [z], the
discounting for \textit{voice} will be as follows:
\begin{equation}
\begin{array}{c}
\textrm{identical}_{/s/\to[z]} = 1 - 500/2000 = 1 - 0.25 = 0.75 \\
\textrm{identical}_{/s/\to[s]} = 1 - 1500/2000 = 1 - 0.75 = 0.25
\end{array}
\label{wiv-voice}
\end{equation}
Each time /s/ surfaces as [s], it only receives 1/4 of a point
toward the agreement score when it matches another [s]. When /s/
surfaces as [z], it receives three times as much for matching
another [z]: 3/4 points towards the agreement score. If the
alternation is even more weighted toward faithfulness, the ratio
changes even more; if /s/ surfaces as [z] only 1/10 of the time,
then [z] receives 9 times more value for matching than [s] does.
The final value, $w$, which is what gives the name ``weighted
identity value'' to this measure, provides a way to control how much
is discounted. A high $w$ will subtract more from uninteresting
groups, so that \textit{voice} might be worth less than
\textit{place} for /t/ because /t/'s allophones vary more over
\textit{place}. In equation \ref{wiv-voice}, $w$ is left at 1 to
facilitate the presentation, but typically it is used like an ad-hoc
equivalent of information gain: the linguist can give more weight to
features that are believed to be salient.
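
To make the discounting concrete, the following Python sketch
reproduces the per-match credit of equation \ref{wiv-voice}, with the
weight $w$ left at 1 as in that example. It is only an illustration of
the discounting step, not Goebl's implementation.

\begin{verbatim}
def weighted_agreement(value, value_counts, total):
    """Credit earned when two items share a value for a feature,
    discounted by how common that value is (w = 1)."""
    return 1.0 - value_counts[value] / total

# The /s/ example: 2000 underlying /s/, 1500 realized as [s], 500 as [z].
counts = {"[s]": 1500, "[z]": 500}
print(weighted_agreement("[z]", counts, 2000))  # 0.75
print(weighted_agreement("[s]", counts, 2000))  # 0.25
\end{verbatim}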
\section{Dialectometry}
\label{methods-chapter-dialectometry-section}
It is at this point that the two types of analysis, phonological and
syntactic, diverge. Although Goebl's techniques are general enough to
operate over any set of features that can be extracted, better results
can be obtained by specializing the general measures above to take
advantage of properties of the input. Specifically, the application
of computational linguistics to dialectometry beginning in the 1990s
introduced methods from other fields. These methods, while generally
giving more accurate results quickly, are tied to the type of data on
which they operate.
Currently, the dominant phonological distance measure is Levenshtein
distance. This distance is essentially the count of differing
segments, although various refinements have been tried, such as
inclusion of distinctive features or phonetic
correlates. \namecite{heeringa04} gives an excellent analysis of the
applications and variations of Levenshtein distance. He investigated
varying levels of detail and differing feature sets. Interestingly,
although he extracted features from phonetic correlates, phonological
(distinctive) features, segments, and orthographic characters, the
more complex features failed to give any significant improvement over
simple segments. In addition, while Levenshtein
distance provides much information as a classifier, it is limited
because it must have a word-aligned corpus for comparison. A number of
statistical methods have been proposed that remove this requirement
such as \namecite{hinrichs07} and \namecite{sanders09}, but none have
been as successful on existing dialect resources, which are small and
are already word-aligned. New resources are not easy to develop
because the statistical methods still rely on a phonetic transcription
process.
\subsection{Syntactic Distance}
Recently, computational dialectometry has expanded to analysis of
syntax as well. The first work in this area was \quotecite{nerbonne06}
analysis of Finnish L2 learners of English, followed by
\quotecite{sanders07} analysis of British dialect areas. As explained
in chapters \ref{background-chapter} and \ref{questions-chapter},
syntax distance must be approached quite differently than phonological
distance. Syntactic corpora can be built quickly by automatically
annotating raw text, so it is easier to build a large syntactic corpus
than a phonological one; there is not yet a comparable method for
automatic phonological annotation. However, automatic annotators, while
faster, cannot compete with human annotators in quality of annotation.
This trade-off between annotation methods leads to the principal
difference between present phonological and syntactic corpora:
phonology data is word-aligned, keeping varying segments relatively
close, while syntax data is not sentence-aligned, meaning that
variation is distributed throughout the corpus. This difference leads
syntactic approaches naturally to statistical measures over large
amounts of data rather than more sensitive measures that operate on
small corpora.
\namecite{nerbonne06} were the first to use the syntactic distance
measure described below. They analyzed a corpus of Finnish L2 speakers
of English, divided by age. The first age group consisted of speakers
who learned English after childhood and the second of speakers who
learned English as children. Nerbonne \& Wiersma found a significant
difference between the two age groups. The features that were
unexpected in English contributed most to the difference; these were
associated primarily with the older age group. For example, some important
features for the older age group involved determiners, which English
has but Finnish does not. The features showed underuse of determiners,
as well as overuse, probably due to hypercorrection. Interestingly,
some of these features occur in the younger age group, but not as
often. Nerbonne \& Wiersma analyzed this pattern as interference from
Finnish; the younger age group learned English more completely with
less interference from Finnish.
My subsequent work in \cite{sanders07} and \cite{sanders08b}
expanded on the Finnish experiment in two ways. First, it introduced
leaf-ancestor paths as an alternative feature type. Second, it tested
the distance method on a larger set of sites: the Government Office
Regions of England, as well as Scotland and Wales, for a total of
11 sites. Each was smaller than the Finnish L2 age groups, so the
permutation test parameters had to be adjusted for some feature
combinations.
The distances between regions were clustered using hierarchical
agglomerative clustering, as described in section
\ref{cluster-analysis}. The resulting tree showed a North/South
distinction with some unexpected differences from previously
hypothesized dialect boundaries; for example, the Northwest region
clustered with the Southwest region. This contrasted with the
clustered phonological distances also produced in
\namecite{sanders08b}. In that experiment, there was no significant
correlation between the inter-region phonological distances and
syntactic distances.
There are several possible reasons for this lack of correlation. The
two distance measures may find different dialect boundaries based on
differences between syntax and phonology. Dialect boundaries may have
shifted during the 40 years between the collection of the SED and the
collection of the ICE-GB. One or both methods may be measuring the
wrong thing. In this dissertation, although the focus remains on results
of computational syntax distance as compared to traditional syntactic
dialectology, the discussion compares recent phonological
dialectometry results on Swedish to the results obtained here.
\subsubsection{Nerbonne and Wiersma}
\label{nerbonne06}
Due to the lack of alignment between the larger corpora available for
syntactic analysis, a statistical comparison of differences is more
appropriate than the simple symbolic approach possible with the
word-aligned corpora used in phonology. This statistical approach
means that a syntactic distance measure will have to use counting as
its basis.
\namecite{nerbonne06}'s method models syntax by part-of-speech (POS)
trigrams and uses differences between trigram type counts in a
permutation test of significance. The heart of the measure is simple:
the difference in type counts between the combined features of two
sites. \namecite{kessler01} originally proposed this measure, the
{\sc Recurrence} metric ($R$):
\begin{equation}
R = \sum_i |c_{ai} - c_{bi}|
\label{rmeasure}
\end{equation}
\noindent{}Given two sites $a$ and $b$, $c_a$ and $c_b$ are the
feature counts. $i$ ranges over all features, so $c_{ai}$ and $c_{bi}$ are the
counts of sites $a$ and $b$ for feature $i$. $R$ is designed to
represent the amount of variation exhibited by the two sites while
the contribution of individual features remains transparent to aid later
analysis. Unfortunately, $R$ does not indicate whether its results are
significant; a permutation test, described in section
\ref{permutationtest}, is needed for that.
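
A minimal Python sketch of equation \ref{rmeasure}, computing $R$ from
two dictionaries of feature counts (the feature names and counts are
invented for the example):

\begin{verbatim}
def recurrence_distance(counts_a, counts_b):
    """Kessler's R: summed absolute differences of per-feature counts."""
    features = set(counts_a) | set(counts_b)
    return sum(abs(counts_a.get(f, 0) - counts_b.get(f, 0))
               for f in features)

site_a = {"S-NP-N": 20, "S-VP-PP-N": 10}
site_b = {"S-NP-N": 15, "S-VP-PP-N": 15}
print(recurrence_distance(site_a, site_b))  # |20-15| + |10-15| = 10
\end{verbatim}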
\subsubsection{Dialectometry in British English}
The methods used in this dissertation are an evolution of those in my
previous work on British English: \cite{sanders07} and
\cite{sanders08b}. There, I compared phonological and syntactic
dialectometry as described above. The process is similar to Wiersma's
work in \cite{nerbonne06} and \cite{wiersma09}, but with
variants of both feature set and distance measure.
The input is 30 interview sites (described in section
\ref{syntactically-annotated-corpus}). The sentences in each site have
their features extracted (the features are described in section
\ref{syntactic-features}). Optionally, only 1000 sentences per site
are sampled with replacement; unlike the British interviews in my
previous work, the sites here are fairly similar in size, so this
sampling is only required for comparison with previous work. Then the features are
counted, producing a mapping of feature types to token counts.
At this point, two sites are compared based on these feature
counts. The feature counts are first normalized to account for
variation in corpus size (described in
the next section). Then they are converted to ratios, meaning
that the counts are scaled relative to the other site. For example,
counts of 10 and 30 would produce the ratio 1 to 3, as would the
counts 100 and 300. Finally, the distance (described above in
\ref{nerbonne06}) is calculated 10 times and the result is averaged.
The sites are sampled by sentence rather than by feature because the
intent is to capture syntax, where the composite unit is the
sentence. Similarly, phonology's composite unit is the word---most
processes operate within the word on individual segments; some
processes operate between words but they are fewer. Therefore, the
assumption that words are independent will lose some information but
not the majority. In the same way, the basic unit of syntax is the
sentence; processes operate on the words in the sentence, but
inter-sentence processes are fewer. Because of this, the sites are
sampled by sentence, combining the sentences of all speakers from an
interview site.
This dissertation skips the per-speaker sampling of
\quotecite{wiersma09} work on Finnish L2 speakers. I assume that,
since discovery of dialect features is the goal of this research, the
sentences of speakers from the same village are independent of the
speaker, at least with respect to dialect features. Although the
motivation is partly theoretical, there is also a difference between
the Swediasyn dialect corpus, with 2--4 speakers for each of 30 sites,
and Wiersma's L2 corpus, with dozens of speakers but only two
groups. Sampling per-speaker would not be feasible for the Swediasyn
because there aren't enough speakers per village.
\subsubsection{Normalization}
\label{normalization}
The two sites being compared can differ in size, even if the
samples contain the same number of sentences; if one site contains many
long sentences and the other contains many short ones, raw counts
will favor the features extracted from the long sentences simply
because each sentence yields more features. Additionally, the counts
are converted to ratios to ignore the effect of
frequency---in effect, this ranks features only by how much they differ
between the two sites, ignoring the question of how often they occur
relative to the other features extracted from the two sites. That is,
a high ratio for a rare feature that happens only ten times in both
sites is just as important as a high ratio for a common feature that
happens thousands of times.
The first normalization normalizes the counts for each feature within
the pair of sites $a$ and $b$. The purpose is to correct for
differences in sentence length: a site with longer sentences yields
relatively more feature tokens than a site with many short
sentences. Each feature count $i$ in a vector, for example
$a$, is converted to a frequency $f_i$ \[f_i=\frac{i}{N} \] where
$N$ is the length of $a$. For two sites $a$ and $b$ this produces
two frequency vectors, $f_{a}$ and $f_{b}$. Then the original counts in
$a$ and $b$ are redistributed according to the frequencies in
$f_a$ and $f_b$:
\[a'_{i} = \frac{f_{ai}(a_{i}+b_{i})}{f_{ai}+f_{bi}},
b'_{i} = \frac{f_{bi}(a_{i}+b_{i})}{f_{ai}+f_{bi}}\]
This redistributes the total of a pair from $a$ and $b$ based on
their relative frequencies. In other words, the total for each feature
remains the same:
\[ a_{i} + b_{i} = a'_{i} + b'_{i} \]
but the values of $a'_{i}$ and $b'_{i}$ are scaled by their frequency
within their respective vectors.
For example, assume that the two sites have 10 sentences each, with a
site $a$ with only 40 words and another, $b$, with 100 words. This
results in $N_a = 40$ and $N_b = 100$. Assume also that there is a
feature $i$ that occurs in both: $a_{i} = 8$ and $b_{i} = 10$. This
means that the relative frequencies are $f_{ai} = 8/40 = 0.2$ and
$f_{bi} = 10/100 = 0.1$. The first normalization will redistribute the
total count ($10 + 8 = 18$) according to relative frequencies. So
\[a'_i = \frac{0.2(18)}{0.2+0.1} = 3.6 / 0.3 = 12\] and
\[b'_i = \frac{0.1(18)}{0.2+0.1} = 1.8 / 0.3 = 6\] Now that 8 has
been scaled to 12 and 10 to 6, the fact that site $b$ has more words
has been normalized. This reflects the intuition that something that
occurs 8 of 40 times is more important than something that occurs 10
of 100 times.
% this is the (*2n / N) bit
The second normalization normalizes all values in both permutations
with respect to each other. This is simple: find the average number of
times each feature appears, then divide each scaled count by it. This
produces numbers whose average is 1.0 and which express how many
times more frequent than the average each feature is. The average
feature count is $N / 2n$, where $N$ is the number of feature
occurrences and $n$ is the number of feature
types in the combined sites. Division by two is necessary because $N$
sums occurrences over both sites, while each normalized count $a'_i$ or
$b'_i$ comes from a single site. Each entry in the ratio vector now becomes
\[ r_{ai} = \frac{2na'_i}{N}, r_{bi} = \frac{2nb'_i}{N}\]
For example, given the previous example numbers, this second
normalization first finds the average. Assuming 5 unique features for
$a$'s 40 total features and 30 for $b$'s total 100 features gives
\[n = 5 + 30 = 35\] and
\[N = 40 + 100 = 140\] Therefore, the average feature has $140 / 2(35)
= 2$ occurrences in $a$ and $b$ respectively. Dividing $a'_i = 12$ and
$b'_i = 6$ by this average gives $r_{ai} = 6$ and $r_{bi} = 3$. In
other words, $r_{ai}$ occurs 6 times more than the average feature.
Together, these normalizations control for the effect of variation in
sentence length (the first normalization), corpus size (the second
normalization), and relative overuse (the second
normalization). Furthermore, the normalizations can be iterated, with
the normalized output further re-normalized. This exaggerates the
differentiating effect of the normalization, which allows distance
measures to be more sensitive to feature count variations.
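
The two normalizations can be summarized in a short Python sketch.
This is an illustrative reimplementation of the formulas above, not
the code used for the experiments; it assumes that $N$, the number of
feature occurrences, equals the summed site sizes, as in the worked
example.

\begin{verbatim}
def normalize_pair(a, b, n_a, n_b, n_types):
    """First normalization: redistribute each feature's combined count
    between sites a and b according to their relative frequencies.
    Second normalization: divide by the average count per feature type.
    a, b map feature names to counts; n_a, n_b are the site sizes;
    n_types is the number of feature types in the combined sites."""
    a_prime, b_prime = {}, {}
    for f in set(a) | set(b):
        f_a = a.get(f, 0) / n_a          # relative frequency in a
        f_b = b.get(f, 0) / n_b          # relative frequency in b
        total = a.get(f, 0) + b.get(f, 0)
        a_prime[f] = f_a * total / (f_a + f_b)
        b_prime[f] = f_b * total / (f_a + f_b)
    average = (n_a + n_b) / (2 * n_types)    # N / 2n
    r_a = {f: v / average for f, v in a_prime.items()}
    r_b = {f: v / average for f, v in b_prime.items()}
    return r_a, r_b

# Worked example from the text: a_i = 8 of 40, b_i = 10 of 100,
# 5 + 30 = 35 feature types in the combined sites.
r_a, r_b = normalize_pair({"i": 8}, {"i": 10}, 40, 100, 35)
print(r_a["i"], r_b["i"])  # 6.0 3.0
\end{verbatim}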
\subsection{Syntax Features}
\label{syntactic-features}
In order to answer question 1, whether the distance measure agrees
with dialectology, a distance measure such as $R$ needs features that
capture the dialect syntax of the interview corpus given as
input. Following \namecite{nerbonne06}, I start with parts of speech,
then add the leaf-ancestor paths from my work on the ICE-GB
\cite{sanders07}, and finally add leaf-head paths and
phrase-structure rules, as well as variants on these features. These
feature sets each depend on a different type of automatic annotation,
which is described in section \ref{parsers}.
\namecite{nerbonne06} argue that POS trigrams can accurately represent
at least the important parts of syntax, similar to the way chunk
parsing can capture the most important information about a
sentence. If this is true, POS trigrams are a good starting point for
a language model; they are simple and easy to obtain in a number of
ways. They can either be generated by a tagger as Nerbonne
and Wiersma did, or taken from the leaves of the trees of a
syntactically annotated corpus as I did with the
International Corpus of English \cite{sanders07}.
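
For reference, extracting POS trigrams from a tagged sentence is
straightforward; a minimal Python sketch (the tag names are invented
for the example):

\begin{verbatim}
def pos_trigrams(tags):
    """Return the list of part-of-speech trigrams for one sentence."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

print(pos_trigrams(["Det", "N", "V", "Prep", "Det", "N"]))
# [('Det', 'N', 'V'), ('N', 'V', 'Prep'),
#  ('V', 'Prep', 'Det'), ('Prep', 'Det', 'N')]
\end{verbatim}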
Of course, bigrams are a possible feature type since they are so
similar to trigrams. I do not use them here for two reasons. First,
previous work uses trigrams, so trigrams are needed in order to remain
comparable. Second, the only advantage bigrams offer over trigrams is
reduced sparseness and noise, and neither feature sparseness nor noise
is a problem for trigrams when used with the distance measures
developed here, as will be seen in the results in chapter
\ref{results-chapter}.
On the other hand, if syntax is in fact a phenomenon that involves
hidden structure above the visible words of the sentence, a feature
set should be constructed to capture that
structure. \quotecite{sampson00} leaf-ancestor paths provide one way
to do this: for each leaf in the parse tree, leaf-ancestor paths
produce the path from that leaf back to the root. Generation is simple
as long as every sibling is unique. For example, the parse tree
\Tree[.S [.NP [.Det the ] [.N dog ] ] [.VP [.V barks ] ] ]
creates the following leaf-ancestor paths:
\begin{itemize}
\item S-NP-Det-the
\item S-NP-N-dog
\item S-VP-V-barks
\end{itemize}
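
The following Python sketch generates leaf-ancestor paths from a
constituency tree represented as nested (label, children) tuples with
bare strings as leaves. The representation is assumed for the
illustration, and the sketch omits the bracket disambiguation
described next.

\begin{verbatim}
def leaf_ancestor_paths(tree):
    """One root-to-leaf label path per word; no bracket disambiguation."""
    paths = []
    def walk(node, prefix):
        if isinstance(node, str):              # a leaf: emit its path
            paths.append("-".join(prefix + [node]))
        else:
            label, children = node
            for child in children:
                walk(child, prefix + [label])
    walk(tree, [])
    return paths

tree = ("S", [("NP", [("Det", ["the"]), ("N", ["dog"])]),
              ("VP", [("V", ["barks"])])])
print(leaf_ancestor_paths(tree))
# ['S-NP-Det-the', 'S-NP-N-dog', 'S-VP-V-barks']
\end{verbatim}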
There is one path for each word, and the root appears in each of
them. However, there can be ambiguities if some node happens to have
identically labeled siblings; in that case, brackets must be inserted
in the paths to disambiguate the siblings. Sampson gives the example
of the two trees
\Tree[.A [.B p q ] [.B r s ] ]
and
\Tree[.A [.B p q r s ] ]
which would both produce
\begin{itemize}
\item A-B-p
\item A-B-q
\item A-B-r
\item A-B-s
\end{itemize}
There is no way to tell from the paths which leaves belong to which
B node in the first tree, and there is no way to tell the paths of
the two trees apart despite their different structure. To avoid this
ambiguity, Sampson uses a bracketing system; brackets are inserted
at appropriate points to produce
\begin{itemize}
\item $[$A-B-p
\item A-B]-q
\item A-[B-r
\item A]-B-s
\end{itemize}
and
\begin{itemize}
\item $[$A-B-p
\item A-B-q
\item A-B-r
\item A]-B-s
\end{itemize}
Left and right brackets are inserted, at most one of each per path. A
left bracket is inserted in a path whose leaf is a leftmost sibling,
and a right bracket in a path whose leaf is a rightmost sibling. The
bracket is inserted at the highest node for which the leaf is leftmost
or rightmost. To see how this works, it is worth deriving the
bracketing of the previous two trees in detail.
In the first tree, with two B
siblings, the first path is A-B-p. Since $p$ is a leftmost child,
a left bracket must be inserted, at the root in this case. The
resulting path is [A-B-p. The next leaf, $q$, is rightmost, so a right
bracket must be inserted. The highest node for which it is rightmost
is B, because the rightmost leaf of A is $s$. The resulting path is
A-B]-q. Contrast this with the path for $q$ in the second tree; here $q$
is not rightmost, so no bracket is inserted and the resulting path is
A-B-q. $r$ is in almost the same position as $q$, but reversed: it is
leftmost, and the second B is the highest node for which it is
leftmost, producing A-[B-r. Finally, since $s$ is the rightmost leaf of
the entire sentence, the right bracket appears after A: A]-B-s.
At this point, the alert reader will have
noticed that both a left bracket and right bracket can be inserted for
a leaf with no siblings since it is both leftmost and rightmost. That is,
a path with two brackets on the same node could be produced: A-[B]-c. Because
of this redundancy, single children are
excluded by the bracket markup algorithm. There is still
no ambiguity between two single leaves and a single node with two
leaves because only the second case will receive brackets.
% See for yourself:
% \[\xymatrix{
% &\textrm{A} \ar@{-}[dl] \ar@{-}[dr] &\\
% \textrm{B} \ar@{-}[d] &&\textrm{B} \ar@{-}[d] \\
% \textrm{p} && \textrm{q} \\
% }
% \]
% \[\xymatrix{
% &\textrm{A} \ar@{-}[d] &\\
% &\textrm{B} \ar@{-}[dl] \ar@{-}[dr] & \\
% \textrm{p} && \textrm{q} \\
% }
% \]
% \cite{sampson00} also gives a method for comparing paths to obtain an
% individual path-to-path distance, but this is not necessary for the
% permutation test, which treats paths as opaque symbols.
Sampson originally developed leaf-ancestor paths as an improved
measure of similarity between gold-standard and machine-parsed trees,
to be used in evaluating parsers. The underlying idea of a collection
of features that capture distance between trees transfers quite nicely
to this application. I replaced POS trigrams with leaf-ancestor paths
for the ICE corpus and found improved results on smaller sites than
Nerbonne and Wiersma had tested \cite{sanders07}. The additional
precision that leaf-ancestor paths provide appears to aid in attaining
significant results.
\subsubsection{Leaf-Head Paths}
\label{leaf-head-paths}
For dependency parses, it is easy to create a variant of leaf-ancestor
paths called ``leaf-head paths''. Like leaf-ancestor paths, each word
in the sentence is associated with a single leaf-head path. The
difference is that the path is from the leaf to the head of the sentence via the
intermediate heads. For example, the same sentence, ``The dog barks'',
produces the following leaf-head paths, given the dependency parse in
figure \ref{example-dep-parse}:
\begin{figure}
\[\xymatrix{
& & root \\
Det \ar@/^/[r]^{DT} & N \ar@/^/[r]^{SS} & V \ar@{.>}[u] \\
The & dog & barks
}
\]
\caption{Dependency parse for ``The dog barks.''}
\label{example-dep-parse}
\end{figure}
\begin{itemize}
\item root-V-N-Det-the
\item root-V-N-dog
\item root-V-barks
\end{itemize}
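
A leaf-head path can be read off a dependency parse by following head
links from each word up to the root. The Python sketch below assumes a
simple representation of the parse as (word, POS, head) triples with
1-indexed heads and 0 for the root; it is an illustration, not the
extraction code used in the experiments.

\begin{verbatim}
def leaf_head_paths(parse):
    """parse: list of (word, pos, head) triples, 1-indexed heads,
    0 meaning the root. Returns one path per word."""
    paths = []
    for word, pos, head in parse:
        labels = [word, pos]             # the leaf and its own POS tag
        current = head
        while current != 0:              # climb through intermediate heads
            labels.append(parse[current - 1][1])
            current = parse[current - 1][2]
        labels.append("root")
        paths.append("-".join(reversed(labels)))
    return paths

# "The dog barks": 'the' depends on 'dog', 'dog' on 'barks'.
parse = [("the", "Det", 2), ("dog", "N", 3), ("barks", "V", 0)]
print(leaf_head_paths(parse))
# ['root-V-N-Det-the', 'root-V-N-dog', 'root-V-barks']
\end{verbatim}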
The biggest difference between leaf-ancestor paths and leaf-head paths
is the relative length of the paths: long
leaf-ancestor paths indicate deep nesting of structure, while short
ones indicate flatter structure. Length is a
weaker indicator of deep structure for leaf-head
paths; for example, the verb in a nested clause has a much shorter
leaf-head path than leaf-ancestor path, but its dependents have
comparable lengths between the two types of paths. Instead, length of
path measures centrality to the sentence; longer leaf-head paths
indicate less important words.
Leaf-head paths represent a compromise between leaf-ancestor paths and
trigrams. Like trigrams, they capture lexical context, but the context
is based on head dependencies, so long-distance context is
possible. Like leaf-ancestor paths, they capture information about the
nested structure of the sentence, although not as completely or
explicitly.
\subsection{Alternate Feature Sets}
\label{alternate-feature-sets}
This section describes additional feature sets beyond the three main
ones already described above: trigrams, leaf-ancestor paths, and
leaf-head paths. Most are variants of these three.
\subsubsection{Part-of-Speech Unigrams}
Part-of-speech unigrams are single parts of speech. Unlike POS
trigrams, they do not capture context or order, only distributional
differences. In this dissertation, they serve as a baseline since they
are not expected to capture syntactic variation as much as the other
feature sets.
\subsubsection{Phrase Structure Rules}
Phrase structure rules are extracted from the same parses as
leaf-ancestor paths, but instead of capturing a series of parent-child
relations, they capture single-level parent-child-sibling
relations. For example, given the tree in figure
\ref{psg-example-tree}, the extracted rules are given in
figure \ref{psg-example}.
\begin{figure}
\Tree[.S [.NP [.Det the ] [.N dog ] ] [.VP [.V barks ] ] ]
\caption{Example Tree}
\label{psg-example-tree}
\end{figure}
\begin{figure}
\begin{tabular}{ccc}
\Tree[.S NP VP ] & \Tree[.NP Det N ] & \Tree[.VP V ] \\
\end{tabular}
\caption{Phrase-Structure Rules Extracted}
\label{psg-example}
\end{figure}
Phrase structure rules are most similar to leaf-ancestor paths in
emphasizing the hidden structure of constituency parse trees. Unlike
leaf-ancestor paths, they capture some context to the left and
right. They also cover only one level in the tree, whereas
leaf-ancestor paths traverse it from leaf to root. Phrase structure
rules may be useful in sentences where context is
important, but they also depend on having accurate parses even at the
top of the tree, which is difficult for automatic parsers to achieve.
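
Extraction of phrase structure rules from the same tree representation
used in the leaf-ancestor path sketch above could look like the
following; this is again only an illustration.

\begin{verbatim}
def phrase_structure_rules(tree):
    """One parent -> children rule per internal node whose children are
    labeled nodes; POS-over-word nodes are not expanded further."""
    rules = []
    def walk(node):
        label, children = node
        if all(isinstance(c, tuple) for c in children):
            rules.append((label, tuple(c[0] for c in children)))
            for child in children:
                walk(child)
    walk(tree)
    return rules

tree = ("S", [("NP", [("Det", ["the"]), ("N", ["dog"])]),
              ("VP", [("V", ["barks"])])])
print(phrase_structure_rules(tree))
# [('S', ('NP', 'VP')), ('NP', ('Det', 'N')), ('VP', ('V',))]
\end{verbatim}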
\subsubsection{Grandparent Phrase Structure Rules}
Grandparent phrase structure rules are a variant of phrase structure
rules that include the grandparent as well. Given the tree in figure
\ref{psg-example-tree}, the extracted features are given in
figure \ref{grand-psg-example}.
\begin{figure}
\begin{tabular}{ccc}
\Tree[.ROOT [.S NP VP ] ] & \Tree[.S [.NP Det N ] ] &
\Tree[.S [.VP V ] ] \\
\end{tabular}
\caption{Grandparent Phrase-Structure Rules Extracted}
\label{grand-psg-example}
\end{figure}
Grandparent phrase structure rules add some of the vertical information present in
leaf-ancestor paths, hopefully without introducing data sparseness
problems, while retaining the advantage over
leaf-ancestor paths of capturing left and right context.
\subsubsection{Arc-Head Paths}
As described in section \ref{leaf-head-paths}, the usual labels for
leaf-head paths are the leaves of the tree: `root-V-N-Det-the' is the
first leaf-head path for ``The dog barks'', which has the parts of
speech ``Det N V''. However, one can
also use the arc labels of the dependency parse to create arc-head
paths. These paths have the same shape as their corresponding
leaf-head paths, but use the labels of the dependency arcs between
words instead of the parts of speech of the words themselves.
The sentence for the leaf-head example is given in figure
\ref{example-dep-parse}, and the resulting arc-head paths are
\begin{itemize}
\item root-SS-DT-the
\item root-SS-dog
\item root-barks
\end{itemize}
% TODO: To the future work section, add:
% 1. Extract deps from CFG parses from Berkeley
% 2. Label dep features with both arc and POS tags interleaved in the
% proper order.
% 3. Better tag set. (duh, probably already have this one)
% 4. Non-linear feature set combination.
\subsubsection{Tags from Berkeley Parser}
The Berkeley parser \cite{petrov06}, as described in section
\ref{parsers}, can either tag incoming sentences with its own part of
speech tagger, using the same splitting process as the rest of the
parser, or with parts of speech specified externally. In this case,
the external part-of-speech tagger is T`n'T \cite{brants00}. Although
the Berkeley-generated POS tags are not as accurate, it is useful to
see how they change the overall results: accurate parts of speech seem
to be necessary for generating good features, so this variant shows how
much the results degrade when given lower-quality parts of speech. The
Berkeley-generated POS tags are used to generate both trigrams and, by
feeding them into MaltParser, leaf-head paths.
To get the Berkeley parser to generate parts of speech, it is given
the interview sites directly, skipping the tagging by T`n'T. Using
the same method it uses to parse the higher structure of the
sentence, it also tags words with parts of speech. After parsing,
these parts of speech are extracted from the leaves of the parse
trees. First, the parts of speech are used to create trigram
features. Second, the parts of speech are given to MaltParser for
dependency parsing. This produces dependency parses based on parts of
speech from the Berkeley parser. The result is trigrams and leaf-head
paths based on Berkeley-generated POS tags instead of T`n'T-generated
POS tags.
\subsubsection{Dependencies from alternate MaltParser training}
Since MaltParser uses Nivre's oracle-based dependency parsing
algorithm, the default oracle, based on support vector machines, can
be replaced with Timbl, the Tilburg Memory-Based Learner. It is
possible that a memory-based learner improves parsing
because support vector machines depend on large
training corpora to provide good results. In contrast, a memory-based
learner can obtain good results on limited training if the training
happens to be representative and the right combination of parameters
can be found for Timbl.
This is, however, somewhat complicated since Timbl is quite sensitive
to parameter changes and usually requires specific tuning for
particular tasks. To find the best
parameters, I used a manual search across a number of the major
distance measures provided by Timbl, as well as fallback combinations
from more complicated distance measures to less complicated ones.
Each combination was evaluated with ten-fold cross-validation on
Talbanken. The best combination was Jeffrey divergence with 5 nearest
neighbors, no feature weighting, inverse distance neighbor
weighting, and fallback to the Overlap metric for fewer than two
neighbors. Jeffrey divergence is a symmetric variant of
Kullback-Leibler divergence, also described in section
\ref{kl-divergence}. These parameter settings were used as a basis for
parsing and generation of leaf-head paths.
% \subsubsection{Within-clause Dependency/Leaf-ancestor paths}
% I haven't done this. This is interview data and there are not many
% nested clauses---probably less than 1 in 3 and I don't think it would make much
% difference. Besides, it would be difficult to specify a set of
% criteria for cutting off within-a-clause---simply removing everything
% between the root and the first S would miss some nested clauses.
% All right, so maybe you really could just use parts of speech to
% tell. Even if it worked, I still don't think there would be much
% difference because very few of the important features without
% within-clause cutoff have multiple clauses--most are simple and
% non-nested. Oh, right. Simplifying to ignore clause nesting would
% increase the power of phenomena that happen regardless of nesting
% level but don't occur enough to be visible otherwise.
\subsection{Combining Feature Sets}
\label{combine-feature-sets}
Combining feature sets gives the classifier more information about a
site, pooling the information that each individual feature set
captures. This dissertation uses a simple linear combination. In other
words, all features are counted together with equal
weight. This is easy and should allow the feature ranker
to find a greater variety of features that capture the
same underlying syntactic information.
% I don't think I'll do this (I probably won't even do the Volk thing
% since it involves a lot of parameterisation in icecore.h.
% Well, I *could* instead replicate each feature by its weight
% before combining, but this would probably make classification super
% slow because of the huge number of features.)
% ----
% Combinations of feature types will be ranked by
% averaging the number of significant distances that the constituent
% feature types produce.
\subsection{Alternate Distance Measures}
\label{alternate-distance-measures}
There are several reasons to test distance measures besides $R$. A
priori, $R$ is fairly simple, so more complicated variations on it may
provide better sensitivity, though possibly at the cost of greater
sensitivity to noise. Variations also explore the measure space better
in case $R$ is not significant for some combination of corpus and
feature set.
Post hoc, the combination of distance measure and feature set produces
interesting patterns of statistical significance that are not trivially
obvious. These patterns were not expected, but they may provide insight
into the measure/feature combinations, which helps resolve Hypotheses 1
and 2.
% Another possibility is a return to Goebl's Weighted Identity Value;
% this classifier is similar in some ways to $R$, but has not been
% tested with large corpora, to my knowledge at least.
\subsubsection{Kullback-Leibler divergence}
\label{kl-divergence}
Kullback-Leibler divergence, or relative entropy, is described in
\namecite{manningschutze}. Relative entropy is similar to $R$ but
more widely used in computational linguistics. The name relative
entropy implies an intuitive interpretation: it is the number of bits
of entropy incurred when compressing a site $b$ with the optimal
compression scheme for a second site $a$. Unless the two sites
are identical, the relative entropy $KL(a||b)$ is non-zero because
$a$'s optimal compression scheme will over-compress $b$'s
features that are more common in $a$ than in $b$, whereas it will
under-compress features that are less common in $a$ than in $b$.
For example, assume that site $a$ has two features with type counts
\{S-NP-N : 20, S-VP-PP-N : 10\}. An optimal compression scheme for $a$
would give S-NP-N a shorter code than S-VP-PP-N because it occurs
twice as often. However, if this compression scheme is used on a
site $b$ with the feature counts \{S-NP-N : 15, S-VP-PP-N : 15\},
efficiency will be worse; S-NP-N and S-VP-PP-N occur the same number
of times in $b$, so the smaller compressed size of S-NP-N will be used
less often than expected, while the larger compressed size of
S-VP-PP-N will be used more. This difference can be measured precisely
for each feature:
\[ a_i \log\frac{a_i}{b_i} \]
where $a_i$ is type count of the $i$th feature in $a$ and $b_i$
is the type count of the $i$th feature in $b$. This measures the
number of bits lost, or entropy, for each feature $i$. Like $R$'s
differences, the per-feature entropy can be summed to find the total
entropy. In the example above, the entropy for S-NP-N is $20
\log\frac{20}{15} = 5.75$.
However, Kullback-Leibler divergence as defined is asymmetric: it
measures the divergence of the features of site $b$ from the
features of site $a$. A dissimilarity is required for
dialectology, which means that the divergence must additionally be
symmetric. A divergence can be made symmetric by calculating it twice:
the divergence from $a$ to $b$ added to the one from $b$ to
$a$. The complete formula is given in equation \ref{klmeasure} and
the complete example is worked in equation \ref{klexample}.
\begin{equation}
KL(a||b) = \sum_i {a_i \log\frac{a_i}{b_i} + b_i \log\frac{b_i}{a_i}}
\label{klmeasure}
\end{equation}
\begin{equation}
(20 \log\frac{20}{15} + 15 \log\frac{15}{20}) + (10
\log\frac{10}{15} + 15 \log\frac{15}{10}) = (5.75 - 4.32) + (-4.05 +
6.08) = 3.46
\label{klexample}
\end{equation}
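
A short Python sketch of the symmetric divergence in equation
\ref{klmeasure}, using natural logarithms as in the worked example.
It is an illustration only; as discussed below, features with a zero
count in either site are simply skipped.

\begin{verbatim}
from math import log

def symmetric_kl(counts_a, counts_b):
    """Symmetrized Kullback-Leibler divergence over feature counts;
    features with a zero count in either site are skipped."""
    total = 0.0
    for f in set(counts_a) | set(counts_b):
        a, b = counts_a.get(f, 0), counts_b.get(f, 0)
        if a > 0 and b > 0:
            total += a * log(a / b) + b * log(b / a)
    return total

site_a = {"S-NP-N": 20, "S-VP-PP-N": 10}
site_b = {"S-NP-N": 15, "S-VP-PP-N": 15}
print(round(symmetric_kl(site_a, site_b), 2))
# 3.47 (equation klexample gives 3.46 by rounding intermediate terms)
\end{verbatim}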
\subsubsection{Jensen-Shannon divergence}
Several variants of relative entropy exist that lift various restrictions from the input
distributions. One is Jensen-Shannon divergence \cite{lin91}, which
was designed as a dissimilarity from the start. It uses the same
denominator for both directions: the average of the two
frequencies. That means that each feature's entropy is found using
the following formula:
\[ a_i \log\frac{a_i}{(a_i + b_i) / 2} + b_i \log\frac{b_i}{(a_i + b_i) / 2} \]
There is a common subexpression in this value, $(a_i + b_i) / 2$: the
average of the two feature counts. If we let $\bar{c_i} = (a_i + b_i) / 2$
and rewrite the formula to take advantage of this simplification, we get
equation \ref{jsmeasure}.
\begin{equation}
JS = \sum_i {a_i \log\frac{a_i}{\bar{c_i}} + b_i \log\frac{b_i}{\bar{c_i}}}
\label{jsmeasure}
\end{equation}
Unlike Kullback-Leibler divergence, Jensen-Shannon divergence does not
require that features exist in both sites being compared in order to
be counted. KL divergence cannot count unique features, in fact,
because if either $a_i$ or $b_i$ is zero, then it will divide by zero
at some point. The current implementation of KL divergence simply
skips zero values, which means it ignores features unique to a
particular site. Jensen-Shannon divergence avoids this problem because
it divides by $\bar{c_i}$, the average of the feature counts.
Because KL divergence ignores features unique to one site, it should
be less susceptible to noise than JS divergence. This can be useful in
the presence of unreliable annotators. However, KL divergence will
also produce less detail in the case that unique features are useful.
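
A corresponding Python sketch of equation \ref{jsmeasure} shows how
features with a zero count in one site still contribute; the feature
counts are invented for the example.

\begin{verbatim}
from math import log

def jensen_shannon(counts_a, counts_b):
    """Jensen-Shannon-style divergence over feature counts: each count
    is compared against the average of the two sites' counts."""
    total = 0.0
    for f in set(counts_a) | set(counts_b):
        a, b = counts_a.get(f, 0), counts_b.get(f, 0)
        c_bar = (a + b) / 2
        if a > 0:
            total += a * log(a / c_bar)
        if b > 0:
            total += b * log(b / c_bar)
    return total

# S-NP-Pro is unique to site a but still contributes to the total.
site_a = {"S-NP-N": 20, "S-VP-PP-N": 10, "S-NP-Pro": 5}
site_b = {"S-NP-N": 15, "S-VP-PP-N": 15}
print(round(jensen_shannon(site_a, site_b), 2))  # 4.33
\end{verbatim}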
\subsubsection{Cosine similarity}
Cosine similarity is used in many parts of computational linguistics
and related areas such as information extraction and data
mining. \namecite{nerbonne06} use it as a reference point for
comparison to previous work in these areas. Cosine similarity measures
the similarity between two high-dimensional points in space. Each
feature is modeled as a dimension, and the type count from each
site is plotted as a point on that dimension. In equation \ref{cosmeasure}, vectors $a$
and $b$ are multiplied, then divided by the product of their