results.tex

\chapter{Results}
\label{results-chapter}

These results are meant to answer two main questions: first, how well does
this approach to syntactic dialectometry agree with dialectology?
Second, what combinations of distance measures, feature
sets and other settings produce the best results for linguistic
analyses? Additionally, the results are meant to allow comparison with
phonological dialectometry.

The organization of this chapter mirrors the order of the methods
chapter, particularly the output analysis (section
\ref{output-analysis}). First, there is an overview of the different
parameter settings, the combinations of distance measure and feature
set, as well as other settings (section
\ref{section-parameter-settings}). Then the number of significant
distances for each parameter setting is given (section
\ref{section-significant}), which is followed by the correlation with
geography and travel distance for each parameter setting (section
\ref{section-correlation}). These sections focus mainly on detecting
which settings do not produce valid results, so that they can be
ignored in the rest of the chapter. At a high level, they answer the
question of the suitability of statistical syntactic dialectometry:
whether or not significant results can be found.

Next, the specific dialectological results are examined. First,
cluster dendrograms provide a visualization of which sites the
distance measures find to be similar (section
\ref{section-clusters}). In addition, to improve the reliability of
the dendrograms, consensus trees (section \ref{section-consensus}) and
composite cluster maps are produced (section
\ref{section-composite-cluster}). Next, multi-dimensional scaling
gives a smoother view of similarity than clusters (section
\ref{section-mds}). Finally, features are ranked and extracted from
each cluster in the consensus tree (section \ref{section-features}).

\section{Parameter Settings}
\label{section-parameter-settings}

There are 180 parameter settings investigated in this chapter. This
number arises from the four parameters: measure, feature set, sampling
method and number of normalization iterations. 5 measures, 9 feature
sets, 2 sampling methods and 2 iterations of normalization
gives $5\times 9 \times 2 \times 2=180$ different settings. The
settings are given in table \ref{parameter-settings}.

\begin{table}
\begin{tabular}{|c|} \hline
  Feature Set \\\hline
  Leaf-Ancestor Path \\
  Part-of-speech Trigram \\
  Leaf-Head Path \\
  Phrase Structure Rule \\
  PSR with Grandparent \\
  Part-of-speech Unigram \\
  Leaf-Head Path, based on Timbl training \\
  Leaf-Arc Path \\
  All features combined \\ \hline
\end{tabular}
\begin{tabular}{|c|} \hline
  Measure \\ \hline
  $R$ \\
  $R^2$ \\
  Kullback-Leibler divergence \\
  Jensen-Shannon divergence \\
  cosine dissimilarity\\\hline \hline
  Sampling Method \\ \hline
  1000 sentences \\
  All sentences \\ \hline \hline
  Iterations of normalization \\ \hline
  1 \\
  5 \\ \hline
\end{tabular}
\vspace{5mm}
\caption{Settings for the five parameters tested}
\label{parameter-settings}
\end{table}
% Actually, all this should probably go in methods too, somewhere as a summary.

In addition, the size of each of the 30 interview sites are given in
table \ref{corpus-size}.

\begin{table}
\begin{tabular}{c|cc|c|cc}
      Site & Sentences & Words & Site & Sentences & Words \\\hline
     Ankarsrum &  630 &  7708 & Leksand &  923 &   10676\\
    Anundsjo &  1144 &   11897 &  Loderup &  429 &   7850\\
    Arsunda &  937 &   8933 & Norra Rorum &  546 &  9160\\
     Asby &  693 &   7171 & Orust &  1067 &   11409\\
     Bara &  696 &   10724 & Ossjo &  481 &   12275\\
     Bengtsfors &  663 &   7423 & Segerstad &  837 &   9746\\
    Boda &  1029 &   17425 &  Skinnskatteberg &  730 &   9529\\
     Bredsatra &  360 &   6938 & Sorunda &  768 &   11144\\
     Faro &  659 &   8260 & Sproge &  381 &   4399\\
     Floby &  557 &   6392 & StAnna &  876 &   13156\\
     Fole &  727 &   9920 & Tors\aa{}s &  374 &   9217\\
     Frillesas &  572 &   9634 & Torso &  956 &   15577\\
    Indal &  1126 &   13090 &  Vaxtorp &  903 &   11353\\
     Jamshog &  301 &   8661 & Viby &  431 &   6734\\
     K\"ola &  528 &   10133 & Villberga &  680 &   11479\\
\end{tabular}
  \caption{Size of Interview Sites}
  \label{corpus-size}
\end{table}

\section{Significant Distances}
\label{section-significant}

Significant distances help answer the question whether a syntactic
measure has succeeded in finding reliable distances; the measure will
always return some distance, but if the sites are too small, it may
not be significant. Therefore the results should have few
non-significant distances. In the tables, the total number of
comparisons between all 30 sites is the $435=30(30-1) / 2$. In the
first set, tables \ref{sig-1-1000} -- \ref{sig-1-full}, the results
are shown from one iteration of the normalization step. In the second
set, tables \ref{sig-5-1000} -- \ref{sig-5-full}, the results from
five normalization iterations are shown.

Bold numbers in the tables indicate that fewer than 95\% of the
distances were significant. In table \ref{sig-5-full}, the
5-iteration table that compares full sites, the only combination
with {\it less} than 5\% non-significant results is cosine
dissimilarity with unigram features, marked with italics. Note that
here, 5\% is an arbitrary cutoff point not related to the usual
significance cutoff $p < 0.05$; the basis for these tables are
themselves number of significant distances found.

\begin{table}
\begin{tabular}{l|rrrrr}
  & $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor &0&0&11&0&0 \\
  Trigram &0&0&0&0&0 \\
  Leaf-Head &0&0&0&0&0 \\
  Phrase-Structure Rules &0&0&\textbf{95}&0&0 \\
  Phrase-Structure with Grandparents &0&0&\textbf{273}&0&0 \\
  Unigram &0&0&0&0&0 \\
  Leaf-Head with MaltParser trained by Timbl &0&0&\textbf{47}&0&0 \\
  Leaf-Arc Labels&0&0&0&0&0 \\
  All Features Combined &0&0&0&0&0 \\
\end{tabular}
\caption{Number of non-significant distances for sample size 1000, 1
  normalization}
\label{sig-1-1000}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&7&11&12&\textbf{35}&9 \\
  Trigram&4&1&0&\textbf{24}&1 \\
  Leaf-Head&10&12&20&\textbf{44}&19 \\
  Phrase-Structure Rules&\textbf{26}&17&\textbf{24}&\textbf{49}&20 \\
  Phrase-Structure with Grandparents&\textbf{58}&\textbf{35}&\textbf{38}&\textbf{71}&\textbf{33}
   \\
  Unigram&1&2&0&0&2 \\
  Leaf-Head with MaltParser trained by Timbl&11&21&18&\textbf{74}&\textbf{30}
   \\
  Leaf-Arc Labels&14&19&\textbf{37}&\textbf{94}&17 \\
  All Features Combined&0&0&1&8&2 \\
\end{tabular}
 \caption{Number of non-significant distances for complete sites, 1
   normalization}
 \label{sig-1-full}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&5 & \textbf{56} & \textbf{34} & 0 & 0\\
  Trigram&3 & 2 & 0 & 0 & 0\\
  Leaf-Head&3 & 14 & 4 & 0 & 0\\
  Phrase-Structure Rules&11 & 4 & \textbf{66} & 1 & 0\\
  Phrase-Structure with Grandparents&18 & 0 & \textbf{109} & 4 & 0\\
  Unigram&\textbf{52} & \textbf{53} & 15 & 17 & 0\\
  Leaf-Head with MaltParser trained by Timbl&7 & 20 & \textbf{45} & 0 & 0\\
  Leaf-Arc Labels&6 & \textbf{54} & 17 & 1 & 0\\
  All Features Combined&0 & 4 & 0 & 0 & 0\\
\end{tabular}
\caption{Number of non-significant distances for sample size 1000, 5
  normalizations}
\label{sig-5-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&\textbf{290} & \textbf{284} & \textbf{287} & \textbf{278} & \textbf{204}\\
  Trigram&\textbf{284} & \textbf{283} & \textbf{283} & \textbf{276} & \textbf{196}\\
  Leaf-Head&\textbf{293} & \textbf{286} & \textbf{285} & \textbf{279} & \textbf{211}\\
  Phrase-Structure Rules&\textbf{289} & \textbf{294} & \textbf{286} & \textbf{275} & \textbf{236}\\
  Phrase-Structure with Grandparents&\textbf{285} & \textbf{290} & \textbf{286} & \textbf{270} & \textbf{258}\\
  Unigram&\textbf{297} & \textbf{296} & \textbf{294} & \textbf{293} &
  \textit{9}\\
  Leaf-Head with MaltParser trained by Timbl&\textbf{294} & \textbf{289} & \textbf{288} & \textbf{284} & \textbf{222}\\
  Leaf-Arc Labels&\textbf{294} & \textbf{290} & \textbf{291} & \textbf{293} & \textbf{162}\\
  All Features Combined&\textbf{279} & \textbf{279} & \textbf{279} & \textbf{269} & \textbf{191}\\
\end{tabular}
 \caption{Number of non-significant distances for complete sites, 5
   normalizations}
 \label{sig-5-full}
\end{table}

Analysis of the significance of dialect distance provides a measure of
how reliable the distances to be analyzed later in this chapter are. A
distance that does not find significant distances between the 30
sites is not suitable for precise inspection, although small numbers
of non-significant distances will still allow methods to
return interpretable results.

The highest number of significant distances are found in the first
case (table \ref{sig-1-1000}): 1 round of normalization with a
fixed-size sample of 1000 sentences. From there, both full-site
comparisons (table \ref{sig-1-full}) and 5 rounds of normalization
(table \ref{sig-5-1000}) have fewer significant distances, although
the number is still usable. However, the combination of the two, with
5 rounds of normalization over full-site comparisons, has only one
combination with fewer than 5\% of distances that are {\it not}
significant. Although both full-site comparisons and multiple rounds
of normalization may increase the precision of the results, their
combined effect on significance is so detrimental that its results are
useless. For the rest of the analysis, the combination of full-site
comparison and 5 rounds of normalization will be skipped.

\subsection{Significance by Measure}

The distance measures most likely to find significance are, in order,
cosine dissimilarity, Jensen-Shannon divergence and $R$. Each method
had different parameter settings for which it was stronger. For
1000-sentence sampling, tables \ref{sig-1-1000} and \ref{sig-5-1000},
cosine similarity resulted in all significant distances, even for
part-of-speech unigrams, which are intended as the baseline feature
set. Excluding unigrams, Jensen-Shannon divergence has similar
performance. For full-site comparisons, tables \ref{sig-1-full} and
\ref{sig-5-full}, both perform considerably worse; surprisingly, both
perform better on unigram features, Jensen-Shannon so much so that it
is the only feature set for which it finds all significant
distances. $R$, on the other hand, performs decently on all
combinations of parameter settings; its low significance for phrase
structure rules is shared by Kullback-Leibler and Jensen-Shannon
divergences.

When comparing the performance of Kullback-Leibler and Jensen-Shannon
divergence it is not surprising that Jensen-Shannon outperforms
Kullback-Leibler on fixed-size sampling. Although both are called
``divergence'', Jensen-Shannon divergence is actually a
dissimilarity. Recall that the divergence from point A to B may differ
from the divergence from point B to A. A divergence like
Kullback-Leibler can be converted to a dissimilarity by measuring
$KL(A,B) + KL(B,A)$. However, this dissimilarity must skip features
unique to a single site in order to avoid division by zero. This
means that for smaller sites Kullback-Leibler loses information that
Jensen-Shannon is able to use.  On the other hand, while this may
explain Kullback-Leibler's improved performance for full-site
comparisons, it doesn't explain Jensen-Shannon's much worse
performance.

\subsection{Significance by Feature Set}

% \item Unigrams do form an adequate baseline; they are bad but not too
%   bad.

% The feature sets most likely to find significance are the combined
% features and unigrams., in order,
% trigrams, all combined features and leaf-head paths (both with
% support-vector-machine training and with Timbl's instance-based
% training). Without ratio normalization, the other feature sets are not
% much worse, but with it included, these three are the best by some
% distance.

For 1 round of normalization, the best feature sets are the simple
ones: trigrams and unigrams, as well all combined features. On the
other hand, trigrams and leaf-head paths (with its variations) are the
best feature sets with 5 rounds of normalization. However, the
variation isn't strong; any feature set can give good results with the
right distance measure. The problem is that no clear patterns emerge.

The relatively high quality of trigrams and unigrams does not make
sense given only the linguistic facts; however, it is likely that the
entirely automatic annotation used here introduces more and more
errors as more annotators run, operating on previous automatic
annotations. Trigrams are the result of only one automatic annotation,
and one for which the state of the art is near human performance. So
the fact that these particular parts of speech are of higher quality
than the corresponding dependencies or constituencies is probably the
deciding factor in their higher number of significant
distances.

% Although it is impossible to tell from my results, I
% predict that a manually annotated dialect corpus would show that
% non-flat syntactic structure is helpful in producing significant
% distances.

Given the above facts, the question should rather be: why do leaf-head
paths perform as well as they do? Better, for example, than the
leaf-ancestor paths on which they're modeled: why does more
normalization hurt leaf-ancestor paths but not leaf-head paths?  It
could be that there is less room for error; many of the common
leaf-head paths are short: short interview sentences with simple
structure make for shorter leaf-head paths than leaf-ancestor
paths. As a result, the important leaf-head paths consist mainly of a
couple of parts-of-speech. This difference in feature length holds for
any length of sentence, but is exaggerated for simple sentences, where
the amount of structure generated for a phrase-structure parse for a
clause is more than for a dependency parse. In general, clauses,
embedded and otherwise, produce the largest difference in amount of
structure between the two, so the feature length differs for deeply
nested sentences as well.

Another reason could be a difference in parsers: MaltParser has been
tested on Swedish by its designers \cite{nivre06b}. Besides English, the Berkeley
parser has been tested prominently on German and Chinese. Therefore,
the difference would better be explained by appealing to the
difference in parsers rather than an unsuitability of Swedish for
constituent analysis.

It is disappointing linguistically that trigrams provide the most
reliable results so far; a linguist would expect that including
syntactic information would make it easier to measure the differences
between sites. If it is, as hypothesized here, an effect of chaining
machine annotators, a study using a manually annotated corpus could
detect this. However, it still means that trigrams are the most useful
feature set from a practical view, because automatic trigram tagging
is very close to human performance with little training. That means
the only required human work is the transcription of interviews in
most cases.

On the other hand, if additional features sets are to be developed for
a corpus, then combining all available features seems to be a
successful strategy. The distance measures seem to be able to use all
available information for finding significant distances.


\section{Correlation}
\label{section-correlation}

In dialectology, the default expectation for dialect distance is that
it correlates with geographic distance \cite{chambers98}. A lack of
correlation does not necessarily mean that a measure is invalid, but
presence of correlation means that the distance measure substantiates the
well-known tendency of dialect distributions to be more or less
smoothly gradient over physical space.

In addition, distance measures are more likely to correlate
significantly with travel distance than with straight-line geographic
distance. This makes sense since the difficulty of moving from place
to place is what influences dialect formation, and taking roads into
account is an improved estimate over straight-line distance.

The tables that present geographic and table correlation,
\ref{cor-1-1000} -- \ref{travel-cor-5-full}, mark significant
correlations with a star for $p < 0.05$, two stars for $p < 0.01$ and
three stars for $p < 0.001$. However, these correlations are only
trustworthy in the case that the underlying distances are
significant. Significant correlations from significant distances (as
cross-referenced from tables \ref{sig-1-1000} -- \ref{sig-5-full}) are
marked by italics.

Besides this, correlation between combinations of measure/feature set
can show how closely related they are--in other words, how similarly
they view the underlying data which remains the same for all. It is
analyzed in section \ref{results-chapter-inter-measure-correlation}.

This is similar to the reasoning behind correlation with
geography---but the assumption is that geography is a factor
underlying dialect formation; while the distance measure measures some
aspect of the language which we hope is dialects, it is indirectly
(even less directly) measuring the geography. Therefore, correlation
with geography should occur.

Third, correlation with corpus size is not predicted and is probably
an undesired defect in sampling or normalization. Correlation with
corpus size is presented in tables \ref{size-cor-1-1000} --
\ref{size-cor-5-full}.

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&-0.01 & 0.03 & 0.02 & -0.02 & 0.08\\
  Trigram&0.17 & 0.17 & 0.10 & 0.19 & 0.13\\
  Leaf-Head&-0.06 & 0.03 & 0.00 & -0.07 & 0.05\\
  Phrase-Structure Rules&0.01 & \textit{0.18*} & 0.16 & 0.01 & 0.12\\
  Phrase-Structure with Grandparents&0.03 & \textit{0.25*} & 0.21* & 0.03 & 0.12\\
  Unigram&\textit{0.18*} & 0.17 & \textit{0.29**} & \textit{0.30**} & \textit{0.18*}\\
  Dependencies, MaltParser trained by Timbl&-0.07 & 0.02 & -0.00 & -0.08 & 0.05\\
  Arc-Head&-0.07 & 0.06 & -0.06 & -0.09 & 0.00\\
  All Features Combined&-0.02 & 0.03 & 0.01 & -0.02 & 0.07\\
\end{tabular}
 \caption{Geographic correlation for sample size 1000, 1 normalization iteration}
 \label{cor-1-1000}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&0.02 & 0.09 & 0.11 & -0.00 & 0.09\\
  Trigram&\textit{0.27*} & \textit{0.26*} & \textit{0.30**} & 0.21* & 0.08\\
  Leaf-Head&-0.03 & 0.12 & 0.14 & -0.06 & 0.02\\
  Phrase-Structure Rules&0.13 & \textit{0.36**} & 0.30** & 0.11 & \textit{0.20*}\\
  Phrase-Structure with Grandparents&0.15 & 0.41** & 0.36** & 0.14 & 0.19*\\
  Unigram&\textit{0.20*} & \textit{0.20*} & \textit{0.33**} & \textit{0.33**} & \textit{0.22*}\\
  Dependencies, MaltParser trained by Timbl&-0.02 & 0.14 & 0.16 & -0.05 & 0.02\\
  Arc-Head&-0.06 & 0.13 & -0.01 & -0.12 & -0.03\\
  All Features Combined&0.03 & 0.11 & 0.16 & -0.00 & 0.04\\
\end{tabular}
 \caption{Geographic correlation for complete sites, 1 normalization iteration}
 \label{cor-1-full}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&0.14 & 0.14 & 0.16 & 0.15 & 0.08\\
  Trigram&\textit{0.22*} & 0.17 & \textit{0.22*} & \textit{0.22*} & 0.16\\
  Leaf-Head&0.10 & 0.11 & 0.15 & 0.12 & 0.10\\
  Phrase-Structure Rules&0.14 & 0.10 & 0.14 & 0.15 & 0.06\\
  Phrase-Structure with Grandparents&0.16 & 0.14 & 0.14 & 0.15 & 0.05\\
  Unigram&0.12 & 0.11 & 0.14 & 0.13 & 0.17\\
  Dependencies, MaltParser trained by Timbl&0.09 & 0.12 & 0.16 & 0.11 & 0.11\\
  Arc-Head&0.08 & 0.10 & 0.14 & 0.10 & 0.09\\
  All Features Combined&0.19 & 0.16 & \textit{0.20*} & \textit{0.21*} & 0.11\\
\end{tabular}
 \caption{Geographic correlation for sample size 1000, 5
   normalizations}
 \label{cor-5-1000}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&-0.14 & -0.16 & -0.15 & -0.15 & -0.08\\
  Trigram&-0.09 & -0.07 & -0.09 & -0.09 & -0.09\\
  Leaf-Head&-0.22 & -0.21 & -0.18 & -0.22 & -0.10\\
  Phrase-Structure Rules&-0.19 & -0.14 & -0.11 & -0.20 & -0.01\\
  Phrase-Structure with Grandparents&-0.17 & -0.11 & -0.09 & -0.18 & -0.02\\
  Unigram&-0.10 & -0.06 & -0.07 & -0.08 & 0.14\\
  Dependencies, MaltParser trained by Timbl&-0.19 & -0.18 & -0.18 & -0.19 & -0.10\\
  Arc-Head&-0.21 & -0.18 & -0.18 & -0.21 & -0.10\\
  All Features Combined&-0.18 & -0.18 & -0.16 & -0.18 & -0.09\\
\end{tabular}
 \caption{Geographic correlation for complete sites, 5 normalizations}
 \label{cor-5-full}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&-0.03 & 0.02 & 0.01 & -0.04 & 0.07\\
  Trigram&0.20 & 0.19 & 0.11 & \textit{0.23*} & 0.14\\
  Leaf-Head&-0.07 & 0.01 & -0.01 & -0.08 & 0.05\\
  Phrase-Structure Rules&0.01 & \textit{0.18*} & 0.17 & 0.00 & 0.14\\
  Phrase-Structure with Grandparents&0.03 & \textit{0.26*} & 0.22* & 0.03 & 0.15\\
  Unigram&\textit{0.20*} & \textit{0.19*} & \textit{0.30**} & \textit{0.31**} & \textit{0.21*}\\
  Dependencies, MaltParser trained by Timbl&-0.08 & 0.02 & -0.01 & -0.09 & 0.05\\
  Arc-Head&-0.08 & 0.05 & -0.06 & -0.10 & 0.00\\
  All Features Combined&-0.03 & 0.03 & 0.01 & -0.03 & 0.06\\
\end{tabular}
 \caption{Travel correlation for sample size 1000, 1 normalization iteration}
 \label{travel-cor-1-1000}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&0.02 & 0.08 & 0.11 & 0.00 & 0.08\\
  Trigram&\textit{0.31*} & \textit{0.28*} & \textit{0.32**} & 0.26* & 0.09\\
  Leaf-Head&-0.02 & 0.12 & 0.13 & -0.05 & 0.01\\
  Phrase-Structure Rules&0.15 & \textit{0.37**} & 0.32** & 0.13 & \textit{0.22*}\\
  Phrase-Structure with Grandparents&0.17 & 0.43** & 0.38** & 0.16 & 0.22*\\
  Unigram&\textit{0.22*} & \textit{0.22*} & \textit{0.33**} & \textit{0.34**} & \textit{0.24*}\\
  Dependencies, MaltParser trained by Timbl&-0.01 & 0.14 & 0.17 & -0.04 & 0.02\\
  Arc-Head&-0.06 & 0.12 & -0.02 & -0.12 & -0.03\\
  All Features Combined&0.04 & 0.10 & 0.16 & 0.01 & 0.04\\
\end{tabular}
 \caption{Travel correlation for complete sites, 1 normalization iteration}
 \label{travel-cor-1-full}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&0.17 & 0.19* & 0.17* & 0.18 & 0.07\\
  Trigram&\textit{0.24*} & \textit{0.20*} & \textit{0.25*} & \textit{0.26*} & 0.16\\
  Leaf-Head&0.14 & 0.16 & 0.17 & 0.15 & 0.10\\
  Phrase-Structure Rules&0.17 & 0.14 & 0.16* & 0.18 & 0.06\\
  Phrase-Structure with Grandparents&0.19 & \textit{0.18*} & 0.17* & 0.19 & 0.06\\
  Unigram&0.15 & 0.13 & \textit{0.17*} & 0.16 & \textit{0.20*}\\
  Dependencies, MaltParser trained by Timbl&0.12 & 0.16 & 0.18 & 0.14 & 0.11\\
  Arc-Head&0.09 & 0.13 & 0.14 & 0.11 & 0.08\\
  All Features Combined&\textit{0.23*} & \textit{0.20*} & \textit{0.22*} & \textit{0.24*} & 0.11\\
\end{tabular}
 \caption{Travel correlation for sample size 1000, 5 normalizations}
 \label{travel-cor-5-1000}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&-0.13 & -0.13 & -0.10 & -0.13 & -0.04\\
  Trigram&-0.06 & -0.04 & -0.05 & -0.06 & -0.05\\
  Leaf-Head&-0.20 & -0.17 & -0.13 & -0.19 & -0.06\\
  Phrase-Structure Rules&-0.15 & -0.08 & -0.05 & -0.15 & 0.04\\
  Phrase-Structure with Grandparents&-0.12 & -0.05 & -0.03 & -0.13 & 0.03\\
  Unigram&-0.07 & -0.03 & -0.04 & -0.05 & \textit{0.18*}\\
  Dependencies, MaltParser trained by Timbl&-0.18 & -0.15 & -0.12 & -0.18 & -0.05\\
  Arc-Head&-0.20 & -0.17 & -0.14 & -0.19 & -0.06\\
  All Features Combined&-0.16 & -0.14 & -0.11 & -0.15 & -0.05\\
\end{tabular}
 \caption{Travel correlation for complete sites, 5 normalizations}
 \label{travel-cor-5-full}
\end{table}

From tables \ref{cor-1-1000} -- \ref{travel-cor-5-full} we see that parameter settings that correlate
significantly do so at rates around 0.2 to 0.3, with a high of 0.37
for phrase-structure-rule features measured by $R^2$, 1 normalization
iteration and comparison of full sites.  The significant
correlations are mostly concentrated in the trigram, unigram and
combined feature sets.

\subsection{Analysis}

As with the number of significant distances, trigrams and unigrams are
the most likely to correlate with geographic and travel distance,
as well as the combined feature set for the 5-normalization parameter
setting.
% As before, a possible explanation is that unigrams are
% simpler, so the type count is a higher than for other measures. With
% more rounds of normalization, more correlations shift over to
% trigrams.
Note that in tables \ref{cor-1-1000} --
\ref{travel-cor-5-full}, the significant correlations are marked with
an asterisk, but only the italicized correlations are based on at
least 95\% significant distances. For example, this means that most of
the significant correlations based on phrase-structure rules are not valid.

It is worthwhile to note, however, that the valid and significant
correlations based on phrase-structure grammars give the highest
correlations: 0.37 for $R^2$ with full-site comparisons and 1 round
of normalization.
The addition of more data and more normalization is interesting in
expanding the correlating parameter settings beyond those that include
unigram features. It may be that this is an instance of the noise/quality tradeoff.
These additions appear to extract more detail from
the data, at the cost of additional interference from noisy data.

% Goes here: Fevered speculation about why travel correlation is *better* with
% the methods that correlate *less*, for 1-full at least.
% OK never mind this isn't true.

\subsection{Inter-measure Correlation}
\label{results-chapter-inter-measure-correlation}

Correlation between measures shows that they produce similar
results. It also suggests that they use similar information to do
so. For example, cosine similarity correlates the least with the
others, which means that its results are the least like the
others. It also implies that cosine similarity uses information from
input features differently than the other measures. Since the
performance of the summed, non-cosine measures is a little better for
this site size, practical use of this distance method should probably
start with them. In other computational linguistic applications,
cosine distance is typically used with larger corpora, so it is
possible that it provides better results with larger corpora, such as
corpora based on entire provinces of Sweden rather than the individual
villages used in this dissertation.

The average correlation between different measures is given in table
\ref{self-correlation-measures}. The correlations are averaged over
the correlations for all combinations of feature set with
1000-sentence samples and with non-significant correlations removed
before averaging.

\begin{table}
  \begin{tabular}{r|cccc}
    & $R^2$ & $KL$ & $JS$ & cos \\ \hline
    $R$ & 0.85 & 0.85 & 0.98 & 0.39\\
    $R^2$&& 0.90 & 0.83 & 0.57\\
    $KL$ &&& 0.88 & 0.67\\
    $JS$ &&&& 0.44
  \end{tabular}
  \caption{Average Inter-measure-correlation of measures}
  \label{self-correlation-measures}
\end{table}

The inter-measure correlation is essentially a summary of the
results from the significance testing and correlations. $R$ and
Jensen-Shannon produce nearly identical results, and also correlate
highly. Cosine similarity is quite different from the other measures,
though the correlation is still higher than with travel distance. This
is expected insofar as the cosine operation at the heart of cosine similarity
differs more from the sums of absolute values or logarithms of other
measures.

\subsection{Correlation with Corpus Size}

As previously stated, correlation with corpus size is not predicted and is probably
an undesired defect in sampling or normalization. Correlation with
corpus size is presented in tables \ref{size-cor-1-1000} --
\ref{size-cor-5-full}.

Corpus size between two sites can be measured in two different ways:
either by the sum of the sites' sizes, or by the difference. Here
the sum is used: a larger sum means more tokens. If there is a
correlation with size, it must arise because higher token counts are
not properly normalized. In other words, two large sites will
have more tokens, leading to higher type counts, which directly leads
to higher distances. Smaller sites will lead to lower distances.

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&-0.38 & -0.26 & -0.37 & -0.40 & -0.37\\
  Trigram&0.12 & -0.12 & -0.16 & 0.14 & -0.18\\
  Leaf-Head&-0.39 & -0.26 & -0.35 & -0.43 & -0.39\\
  Phrase-Structure Rules&0.06 & 0.15 & 0.00 & 0.03 & -0.10\\
  Phrase-Structure with Grandparents&0.08 & 0.19 & 0.07 & 0.04 & -0.09\\
  Unigram&-0.08 & -0.14 & -0.09 & -0.09 & -0.10\\
  Dependencies, MaltParser trained by Timbl&-0.35 & -0.23 & -0.28 & -0.37 & -0.37\\
  Arc-Head&-0.44 & -0.26 & -0.40 & -0.48 & -0.34\\
  All Features Combined&-0.37 & -0.26 & -0.38 & -0.42 & -0.40\\
\end{tabular}
\caption{Size correlation for sample size 1000, 1 normalization}
\label{size-cor-1-1000}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&-0.19 & -0.15 & -0.16 & -0.24 & -0.36\\
  Trigram&\textit{0.30*} & 0.08 & 0.19 & 0.08 & -0.39\\
  Leaf-Head&-0.17 & -0.06 & -0.08 & -0.26 & -0.41\\
  Phrase-Structure Rules&0.52** & \textit{0.40**} & 0.30* & 0.47** & -0.21\\
  Phrase-Structure with Grandparents&0.54** & 0.43** & 0.37** & 0.50** & -0.22\\
  Unigram&-0.09 & -0.13 & -0.11 & -0.13 & -0.13\\
  Dependencies, MaltParser trained by Timbl&-0.08 & 0.02 & 0.09 & -0.14 & -0.39\\
  Arc-Head&-0.32 & -0.16 & -0.26 & -0.40 & -0.35\\
  All Features Combined&-0.15 & -0.11 & -0.10 & -0.25 & -0.42\\
\end{tabular}
\caption{Size correlation for complete sites, 1 normalization}
\label{size-cor-1-full}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&\textit{0.35*} & 0.36** & 0.06 & 0.27 & -0.32\\
  Trigram&\textit{0.75**} & \textit{0.63**} & \textit{0.46**} & \textit{0.68**} & -0.24\\
  Leaf-Head&\textit{0.46**} & \textit{0.44**} & 0.14 & \textit{0.38**} & -0.33\\
  Phrase-Structure Rules&\textit{0.85**} & \textit{0.59**} & 0.36** & \textit{0.85**} & -0.34\\
  Phrase-Structure with Grandparents&\textit{0.88**} & \textit{0.66**} & 0.40** & \textit{0.88**} & -0.36\\
  Unigram&0.38** & 0.35** & 0.14 & 0.19 & -0.04\\
  Dependencies, MaltParser trained by Timbl&\textit{0.44**} & \textit{0.41**} & 0.16 & \textit{0.39*} & -0.30\\
  Arc-Head&0.20 & 0.28* & -0.00 & 0.09 & -0.28\\
  All Features Combined&\textit{0.58**} & \textit{0.48**} & 0.21 & \textit{0.47**} & -0.31\\
\end{tabular}
 \caption{Size correlation for sample size 1000, 5 normalizations}
 \label{size-cor-5-1000}
\end{table}

\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos  \\ \hline
  Leaf-Ancestor&-0.55 & -0.38 & -0.26 & -0.53 & -0.17\\
  Trigram&-0.29 & -0.27 & -0.19 & -0.26 & -0.14\\
  Leaf-Head&-0.61 & -0.43 & -0.27 & -0.58 & -0.18\\
  Phrase-Structure Rules&-0.21 & -0.08 & -0.04 & -0.22 & -0.14\\
  Phrase-Structure with Grandparents&-0.24 & -0.08 & -0.03 & -0.26 & -0.14\\
  Unigram&-0.38 & -0.25 & -0.30 & -0.32 & -0.08\\
  Dependencies, MaltParser trained by Timbl&-0.52 & -0.33 & -0.20 & -0.51 & -0.15\\
  Arc-Head&-0.59 & -0.45 & -0.33 & -0.54 & -0.20\\
  All Features Combined&-0.61 & -0.44 & -0.26 & -0.55 & -0.18\\
\end{tabular}
\caption{Size correlation for complete sites, 5 normalizations}
\label{size-cor-5-full}
\end{table}

In tables \ref{size-cor-1-1000} and \ref{size-cor-1-full}, the
1-normalized correlations, only two correlations are
significant. However, in table \ref{size-cor-5-1000}, 5-normalized
correlations with 1000-sampling, a large number of correlations are
significant. Specifically, the highest performing measures, $R$, $R^2$,
and Jensen-Shannon divergence, correlate significantly with size for
nearly all feature sets. Since this is not a predicted correlation, it
means that these distances may be invalid. However, another piece of
evidence makes this conclusion uncertain: geographics distance also
correlates with corpus size at a rate of 0.31, $p < 0.01$, and travel
distance correlates at 0.32, $p < 0.01$. This correlation is also
unexpected, since there is no reason to expect that distance predicts
corpus size or vice versa. However, it shows that the size correlation of
dialect distance may at least by partly explained here by the
unexpected correlation with geographic and travel distance. Therefore,
5-normalised results will be presented throughout the rest of the
results.

\subsubsection{Analysis}

The correlation of corpus size and dialect distance is a problem. It
is not a predicted as a side effect of the way dialect distance is
measured. The fact that travel distance also correlates with corpus
size at a rate of 0.32 confuses the issue further. Is corpus size the
determining variable? Or is there an unknown variable influencing all
three? One possibility is ``interviewer boundaries'', common in
corpora collected by multiple people \cite{nerbonne03}.  Perhaps a
single interviewer improved with practice and collected longer interviews as
the interview collection progressed. Or perhaps cultural differences
between the interviewer and interviewees caused some participants in
one area to talk more than in another area.

Although the size correlation of the dialect distances may be
explained by the correlation with geographic/travel distance, they are still
somewhat worrying. There is a great enough difference above the
correlation of corpus size and geographic/travel distance that 5-normalized
distances might not be reliable.
However, if 5-normalization introduces a dependency on corpus size,
then the distances from full-corpus comparisons should correlate even
more highly. This is not the case.
% It appears that multiple rounds of
% normalization inadvertently re-introduce a dependency on size.
% TODO: This probably IS a bug in that only Freq norm can be
% iterated. Ratio norm should probably be in a separate loop like so:
% #ifdef RATIO_NORM
%   for(sample::iterator i = ab.begin(); i!=ab.end(); i++) {
%     i->second.first *= 2 * types / tokens;
%     i->second.second *= 2 * types / tokens;
%   }
% #endif

% TODO: I also should write this up when I have time
Alternatively, it is possible that the fixed-size sampling method is not
properly eliminating size differences between interview sites. Future work
should develop a method for normalizing a comparison between two full
sites. It should avoid sampling, but also take the relative number
of sentences into account.

\section{Clusters}
\label{section-clusters}

Cluster dendrograms provide a visualization of which sites the
distance measures find most similar. They are formed in a bottom-up
manner, repeatedly merging the two most similar groups at each step
until only one group remains. The resulting dendrogram usually has
obvious sub-trees which can be treated as clusters. By grouping sites
into clusters, cluster dendrograms allow closer comparison to
dialectology than correlation. These clusters can be compared to the
regions proposed by syntactic dialectology.

The first two dendrograms in this section hold feature set, measure,
and sample size constant at trigrams, Jensen-Shannon, and
1000-sentence samples, respectively. Then they vary the amount of
normalization: figure \ref{cluster-1-js-trigram} has 1 normalization
round, while figure \ref{cluster-5-js-trigram} has 5. These two examples
were chosen because of their high numbers of significant
distances and correlation with travel distance; the highest
correlation of 5-normalized distances with travel distance, 0.26, is
with the Jensen-Shannon measure and trigram features in figure
\ref{cluster-5-js-trigram}.

The third figure, figure \ref{cluster-1-r_sq-psg}, gives the dendrogram
for the parameter settings with the highest travel distance
correlation, phrase-structure rules, 1 normalization, 1000-sentence
samples, and $R^2$ measure. The highest correlation of 1-normalized
distances with travel distance, 0.37, is given by $R^2$ measured over
phrase-structure-rule features, comparing full sites. Those parameter
settings produce the dendrogram in figure
\ref{cluster-1-r_sq-psg}.

% Within the same settings for sampling and number of normalization
% iterations, the clusters based on sentence-length normalization alone are fairly
% similar, regardless of measure and feature set. Changing the sampling
% settings or the number of normalizations substantial reconfiguration.

% For example, the clusters produced by $R$ (figure
% \ref{cluster-1-r-trigram}) and Jensen-Shannon divergence are fairly
% similar (figure \ref{cluster-1-js-trigram}). Both are based on trigram
% features with sentence-length normalization only. Those dendrograms
% differ from their 5-normalized equivalents, figures
% \ref{cluster-5-r-trigram} and \ref{cluster-5-js-trigram}.

% \begin{figure}
%   \includegraphics[width=0.9\textwidth]{dist-1-1000-r-trigram-ratio-clusterward}
%  \caption{Dendrogram With $R$
%     measure and trigram features, 1 normalization, 1000 samples}
%   \label{cluster-1-r-trigram}
% \end{figure}

% TODO: Remove R.app's captions in favour of mine.
% TODO: Remove R.app's x-scale (y-scale) too

\begin{figure}
  \includegraphics[width=0.9\textwidth]{dist-1-1000-js-trigram-ratio-clusterward}
 \caption{Dendrogram With Jensen-Shannon
    measure and trigram features, 1 normalization, 1000 samples}
  \label{cluster-1-js-trigram}
\end{figure}

\begin{figure}
  \includegraphics[width=0.9\textwidth]{dist-5-1000-js-trigram-ratio-clusterward}
 \caption{Dendrogram With Jensen-Shannon
    measure and trigram features, 5 normalizations, 1000 samples}
  \label{cluster-5-js-trigram}
\end{figure}


\begin{figure}
  \includegraphics[width=0.9\textwidth]{dist-1-full-r_sq-psg-ratio-clusterward}
 \caption{Dendrogram With $R^2$ measure and phrase-structure-rule features,
 1 normalization, complete sites}
  \label{cluster-1-r_sq-psg}
\end{figure}


% \begin{figure}
%   \includegraphics[width=0.9\textwidth]{dist-5-1000-r-trigram-ratio-clusterward}
%  \caption{Dendrogram With $R$ measure and trigram features, 5 normalizations, 1000 samples}
%   \label{cluster-5-r-trigram}
% \end{figure}


Unlike the significances, cosine similarity's dendrograms are fairly
similar to those of other features. See for example figure
\ref{cluster-5-cos-trigram}, with cosine, trigram features and
5 iterations of normalization.

However, it is difficult to judge the amount of agreement between
these individual dendrograms. These figures are mostly given as
examples rather than for in-depth comparison. Instead of manually
comparing each to the dialect regions of Sweden, a better option is to
aggregrate them automatically into a single dendrogram, retaining only
the clusters that agree. This is a consensus tree.

\begin{figure}
 \includegraphics[width=0.9\textwidth]{dist-5-1000-cos-trigram-ratio-clusterward}
 \caption{Dendrogram with cosine measure and trigram features, 5
   normalizations}
  \label{cluster-5-cos-trigram}
\end{figure}


\subsection{Consensus Trees}
\label{section-consensus}

Consensus trees combine the results of cluster dendrograms, retaining
only clusters that occur in the majority of dendrograms. When
dendrograms have high agreement, the resulting consensus tree will
retain most of the detail. When dendrograms have low agreement, the
resulting consensus tree will be fairly flat.  This avoids the
dendrograms' problem of instability, where small changes in distances
cause large re-arrangements in the tree. Only dendrograms whose input
distances were at least 95\% significant were used. That is, a
measure/feature set combination had to be non-bold in tables
\ref{sig-1-1000} to \ref{sig-5-full} to be included. The consensus
tree for full-site comparisons and 5 rounds of normalization is not
given because there is only one dendrogram that qualifies.

It's worthwhile to note that more dendrograms were used to build
the consensus tree of figure \ref{consensus-5-1000} than were used in
figures \ref{consensus-1-1000} and \ref{consensus-1-full}. Despite this, figure
\ref{consensus-5-1000} retains much more detail, indicating that its
constituent dendrograms, based on 5 rounds of normalization,
agree more than those with only 1 round of normalization.

The consensus trees are also grouped into clusters, which are then
mapped in figures \ref{map-consensus-1-1000} --
\ref{map-consensus-5-1000}. The outline map of Sweden were provided by
Therese Leinonen and are the same as those in \namecite{leinonen08}. The
L04 package from the University of Groningen was used to map the
consensus trees onto the map of Sweden; the multi-dimensional scaling
maps and composite cluster maps also used L04.
% TODO: CITE this, I think it's a Pieter Klieweg paper


\begin{figure}
\includegraphics[scale=0.7]{consensus-1-1000}
% \Tree[. {Villberga\\Viby\\Vaxtorp\\Torso\\Tors\aa{}s\\StAnna\\Sproge\\Sorunda\\Skinnskatteberg\\Segerstad\\Ossjo\\Orust\\Norra Rorum\\Loderup\\Leksand\\K\"ola\\Jamshog\\Indal\\Frillesas\\Fole\\Faro\\Bredsatra\\Boda\\Bara\\Asby\\Arsunda\\Anundsjo\\Ankarsrum} [. {Floby\\Bengtsfors}  ] ]
\caption{Consensus Tree for 1000-samples and 1 normalization}
\label{consensus-1-1000}
\end{figure}

\begin{figure}
\includegraphics[scale=0.7]{consensus-1-full}
% \Tree[. {Villberga\\Viby\\Torso\\Tors\aa{}s\\Sorunda\\Segerstad\\Ossjo\\Orust\\Norra Rorum\\Loderup\\Leksand\\K\"ola\\Indal\\Fole\\Boda\\Bara\\Asby\\Arsunda\\Anundsjo\\Ankarsrum}
%   [. {Vaxtorp\\Skinnskatteberg}  ]
%   [. {StAnna\\Frillesas}  ]
%   [. {Sproge\\Faro}  ]
%   [. {Jamshog\\Bredsatra}  ]
%   [. {Floby\\Bengtsfors}  ] ]
\caption{Consensus Tree for full site comparison and 1 normalization}
\label{consensus-1-full}
\end{figure}

\begin{figure}
\includegraphics[scale=0.7]{consensus-5-1000}
% \Tree[. {Villberga\\Viby\\Vaxtorp\\Torso\\StAnna\\Sproge\\Sorunda\\Skinnskatteberg\\Segerstad\\Orust\\Norra Rorum\\Leksand\\K\"ola\\Indal\\Frillesas\\Fole\\Floby\\Faro\\Boda\\Bengtsfors\\Bara\\Asby\\Arsunda\\Anundsjo\\Ankarsrum} [. {Loderup\\Bredsatra}  ] [. {Tors\aa{}s\\Ossjo\\Jamshog}  ] ]
\caption{Consensus Tree for 1000-samples and 5 normalizations}
\label{consensus-5-1000}
\end{figure}

\begin{figure}
\includegraphics[scale=0.85]{Sverigekarta-Landskap-consensus-1-1000}
\caption{Consensus Tree for 1000-samples and 1 normalization, Mapped}
\label{map-consensus-1-1000}
\end{figure}

\begin{figure}
\includegraphics[scale=0.85]{Sverigekarta-Landskap-consensus-1-full}
\caption{Consensus Tree for full site comparison and 1 normalization, Mapped}
\label{map-consensus-1-full}
\end{figure}

\begin{figure}
\includegraphics[scale=0.85]{Sverigekarta-Landskap-consensus-5-1000}
\caption{Consensus Tree for 1000-samples and 5 normalizations, Mapped}
\label{map-consensus-5-1000}
\end{figure}

% It would still be cool to eliminate only the non-significant distances
% and re-run the clusters. (I can't remember if that's easily possible
% with R though, it may only be a feature of MDS.)

\subsubsection{Analysis}

The cluster dendrograms are dangerous to interpret too closely on
their own; the instability of a single dendrogram means that small
clusters cannot be analyzed reliably. For example, in figure
\ref{cluster-5-js-trigram}, a two-way split between the sites on the
top and bottom of the page is obvious, and another in the top cluster
is easy to argue for, but outliers like Anundsj\"o and \AA{}rsunda are
likely to shift from group to group in other dendrograms.

It is safer to analyze the consensus trees; the smoothing effect of
taking the majority rule of each cluster will show where the optimal
cutoff for splitting clusters is removing spurious detail. The three
consensus trees in figures \ref{consensus-1-1000} --
\ref{consensus-5-1000} vary in amount of detail but the trees with
more clusters do not contradict the clusters of the flatter trees.

For 1000-sentence samples and 1 round of normalization, there is one
cluster: Floby and Bengtsfors. Full-site comparison finds
another cluster: J\"amshog, \"Ossj\"o and Tors\aa{}s. Finally,
1000-sentence samples and 5 rounds of normalization finds another
cluster consisting of L\"oderup and Breds\"atra. It also finds
a large two-way split between the sites and adds Sproge to the first
cluster with Floby and Bengtsfors. To aid further analysis, the
clusters are assigned colors, which are detailed in figures
\ref{blue-cluster} -- \ref{orange-cluster}. 

\begin{figure}
\begin{itemize}
\item Floby
\item Bengtsfors
\item Sproge (for 1000-sample, 5-normalization)
\end{itemize}
\caption{Blue Cluster}
\label{blue-cluster}
\end{figure}

\begin{figure}
\begin{itemize}
\item J\"amsh\"og
\item Tors\aa{}s
\item \"Ossj\"o
\end{itemize}
\caption{Red Cluster}
\label{red-cluster}
\end{figure}

\begin{figure}
\begin{itemize}
\item Breds\"atra
\item L\"oderup
\end{itemize}
\caption{Yellow Cluster}
\label{yellow-cluster}
\end{figure}

\begin{figure}
\begin{itemize}
\item Leksand
\item Indal
\item Segerstad
\item Floby
\item Bengtsfors
\item Sproge
\item Skinnskatteberg
\item Orust
\item V\aa{}xtorp
\item F\aa{}r\"o
\item Asby
\item \AA{}rsunda
\item Anundsj\"o
\item Ankarsrum
\item Fole
\end{itemize}
\caption{Cyan Cluster}
\label{cyan-cluster}
\end{figure}

\begin{figure}
\begin{itemize}
\item Viby
\item Bara
\item S:t Anna
\item Frilles\aa{}s
\item J\"amshog
\item Tors\aa{}s
\item \"Ossj\"o
\item K\"ola
\item L\"oderup
\item Breds\"atra
\item Villberga
\item Tors\"o
\item Norra R\"orum
\item Sorunda
\item B\"oda
\end{itemize}
\caption{Orange Cluster}
\label{orange-cluster}
\end{figure}

When these clusters are mapped onto the geography of Sweden, some
patterns are visible. Since figure \ref{consensus-5-1000} is strictly
more complex than the preceding two, it is used as the basis for this
analysis--see figure \ref{map-consensus-5-1000}. The large two-way split
is between the orange and cyan clusters. The orange cluster, which
includes red and yellow clusters, forms two horizontal bands across
Sweden. The centers of the orange cluster appear to be Stockholm and
Malm\"o. Meanwhile, the red and yellow clusters form a boundary along
the northern border of Sk\aa{}ne and Blekinge counties.

Meanwhile, the cyan cluster, which includes the blue cluster, seems to
represent the countryside of Sweden. On the other hand, because the
blue cluster is near G\"oteborg, it might be better characterized
simply as ``non-Stockholm''. This matches the traditional dialect
regions of Sweden, with the exception of of the city/country divide,
and the fact that this hard clustering simplifies the dialect
boundaries, which are traditionally believed to be gradient. Also, the
island Gotland is not put in a separate cluster as predicted by
traditional boundaries. For discussion, see section
\ref{discussion-chapter-dialectology-section}.


\subsection{Composite Cluster Maps}
\label{section-composite-cluster}

Composite cluster maps use an underlying technique similar to
consensus trees--cluster dendrograms, but they combine and present the
information in a very different way. They, too, provide a stabler view
of the groups that sites form when clustered. This view, however,
emphasizes the boundaries between sites. The result looks
much more like the traditional isogloss boundaries of
dialectology.

The two composite cluster maps in figures \ref{map-composite-1-1000}
-- \ref{map-composite-5-1000} are the composite of the same
dendrograms used as input for the consensus trees: all-significant
parameter settings, divided by type of normalization (sentence-length
only or ratio added as well).

\begin{figure}
\includegraphics[scale=0.82]{Sverigekarta-cluster-1-1000}
\caption{Composite Cluster Map for 1000-sample, 1 normalization}
\label{map-composite-1-1000}
\end{figure}

\begin{figure}
\includegraphics[scale=0.82]{Sverigekarta-cluster-1-full}
\caption{Composite Cluster Map for complete sites, 1 normalization}
\label{map-composite-1-full}
\end{figure}

\begin{figure}
\includegraphics[scale=0.82]{Sverigekarta-cluster-5-1000}
\caption{Composite Cluster Map for complete sites, 5 normalizations}
\label{map-composite-5-1000}
\end{figure}

All three composite clusters maps provide a picture similar to the
consensus tree map \ref{map-consensus-5-1000} of the previous
section. The north-to-south gradient is supported by the
weak horizontal boundaries present up and down Sweden.

Of these boundaries, the one between Sk\aa{}ne and the rest of Sweden is
the strongest. Due to the lack of interview sites in the middle of
south Sweden, the boundary is drawn further north than it
traditionally appears, but this is an effect of the software that
produced the figure. Notice that there is also a boundary between the
red cluster, comprised of J\"amshog, Tors\aa{}s, and \"Ossj\"o, and the
other sites, especially visible in figures \ref{map-composite-1-1000} and
\ref{map-composite-5-1000}. Their presence along the northern border
of Sk\aa{}ne is one reason why its boundary with the rest of Sweden is
so strong.

\begin{sloppypar}
  Compared to the consensus tree maps, the composite cluster maps
  cannot support the city/country distinction because there is no way
  to identify distant areas by their color. On the other hand, it is
  possible to detect the relative strength of a boundary. To combine
  these two features, multi-dimensional scaling is needed.
\end{sloppypar}
% But of course MDS maps can't be combined into a consensus...

% However, K\"ola and Frilles\aa{}s still separate fairly well from their
% neighbors. These sites are on the edges of the country and have strong borders
% with surrounding, Like the cluster J\"amshog, Tors\aa{}s and \"Ossj\"o,
% these sites are different from the others. However, they don't have
% any geographic coherence, so it is more likely these are remnants of a
% dialect that was historically wider spread and has since receded.


\section{Multi-Dimensional Scaling}
\label{section-mds}

\begin{sloppypar}
  Multi-dimensional scaling (MDS) plays a similar role to clusters,
  condensing the high-dimensional information into an form that is
  easier to understand. It differs, however, in producing gradient
  numbers, not binary trees: cluster dendrograms put each site into
  one and only one cluster, whereas MDS puts each site into 3D space;
  the clusters are only implicit in the positions.  This also means
  that MDS maps are more stable than dendrograms.
\end{sloppypar}

\quotecite{kruskal64b} MDS works by positioning the dissimilarities in
high-dimensional space, then converting them to true distances in some
lower dimensional space---in this case, three dimensions. It distorts
the dissimilarities equally and by the minimum amount
necessary. Kruskal calls the measure of distortion Stress. Once the
sites are points in 3D space, each sites' x, y, and z co-ordinates can
be mapped onto the colors red, green, and blue, then drawn on a map of
Sweden.

It must be noted that the maps vary in color because of the way that
MDS positions the sites in 3D space, based on the distances between
them. Kruskal's method guarantees that its results are comparable for
equivalent inputs, but this may not always be obvious because the
color equivalence may be difficult to decipher. Equivalent MDS maps
may be rotated with respect to each other in 3D space, and this
rotation is visible in the color selection: if two sites are both blue
in one map and in another map are both orange, then they have the same
relation to each other.


The maps shown here in figures \ref{mds-1-1000-js-trigram} --
\ref{mds-1-full-r_sq-psg} are based on the same parameter settings as the
dendrograms in figures \ref{cluster-1-js-trigram} --
\ref{cluster-1-r_sq-psg}. The first two are Jensen-Shannon divergence
measured over trigrams, with 1 and 5 rounds of normalization,
respectively. The third is $R^2$ measured over phrase-structure rules
with 1 round of normalization.
% 1st is not used
% 3rd is not used
% 5th is not used

% \begin{figure}
% \includegraphics[scale=0.82]{Sverigekarta-mds-1-1000-r-trigram-ratio}
% \caption{$R$ measure with trigram features, 1000-sentence sampling and
%   1 round of normalization}
% \label{mds-1-1000-r-trigram}
% \end{figure}

\begin{figure}
\includegraphics[scale=0.82]{Sverigekarta-mds-1-1000-js-trigram-ratio}
\caption{Jensen-Shannon measure with trigram features, 1000-sentence sampling and
  1 round of normalization}
\label{mds-1-1000-js-trigram}
\end{figure}

% \begin{figure}
% \includegraphics[scale=0.82]{Sverigekarta-mds-1-full-r-trigram-ratio}
% \caption{$R$ measure with trigram features, full-site comparison and
%   1 round of normalization}
% \label{mds-1-full-r-trigram}
% \end{figure}

% \begin{figure}
% \includegraphics[scale=0.82]{Sverigekarta-mds-5-1000-r-trigram-ratio}
% \caption{$R$ measure with trigram features, 1000-sentence sampling and
%   5 rounds of normalization}
% \label{mds-5-1000-r-trigram}
% \end{figure}

\begin{figure}
\includegraphics[scale=0.82]{Sverigekarta-mds-5-1000-js-trigram-ratio}
\caption{Jensen-Shannon measure with trigram features, 1000-sentence sampling and
  5 rounds of normalization}
\label{mds-5-1000-js-trigram}
\end{figure}

\begin{figure}
\includegraphics[scale=0.82]{Sverigekarta-mds-1-full-r_sq-psg-ratio}
\caption{$R^2$ measure with phrase-structure-rule features, full-site comparison and
  1 round of normalization}
\label{mds-1-full-r_sq-psg}
\end{figure}

Despite the differences between MDS and the preceding methods, the
similar results are evident; the maps (figures
\ref{mds-1-1000-js-trigram} -- \ref{mds-1-full-r_sq-psg}) all show
the same patterns as the other methods. That is, there is a general
north-to-south gradience, especially easy to see in map
\ref{mds-1-1000-js-trigram}. There is a strong southern cluster,
visible in all of the diagrams. And there is a general two-way
distinction between city and country.

The main contribution that the MDS maps make is that the
north-to-south gradient is more obviously gradient. In other words, it
is easier to see the gradation from north to south. For example, in
figure \ref{mds-1-full-r_sq-psg}, looking from the north to south, the
colors change quickly close to Stockholm, then fade to green further
south, then transition back to blues and purples further south, in
Sk\aa{}ne.

The Stockholm and Malm\"o areas, which are in the same cluster in the
consensus tree maps, are here seen to be similar without being
identical. For example, in figure \ref{mds-5-1000-js-trigram}, the
Stockholm area is a shade of blue-green while the Malm\"o area is a
shade of blue-grey. Also in figure \ref{mds-5-1000-js-trigram},
Sk\aa{}ne and Blekinge are grey: clearly similar but not identical to
Malm\"o.

\section{Features}
\label{section-features}

Ranked features answer the question of agreement with dialectology
more precisely than the previous two methods. Features are ranked by
their normalized weight, which tells how much weight the distance
measure will give it. This can reveal aggregate differences that may
only be noticeable when counting a large amount of data. Conversely,
with different normalization settings, feature ranking can also point
out rare features that only occur in one site. The first kind of
features are unlikely to be noticed by linguists without the aid of
computers, whereas the second kind are the rare features that are easy
for linguists to notice.

Both kinds of normalizations are shown below; in the first set of
rankings, the normalizations for sentence size are applied, as
described in section \ref{normalization}, whereas the second set of
ranking is normalized for relative overuse, based on
\quotecite{wiersma09} normalization described in
\section{wiersma-normalization}. The overuse normalization shows which
features are used relatively more when comparing two sites. It
normalizes for feature frequency between two sites.

Without the overuse normalization, the top-ranked features will tend to
be the most common ones, those found in almost every sentence in the
interview. These common features tend to highlight gradient
differences: differences in quantity but not in quality. In contrast,
the overuse normalization allows us to see which features happen only
a few times in one side of the comparison and not at all in the
other. This is closer to a traditional linguistic analysis.

In addition, only features that appear in both groups being compared
were ranked; although features that only appear in one or the other
can be interesting, they tend to be noisy in features extracted from
automatically annotated corpora. It is not possible to tell which
unique features are interesting and which are noise, especially when
using the overuse normalization, which makes rarely occurring features
rank similarly to common ones.

These results compare clusters from the consensus trees
based on 1 round of normalization (figures \ref{consensus-1-1000} and
\ref{consensus-1-full}) as well as the consensus tree based on 5
rounds of normalization and a 1000-sentence sample (figure
\ref{consensus-5-1000}). The consensus tree for 5 rounds of
normalization and full-site comparisons only had one tree for input
and was not usable. Given these three consensus trees, the groups in
table \ref{feature-ranking-clusters} are the relevant ones for analysis.

There are four clusters, three small and one large which contains the
remainder of the sites. They are listed in table
\ref{feature-ranking-clusters}. Cluster A, containing Floby and
Bengtsfors, appears in all three consensus trees. Its features are
colored blue in the following figures. Cluster B, containing Jamshog,
Torsas and Ossjo, appears in the second two trees. It features are
colored red. Cluster C, containing Loderup and Bredsatra, appears only
in the third tree. Its features are colored yellow. The remainder of
the sites are in Cluster D; the third consensus tree differs from the
first two in splitting the remainder into two groups, but this
division is ignored here to reduce the number of comparisons. Between
large groups of sites, such comparisons are unlikely to be informative
anyway.

\begin{table}
  \begin{enumerate}
   \item[A] (Blue) Floby, Bengtsfors
    \item[B] (Red) J\"amshog, \"Ossj\"o, Tors\aa{}s
    \item[C] (Yellow) L\"oderup, Breds\"atra
    \item[D] (Cyan) Segerstad, K\"ola, S:t Anna, Sorunda, Norra Rorum,
      Villberga, Torso, Boda, Frilles\aa{}s, Indal, Leksand, Anundsj\"o,
      \AA{}rsunda, Asby, Orust, V\aa{}xtorp, Fole, Sproge, F\aa{}r\"o,
      Ankarsrum, Skinnskatteberg
  \end{enumerate}
  \caption{Clusters discussed}
  \label{feature-ranking-clusters}
\end{table}

For each pair of clusters, I rank and analyze the input features by
comparing feature differences. The features presented here are the ten
highest ranked features for a particular comparison. Although each
feature set has ten features ranked here, they are better thought of
as two sets of five features differences. The top five positive
features are shown as are the top five negative features, scaled such
that the most important feature has the value 1.0.

This has two advantages. It splits the features so that both the
positive and negative evidence are always visible; otherwise, in some
cases, if one side is strong enough, the other would be pushed out of
the top ten. However, it still allows the relative weight of evidence
to be estimated. For example, if some cluster has some idiosyncratic
features, most of the features will be positive, meaning that features
typical of that cluster contribute most to the distance between it and
other clusters. The two-part feature will show this: the five positive
features will have much higher values than the five negative features.


The first subsection, \ref{feature-ranking-complete}, shows all
comparisons between clusters for a single parameter setting: trigram
features, 1000-sentence sampling and sentence-size normalization
only. Besides unigrams, these are the parameters that give the highest
correlation with travel distance for 1000-sentence sampling.
In the next subsection, \ref{feature-ranking-overuse}, the overuse
normalization is added, keeping other parameter settings the same.
The third subsection, \ref{feature-ranking-feature-sets},
a single comparison between cluster A and cluster B is given for
all feature sets.
In the final subsection, \ref{feature-ranking-psg}, the high-ranked
phrase-structure rules are given.
The parts of speech for the features are given in table
\ref{pos-list}. The non-terminals are given in table
\ref{nonterminal-list}.

\begin{table}
\begin{tabular}{c|c||c|c}
  POS & Part of speech & POS & Part of speech \\ \hline
 $++$ & coordinating conjunction&MV & verb ``m\aa{}ste'' (must) \\
 AB & adverb&NN & noun \\
 AJ & adjective&PN & proper name \\
 AN & adjectival noun&PO & pronoun \\
 AV & verb ``vara'' (be)&PR & preposition \\
 BV & verb ``bli(va)'' (become) &QV & verb ``kunna'' (can) \\
EN & indefinite article&RO & numeral \\
 FV & verb ``f\aa{}'' (get) &SP & present participle \\
 GV & verb ``g\"ora'' (do) &SV & verb ``skola'' (shall) \\
 HV & verb ``hava'' (have) &UK & subordinating conjunction \\
 I? & question mark &VN & verbal noun \\
 ID & idiom &VV & other verb \\
 IM & infinitive marker &WV & verb ``vilja'' (want) \\
 IP & period &XX & Unclassifiable \\
MN & meta-noun &YY & Interjection \\ \hline
\end{tabular}
\vspace{5mm}
\caption{List of parts of speech}
\label{pos-list}
\end{table}

\begin{table}
  \begin{tabular}{c|c}
    NT & Non-terminal \\ \hline
     $+$F & Coordination at main clause level \\
     AA & other adverbial \\
     AP & adjective phrase \\
     AVP & adverb phrase \\
     CAP & Coordinated adjective phrase \\
     CNP & Coordinated noun phrase \\
     CONJP & Other coordinated phrase \\
     CS & Coordinated S \\
     ET & Other nominal post-modifier \\
     MS & Macrosyntagm \\
     NAC & Not a constituent \\
     NP & Noun phrase \\
     OA & Object adverbial \\
     PP & Prepositional phrase \\
     RA & Place adverbial \\
     S & Sentence \\
     SS & Other subject \\
     TA & Time adverbial \\ \hline
   \end{tabular}
   \vspace{5mm}
    \caption{List of non-terminal labels}
    \label{nonterminal-list}
\end{table}

\subsection{Trigram Features}
\label{feature-ranking-complete}

The analysis will start with trigram features without the overuse
normalization, since trigrams have the highest rate of significance of
the non-combined feature sets. (The combined feature set is not
presented because the mixed feature types make it difficult to
interpret.)

As mentioned above, the top-ranked trigrams are common, typical of the
core of the sentence. The trigrams typical of cluster A, are formed
most often, for example, from a trigram like \textit{ ``och det
  \"ar''} ``and it's'', followed by an adverb such as
\textit{``v\"al''} ``well''. Figures
\ref{clusterA-clusterB-feat-5-1000-trigram-ratio} --
\ref{clusterA-clusterD-feat-5-1000-trigram-ratio} generalize this
observation by the high rankings of the POS trigrams PO-AV-AB
(pronoun-copula-adverb, figure
\ref{clusterA-clusterB-feat-5-1000-trigram-ratio}), $++$-PO-AV
(conjunction-pronoun-copula, figure
\ref{clusterA-clusterB-feat-5-1000-trigram-ratio}), and PO-VV-AB
(pronoun-verb-adverb, figure
\ref{clusterA-clusterD-feat-5-1000-trigram-ratio}) The same is true of
the other clusters for the most part. Unfortunately, this makes it
hard to say interesting things about the difference in feature
distribution. It does appear that clusters B and C use adverbs and of
conjunctions that differ from the other clusters; for example
$++$-AB-AV in in figure
\ref{clusterA-clusterC-feat-5-1000-trigram-ratio}. The comparison
between cluster A and cluster B highlights the trigram AB-AB-AB
(figure \ref{clusterA-clusterB-feat-5-1000-trigram-ratio}) as
important, but more interesting are the $++$-AB-AV
(conjunction-copula-adverb) and AB-AV-AB (adverb-copula-adverb)
trigrams in the bottom halves of figures
\ref{clusterA-clusterB-feat-5-1000-trigram-ratio} and
\ref{clusterA-clusterC-feat-5-1000-trigram-ratio}. These trigrams
derive from sequences like \textit{``och s\aa{} \"ar''} (``and is
so'') and \textit{``inte \"ar ju''} (``is not now''), which do
not reflect standard Swedish grammar.

% TODO: Get a whole example sentence probably. Ugh. People want so
% much context!
% TODO: Also format these examples properly.

\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterB-feat-5-1000-trigram-ratio}
  \caption{cluster A $\Leftrightarrow$ cluster B, trigram features}
  \label{clusterA-clusterB-feat-5-1000-trigram-ratio}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterC-feat-5-1000-trigram-ratio}
  \caption{cluster A $\Leftrightarrow$ cluster C, trigram features}
  \label{clusterA-clusterC-feat-5-1000-trigram-ratio}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterD-feat-5-1000-trigram-ratio}
  \caption{cluster A $\Leftrightarrow$ cluster D, trigram features}
  \label{clusterA-clusterD-feat-5-1000-trigram-ratio}
\end{figure}


\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterC-feat-5-1000-trigram-ratio}
  \caption{cluster B $\Leftrightarrow$ cluster C, trigram features}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterD-feat-5-1000-trigram-ratio}
  \caption{cluster B $\Leftrightarrow$ cluster D, trigram features}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterC-clusterD-feat-5-1000-trigram-ratio}
  \caption{cluster C $\Leftrightarrow$ cluster D, trigram features}
\end{figure}

\subsection{Trigrams with Overuse Normalization}
\label{feature-ranking-overuse}

Given this lack of information, there are two dimensions along which
the comparisons can be altered: normalization and feature
set. Starting with normalization, let us add the overuse normalization
technique. Differences appear immediately. First, the balance of
feature weight obviously differs here. For example, in the comparison
between cluster A and cluster B (figure
\ref{clusterA-clusterB-feat-5-1000-trigram-over}), the features of
cluster A are more important in distinguishing the two than the
features of cluster B. The comparison between cluster A and cluster D (figure
\ref{clusterA-clusterD-feat-5-1000-trigram-over})
is so lop-sided that cluster D contributes no features at all. This
occurs when none of the features the two clusters share are overused
in cluster D.

With the overuse normalization, cluster A has two interesting
patterns. First, the trigrams it overuses are filled with indefinite
articles (EN) and prepositions (PR). Examples include VV-EN-AB
(verb-indefinite-adverb), PR-EN-AB (preposition-indefinite-adverb) and
PR-EN-VN (preposition-indefinite-verbal noun) in figure
\ref{clusterA-clusterB-feat-5-1000-trigram-over}, as well as IM-PR-NN
(infinitive marker-preposition-noun) and PR-ID-PR
(preposition-idiom-preposition) in figure
\ref{clusterA-clusterD-feat-5-1000-trigram-over}. 
The sequence  \textit{``arrr''}
Second, the trigrams
it underuses mostly end with pronouns: 4 of 5 trigrams in the
comparison with cluster B (figure
\ref{clusterA-clusterB-feat-5-1000-trigram-over}) and 4 or 5 in the
comparison with cluster C (figure
\ref{clusterA-clusterC-feat-5-1000-trigram-over}). Even in the
comparison with cluster D (figure
\ref{clusterA-clusterD-feat-5-1000-trigram-over}), 4 of 5 of the
``least overused'' trigrams end with pronouns. (The low values in the
bottom half of the comparison with cluster D are not underused by
cluster A, because cluster D has no unique features here. Instead they
are the ``least overused'' by cluster A.)

Cluster B shows one interesting pattern: overuse of sk\"ola (shall),
including an interesting trigram SV-QV-AB (shall verb-can verb-adverb)
in figure \ref{clusterB-clusterD-feat-5-1000-trigram-over}. Although
this could be a mistake on the part of the tagger, the different forms
of this verb are limited, so this is unlikely: identifying them is not
hard. An example utterance with this pattern is \textit{``F\"or det
  var \ldots''} ``For it was that one \ldots''. Here the construction
\textit{``skulla kunna''} appears to be a double modal, similar to the
English double modal ``should can''.
% TODO: Get the full sentence either from jones or flenser
%% a quick search suggests that Fennell and Butters (1996) finds
%% evidence in German and Scandinavian languages...but it's a book ro
%% something. Google Scholar has no link, just a wimpy citation.
%% Also:
%% Modals and double modals in the Scandinavian languages
%% Working papers in Scandinavian syntax
%% Thrainsson and Vikner 1995 (but focussing on Danish and Icelandic)

Cluster C doesn't gain any interesting patterns with overuse
normalization in figures
\ref{clusterA-clusterC-feat-5-1000-trigram-over},
\ref{clusterB-clusterC-feat-5-1000-trigram-over}, and
\ref{clusterC-clusterD-feat-5-1000-trigram-over}, except for a
surprising variety in the verbs: g\"ora (do), hava (have), kunna
(can), sk\"ola (shall), vara (be) and vilja (want). Many uses of
adverbs show up as well. It is not clear what either of these patterns
mean linguistically, however.
% I have no idea whether to make that verb singular or plural.
% So like whatever.

Cluster D gives no information whatsoever when the overuse
normalization is added, simply because it has no informative
features. This is expected, given its nature as a combination of many
sites. The tradeoff of more informative features for the smaller
clusters is worthwhile.

\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterB-feat-5-1000-trigram-over}
  \caption{cluster A $\Leftrightarrow$ cluster B, trigram features
    with overuse normalization}
  \label{clusterA-clusterB-feat-5-1000-trigram-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterC-feat-5-1000-trigram-over}
  \caption{cluster A $\Leftrightarrow$ cluster C, trigram features
    with overuse normalization}
  \label{clusterA-clusterC-feat-5-1000-trigram-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterD-feat-5-1000-trigram-over}
  \caption{cluster A $\Leftrightarrow$ cluster D, trigram features
    with overuse normalization}
  \label{clusterA-clusterD-feat-5-1000-trigram-over}
\end{figure}


\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterC-feat-5-1000-trigram-over}
  \caption{cluster B $\Leftrightarrow$ cluster C, trigram features
    with overuse normalization}
  \label{clusterB-clusterC-feat-5-1000-trigram-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterD-feat-5-1000-trigram-over}
  \caption{cluster B $\Leftrightarrow$ cluster D, trigram features
    with overuse normalization}
  \label{clusterB-clusterD-feat-5-1000-trigram-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterC-clusterD-feat-5-1000-trigram-over}
  \caption{cluster C $\Leftrightarrow$ cluster D, trigram features
    with overuse normalization}
  \label{clusterC-clusterD-feat-5-1000-trigram-over}
\end{figure}

\subsection{Variation Across Feature Sets}
\label{feature-ranking-feature-sets}

Moving to other feature sets with overuse normalization, leaf-ancestor
paths and leaf-head paths, figures
\ref{clusterA-clusterB-feat-5-1000-path-over} and
\ref{clusterA-clusterB-feat-5-1000-dep-over}, give additional
information about cluster A that lead to the conclusion its defining
characteristic is simple sentences, simpler at least than the other
clusters. Specifically, cluster A's overused leaf-ancestor paths
include few nested sentences (figure
\ref{clusterA-clusterB-feat-5-1000-path-over}). This contrasts sharply
with cluster B and cluster C in figures
\ref{clusterB-clusterC-feat-5-1000-path-over} --
\ref{clusterC-clusterD-feat-5-1000-path-over}, which include many nested
sentences. Cluster A does have complex paths, but they feature
prepositional phrases. (Note: NAC stands for ``not a constituent'' and
indicates that the parser could not decide what the correct
constituent was at that point, or that there are crossing branches,
which is less common.)

This characteristic of cluster A appears in the leaf-head paths as
well (figure \ref{clusterA-clusterB-feat-5-1000-dep-over}); cluster
A's paths contain many [adjective]-noun-preposition sequences, but few
verb-verb sequences that indicate nested phrases. Again, cluster B and
cluster C (figures \ref{clusterB-clusterC-feat-5-1000-path-over} --
\ref{clusterC-clusterD-feat-5-1000-path-over}) have many of these
sequences. Both clusters have a number of overused adverb features as
well, similar to the trigram results. Note that comparison to cluster
D is less interesting. Because it has fewer unique characteristics,
when compared to it, clusters A, B and C show more generic
characteristics. For example, all three clusters show that their
sentences are generally more complex than the general sites in cluster
D. This may be a sign that the normalizations are not fully working;
cluster D is larger than A, which is larger than clusters B and C, so
it seems that larger sets of sites are decided based on simpler
features.

\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterB-feat-5-1000-path-over}
  \caption{cluster A $\Leftrightarrow$ cluster B, leaf-ancestor path features}
  \label{clusterA-clusterB-feat-5-1000-path-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterC-feat-5-1000-path-over}
  \caption{cluster A $\Leftrightarrow$ cluster C, leaf-ancestor path features}
  \label{clusterA-clusterC-feat-5-1000-path-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterD-feat-5-1000-path-over}
  \caption{cluster A $\Leftrightarrow$ cluster D, leaf-ancestor path features}
  \label{clusterA-clusterD-feat-5-1000-path-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterC-feat-5-1000-path-over}
  \caption{cluster B $\Leftrightarrow$ cluster C, leaf-ancestor path features}
  \label{clusterB-clusterC-feat-5-1000-path-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterD-feat-5-1000-path-over}
  \caption{cluster B $\Leftrightarrow$ cluster D, leaf-ancestor path features}
  \label{clusterB-clusterD-feat-5-1000-path-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterC-clusterD-feat-5-1000-path-over}
  \caption{cluster C $\Leftrightarrow$ cluster D, leaf-ancestor path features}
  \label{clusterC-clusterD-feat-5-1000-path-over}
\end{figure}


\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterB-feat-5-1000-dep-over}
  \caption{cluster A $\Leftrightarrow$ cluster B, leaf-head features}
  \label{clusterA-clusterB-feat-5-1000-dep-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterC-feat-5-1000-dep-over}
  \caption{cluster A $\Leftrightarrow$ cluster C, leaf-head features}
  \label{clusterA-clusterC-feat-5-1000-dep-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterD-feat-5-1000-dep-over}
  \caption{cluster A $\Leftrightarrow$ cluster D, leaf-head features}
  \label{clusterA-clusterD-feat-5-1000-dep-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterC-feat-5-1000-dep-over}
  \caption{cluster B $\Leftrightarrow$ cluster C, leaf-head features}
  \label{clusterB-clusterC-feat-5-1000-dep-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterD-feat-5-1000-dep-over}
  \caption{cluster B $\Leftrightarrow$ cluster D, leaf-head features}
  \label{clusterB-clusterD-feat-5-1000-dep-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterC-clusterD-feat-5-1000-dep-over}
  \caption{cluster C $\Leftrightarrow$ cluster D, leaf-head features}
  \label{clusterC-clusterD-feat-5-1000-dep-over}
\end{figure}


\subsection{Phrase-structure rule features}
\label{feature-ranking-psg}

\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterB-feat-5-1000-psg-over}
  \caption{cluster A $\Leftrightarrow$ cluster B, phrase-structure
    rule features}
  \label{clusterA-clusterB-feat-5-1000-psg-over2}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterC-feat-5-1000-psg-over}
  \caption{cluster A $\Leftrightarrow$ cluster C, phrase-structure rule features}
  \label{clusterA-clusterC-feat-5-1000-psg-over}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterA-clusterD-feat-5-1000-psg-over}
  \caption{cluster A $\Leftrightarrow$ cluster D, phrase-structure rule features}
  \label{clusterA-clusterD-feat-5-1000-psg-over}
\end{figure}


\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterC-feat-5-1000-psg-over}
  \caption{cluster B $\Leftrightarrow$ cluster C, phrase-structure rule features}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterB-clusterD-feat-5-1000-psg-over}
  \caption{cluster B $\Leftrightarrow$ cluster D, phrase-structure rule features}
\end{figure}
\begin{figure}
  \includegraphics[scale=1.2]{clusterC-clusterD-feat-5-1000-psg-over}
  \caption{cluster C $\Leftrightarrow$ cluster D, phrase-structure rule features}
\end{figure}

Analysis of the phrase-structure-rule features is difficult because of
all the noise. In figures
\ref{clusterA-clusterB-feat-5-1000-psg-over2} and
\ref{clusterA-clusterD-feat-5-1000-psg-over}, features like
S$\to++$-AB (conjunction-adverb) S$\to$FV-PO-AB-VV (get
verb-pronoun-adverb-verb) are hard to describe as anything but junk
rules created by the parser. On the other hand, there are a lot of
linguistically odd but reasonable rules like S$\to$PO-AV-NP-IP
(pronoun-copula-noun phrase-period) in figure
\ref{clusterA-clusterC-feat-5-1000-psg-over}. Although this is not a
good linguistic decomposition, it is one that a statistical parser
would create when copular sentences are
common enough.

Overall both normalizations leave something to be desired; without
overuse normalization, only very common features appear. These
features convey only basic information, making it hard to identify
characteristics of a cluster. On the other hand, the overuse
normalization is susceptible to noise, especially for more error-prone
feature sets. Even though more detail may be available with this
normalization step, the features must be inspected for general trends
because individual features are not necessarily reliable.

\section{Conclusion}

This chapter provided results and analysis of the results with no
comparison to other work. The next chapter
will compare the results to dialectology, phonological dialectometry,
and previous work in syntactic dialectometry. Even before this
comparison, however, the distance measure seems to be successful at
producing dialect distance.

Quite a few patterns are visible: the significance tests show that the
distance measure is finding significant distances for most parameter
settings; the analysis of correlation shows that some correlate with
geographic and travel distance. The dendrogram maps, composite cluster
maps, and MDS maps all show a picture of with fairly well-defined areas.
Finally, the feature rankings show some interesting patterns but
nothing definitive. More qualitative analysis is needed in the future.


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "dissertation.tex"
%%% End: