\chapter{Results}
\label{results-chapter}
These results are meant to answer two main questions: first, how well does
this approach to syntactic dialectometry agree with dialectology?
Second, what combinations of distance measures, feature
sets and other settings produce the best results for linguistic
analyses? Additionally, the results are meant to allow comparison with
phonological dialectometry.
The organization of this chapter mirrors the order of the methods
chapter, particularly the output analysis (section
\ref{output-analysis}). First, there is an overview of the different
parameter settings, the combinations of distance measure and feature
set, as well as other settings (section
\ref{section-parameter-settings}). Then the number of significant
distances for each parameter setting is given (section
\ref{section-significant}), which is followed by the correlation with
geography and travel distance for each parameter setting (section
\ref{section-correlation}). These sections focus mainly on detecting
which settings do not produce valid results, so that they can be
ignored in the rest of the chapter. At a high level, they answer the
question of the suitability of statistical syntactic dialectometry:
whether or not significant results can be found.
Next, the specific dialectological results are examined. First,
cluster dendrograms provide a visualization of which sites the
distance measures find to be similar (section
\ref{section-clusters}). In addition, to improve the reliability of
the dendrograms, consensus trees (section \ref{section-consensus}) and
composite cluster maps are produced (section
\ref{section-composite-cluster}). Next, multi-dimensional scaling
gives a smoother view of similarity than clusters (section
\ref{section-mds}). Finally, features are ranked and extracted from
each cluster in the consensus tree (section \ref{section-features}).
\section{Parameter Settings}
\label{section-parameter-settings}
There are 180 parameter settings investigated in this chapter. This
number arises from the four parameters: measure, feature set, sampling
method and number of normalization iterations. 5 measures, 9 feature
sets, 2 sampling methods and 2 settings for the number of
normalization iterations give $5\times 9 \times 2 \times 2=180$
different settings. The settings are given in table
\ref{parameter-settings}.
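The parameter grid can be enumerated mechanically as a sanity check on the count. A minimal Python sketch; the labels below are informal shorthands for the settings in table \ref{parameter-settings}, not identifiers from the actual experiment code:

```python
from itertools import product

# The four parameters: 5 measures, 9 feature sets,
# 2 sampling methods, 2 normalization settings.
measures = ["R", "R2", "KL", "JS", "cos"]
feature_sets = ["leaf-ancestor", "trigram", "leaf-head", "psr",
                "psr-grandparent", "unigram", "leaf-head-timbl",
                "leaf-arc", "combined"]
sampling = ["1000 sentences", "all sentences"]
iterations = [1, 5]

# Every combination of the four parameters is one experimental setting.
settings = list(product(measures, feature_sets, sampling, iterations))
print(len(settings))  # 5 * 9 * 2 * 2 = 180
```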
\begin{table}
\begin{tabular}{|c|} \hline
Feature Set \\\hline
Leaf-Ancestor Path \\
Part-of-speech Trigram \\
Leaf-Head Path \\
Phrase Structure Rule \\
PSR with Grandparent \\
Part-of-speech Unigram \\
Leaf-Head Path, based on Timbl training \\
Leaf-Arc Path \\
All features combined \\ \hline
\end{tabular}
\begin{tabular}{|c|} \hline
Measure \\ \hline
$R$ \\
$R^2$ \\
Kullback-Leibler divergence \\
Jensen-Shannon divergence \\
cosine dissimilarity\\\hline \hline
Sampling Method \\ \hline
1000 sentences \\
All sentences \\ \hline \hline
Iterations of normalization \\ \hline
1 \\
5 \\ \hline
\end{tabular}
\vspace{5mm}
\caption{Settings for the four parameters tested}
\label{parameter-settings}
\end{table}
% Actually, all this should probably go in methods too, somewhere as a summary.
In addition, the size of each of the 30 interview sites is given in
table \ref{corpus-size}.
\begin{table}
\begin{tabular}{c|cc|c|cc}
Site & Sentences & Words & Site & Sentences & Words \\\hline
Ankarsrum & 630 & 7708 & Leksand & 923 & 10676\\
Anundsjo & 1144 & 11897 & Loderup & 429 & 7850\\
Arsunda & 937 & 8933 & Norra Rorum & 546 & 9160\\
Asby & 693 & 7171 & Orust & 1067 & 11409\\
Bara & 696 & 10724 & Ossjo & 481 & 12275\\
Bengtsfors & 663 & 7423 & Segerstad & 837 & 9746\\
Boda & 1029 & 17425 & Skinnskatteberg & 730 & 9529\\
Bredsatra & 360 & 6938 & Sorunda & 768 & 11144\\
Faro & 659 & 8260 & Sproge & 381 & 4399\\
Floby & 557 & 6392 & StAnna & 876 & 13156\\
Fole & 727 & 9920 & Tors\aa{}s & 374 & 9217\\
Frillesas & 572 & 9634 & Torso & 956 & 15577\\
Indal & 1126 & 13090 & Vaxtorp & 903 & 11353\\
Jamshog & 301 & 8661 & Viby & 431 & 6734\\
K\"ola & 528 & 10133 & Villberga & 680 & 11479\\
\end{tabular}
\caption{Size of Interview Sites}
\label{corpus-size}
\end{table}
\section{Significant Distances}
\label{section-significant}
Significant distances help answer the question of whether a syntactic
measure has succeeded in finding reliable distances; the measure will
always return some distance, but if the sites are too small, it may
not be significant. Therefore the results should have few
non-significant distances. In the tables, the total number of
comparisons between all 30 sites is $435 = 30(30-1)/2$. In the
first set, tables \ref{sig-1-1000} -- \ref{sig-1-full}, the results
are shown from one iteration of the normalization step. In the second
set, tables \ref{sig-5-1000} -- \ref{sig-5-full}, the results from
five normalization iterations are shown.
Bold numbers in the tables indicate that fewer than 95\% of the
distances were significant. In table \ref{sig-5-full}, the
5-iteration table that compares full sites, the only combination
with {\it fewer} than 5\% non-significant results is cosine
dissimilarity with unigram features, marked with italics. Note that
here, 5\% is an arbitrary cutoff point unrelated to the usual
significance cutoff $p < 0.05$; these tables themselves report
counts of significant distances.
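The significance of an individual site-to-site distance can be estimated by a permutation test: pool the two sites' feature tokens, re-split them at random, and ask how often the resampled distance reaches the observed one. The sketch below illustrates the idea with a simple sum-of-absolute-differences distance standing in for the measures compared here; it is a generic illustration, not necessarily the exact procedure of the methods chapter.

```python
import random
from collections import Counter

def r_distance(a, b):
    """Sum of absolute differences of relative feature frequencies
    (an illustrative stand-in for the measures in the tables)."""
    ta, tb = sum(a.values()), sum(b.values())
    feats = set(a) | set(b)
    return sum(abs(a[f] / ta - b[f] / tb) for f in feats)

def permutation_p(site_a, site_b, trials=1000, rng=random.Random(0)):
    """Estimate significance of the distance between two sites by
    pooling their feature tokens and re-splitting at random."""
    observed = r_distance(Counter(site_a), Counter(site_b))
    pooled = list(site_a) + list(site_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        pa = Counter(pooled[:len(site_a)])
        pb = Counter(pooled[len(site_a):])
        if r_distance(pa, pb) >= observed:
            hits += 1
    return hits / trials

# Two toy sites with disjoint trigram inventories: the observed
# distance should almost never be matched by a random re-split.
print(permutation_p(["NN VB DT"] * 50, ["VB NN DT"] * 50, trials=200))
```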
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor &0&0&11&0&0 \\
Trigram &0&0&0&0&0 \\
Leaf-Head &0&0&0&0&0 \\
Phrase-Structure Rules &0&0&\textbf{95}&0&0 \\
Phrase-Structure with Grandparents &0&0&\textbf{273}&0&0 \\
Unigram &0&0&0&0&0 \\
Leaf-Head with MaltParser trained by Timbl &0&0&\textbf{47}&0&0 \\
Leaf-Arc Labels&0&0&0&0&0 \\
All Features Combined &0&0&0&0&0 \\
\end{tabular}
\caption{Number of non-significant distances for sample size 1000, 1
normalization}
\label{sig-1-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&7&11&12&\textbf{35}&9 \\
Trigram&4&1&0&\textbf{24}&1 \\
Leaf-Head&10&12&20&\textbf{44}&19 \\
Phrase-Structure Rules&\textbf{26}&17&\textbf{24}&\textbf{49}&20 \\
Phrase-Structure with Grandparents&\textbf{58}&\textbf{35}&\textbf{38}&\textbf{71}&\textbf{33}
\\
Unigram&1&2&0&0&2 \\
Leaf-Head with MaltParser trained by Timbl&11&21&18&\textbf{74}&\textbf{30}
\\
Leaf-Arc Labels&14&19&\textbf{37}&\textbf{94}&17 \\
All Features Combined&0&0&1&8&2 \\
\end{tabular}
\caption{Number of non-significant distances for complete sites, 1
normalization}
\label{sig-1-full}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&5 & \textbf{56} & \textbf{34} & 0 & 0\\
Trigram&3 & 2 & 0 & 0 & 0\\
Leaf-Head&3 & 14 & 4 & 0 & 0\\
Phrase-Structure Rules&11 & 4 & \textbf{66} & 1 & 0\\
Phrase-Structure with Grandparents&18 & 0 & \textbf{109} & 4 & 0\\
Unigram&\textbf{52} & \textbf{53} & 15 & 17 & 0\\
Leaf-Head with MaltParser trained by Timbl&7 & 20 & \textbf{45} & 0 & 0\\
Leaf-Arc Labels&6 & \textbf{54} & 17 & 1 & 0\\
All Features Combined&0 & 4 & 0 & 0 & 0\\
\end{tabular}
\caption{Number of non-significant distances for sample size 1000, 5
normalizations}
\label{sig-5-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&\textbf{290} & \textbf{284} & \textbf{287} & \textbf{278} & \textbf{204}\\
Trigram&\textbf{284} & \textbf{283} & \textbf{283} & \textbf{276} & \textbf{196}\\
Leaf-Head&\textbf{293} & \textbf{286} & \textbf{285} & \textbf{279} & \textbf{211}\\
Phrase-Structure Rules&\textbf{289} & \textbf{294} & \textbf{286} & \textbf{275} & \textbf{236}\\
Phrase-Structure with Grandparents&\textbf{285} & \textbf{290} & \textbf{286} & \textbf{270} & \textbf{258}\\
Unigram&\textbf{297} & \textbf{296} & \textbf{294} & \textbf{293} &
\textit{9}\\
Leaf-Head with MaltParser trained by Timbl&\textbf{294} & \textbf{289} & \textbf{288} & \textbf{284} & \textbf{222}\\
Leaf-Arc Labels&\textbf{294} & \textbf{290} & \textbf{291} & \textbf{293} & \textbf{162}\\
All Features Combined&\textbf{279} & \textbf{279} & \textbf{279} & \textbf{269} & \textbf{191}\\
\end{tabular}
\caption{Number of non-significant distances for complete sites, 5
normalizations}
\label{sig-5-full}
\end{table}
Analysis of the significance of dialect distance provides a measure of
how reliable the distances analyzed later in this chapter are. A
measure that does not find significant distances between the 30
sites is not suitable for precise inspection, although small numbers
of non-significant distances will still allow methods to
return interpretable results.
The highest number of significant distances is found in the first
case (table \ref{sig-1-1000}): 1 round of normalization with a
fixed-size sample of 1000 sentences. From there, both full-site
comparisons (table \ref{sig-1-full}) and 5 rounds of normalization
(table \ref{sig-5-1000}) have fewer significant distances, although
the number is still usable. However, the combination of the two,
5 rounds of normalization over full-site comparisons, has only one
combination with fewer than 5\% of distances that are {\it not}
significant. Although both full-site comparisons and multiple rounds
of normalization may increase the precision of the results, their
combined effect on significance is so detrimental that the results are
useless. For the rest of the analysis, the combination of full-site
comparison and 5 rounds of normalization will be skipped.
\subsection{Significance by Measure}
The distance measures most likely to find significance are, in order,
cosine dissimilarity, Jensen-Shannon divergence and $R$. Each measure
has different parameter settings for which it is stronger. For
1000-sentence sampling (tables \ref{sig-1-1000} and \ref{sig-5-1000}),
cosine dissimilarity finds all distances significant, even for
part-of-speech unigrams, which are intended as the baseline feature
set. Excluding unigrams, Jensen-Shannon divergence performs
similarly. For full-site comparisons (tables \ref{sig-1-full} and
\ref{sig-5-full}), both perform considerably worse; surprisingly, both
perform better on unigram features, Jensen-Shannon so much so that
unigrams are the only feature set for which it finds all significant
distances. $R$, on the other hand, performs decently on all
combinations of parameter settings; its low significance for phrase
structure rules is shared by the Kullback-Leibler and Jensen-Shannon
divergences.
When comparing the performance of Kullback-Leibler and Jensen-Shannon
divergence, it is not surprising that Jensen-Shannon outperforms
Kullback-Leibler on fixed-size sampling. Although both are called
``divergence'', Jensen-Shannon divergence is actually a
dissimilarity. Recall that the divergence from point A to B may differ
from the divergence from point B to A. A divergence like
Kullback-Leibler can be converted to a dissimilarity by measuring
$KL(A,B) + KL(B,A)$. However, this dissimilarity must skip features
unique to a single site in order to avoid division by zero. This
means that for smaller sites Kullback-Leibler loses information that
Jensen-Shannon is able to use. On the other hand, while this may
explain Kullback-Leibler's improved performance for full-site
comparisons, it does not explain Jensen-Shannon's much worse
performance there.
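The zero-feature problem can be made concrete. In the sketch below, `sym_kl` is a symmetrized Kullback-Leibler sum that skips features missing from either site, while `js` compares each distribution to their midpoint and so retains site-unique features. Both functions are illustrative, not the implementation used in this dissertation.

```python
from math import log

def sym_kl(p, q):
    """Symmetrized KL: KL(p,q) + KL(q,p), restricted to features
    present in both distributions (otherwise log(p/q) is undefined)."""
    shared = [f for f in p if f in q]
    return sum(p[f] * log(p[f] / q[f]) + q[f] * log(q[f] / p[f])
               for f in shared)

def js(p, q):
    """Jensen-Shannon divergence: average KL of each distribution to
    their midpoint; features unique to one site still contribute."""
    feats = set(p) | set(q)
    m = {f: (p.get(f, 0) + q.get(f, 0)) / 2 for f in feats}
    def kl_to_m(d):
        return sum(v * log(v / m[f]) for f, v in d.items() if v > 0)
    return (kl_to_m(p) + kl_to_m(q)) / 2

# A feature unique to one site is invisible to sym_kl but not to js:
p = {"NP->DT NN": 0.7, "NP->NN": 0.3}
q = {"NP->DT NN": 0.7, "NP->DT JJ NN": 0.3}
print(sym_kl(p, q))  # 0.0: only the shared feature survives
print(js(p, q) > 0)  # True
```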
\subsection{Significance by Feature Set}
% \item Unigrams do form an adequate baseline; they are bad but not too
% bad.
% The feature sets most likely to find significance are the combined
% features and unigrams., in order,
% trigrams, all combined features and leaf-head paths (both with
% support-vector-machine training and with Timbl's instance-based
% training). Without ratio normalization, the other feature sets are not
% much worse, but with it included, these three are the best by some
% distance.
For 1 round of normalization, the best feature sets are the simple
ones: trigrams and unigrams, as well as the combined feature set. On
the other hand, trigrams and leaf-head paths (with their variations)
are the best feature sets with 5 rounds of normalization. However, the
variation is not strong; any feature set can give good results with
the right distance measure. The problem is that no clear patterns
emerge.
The relatively high quality of trigrams and unigrams does not make
sense given only the linguistic facts; however, it is likely that the
entirely automatic annotation used here introduces more and more
errors as each annotator runs on the output of previous automatic
annotations. Trigrams are the result of only one automatic annotation,
and one for which the state of the art is near human performance. So
the fact that these particular parts of speech are of higher quality
than the corresponding dependencies or constituencies is probably the
deciding factor in their higher number of significant
distances.
% Although it is impossible to tell from my results, I
% predict that a manually annotated dialect corpus would show that
% non-flat syntactic structure is helpful in producing significant
% distances.
Given the above facts, the question should rather be: why do leaf-head
paths perform as well as they do? Better, for example, than the
leaf-ancestor paths on which they are modeled: why does more
normalization hurt leaf-ancestor paths but not leaf-head paths? It
could be that there is less room for error; many of the common
leaf-head paths are short: short interview sentences with simple
structure make for shorter leaf-head paths than leaf-ancestor
paths. As a result, the important leaf-head paths consist mainly of a
couple of parts of speech. This difference in feature length holds for
any length of sentence, but is exaggerated for simple sentences,
where a phrase-structure parse generates more structure for a clause
than a dependency parse does. In general, clauses, embedded and
otherwise, produce the largest difference in amount of structure
between the two, so the feature length differs for deeply nested
sentences as well.
Another reason could be a difference in parsers: MaltParser has been
tested on Swedish by its designers \cite{nivre06b}, while the Berkeley
parser, besides English, has been tested prominently on German and
Chinese. The difference is therefore better explained by the
difference in parsers than by any unsuitability of Swedish for
constituent analysis.
It is linguistically disappointing that trigrams provide the most
reliable results so far; a linguist would expect that including
syntactic information would make it easier to measure the differences
between sites. If this is, as hypothesized here, an effect of chaining
machine annotators, a study using a manually annotated corpus could
detect it. Still, it means that trigrams are the most useful feature
set from a practical view, because automatic trigram tagging comes
very close to human performance with little training. In most cases,
then, the only human work required is the transcription of the
interviews.
On the other hand, if additional feature sets are to be developed for
a corpus, then combining all available features seems to be a
successful strategy. The distance measures seem able to use all
available information for finding significant distances.
\section{Correlation}
\label{section-correlation}
In dialectology, the default expectation for dialect distance is that
it correlates with geographic distance \cite{chambers98}. A lack of
correlation does not necessarily mean that a measure is invalid, but
presence of correlation means that the distance measure substantiates the
well-known tendency of dialect distributions to be more or less
smoothly gradient over physical space.
In addition, distance measures are more likely to correlate
significantly with travel distance than with straight-line geographic
distance. This makes sense: the difficulty of moving from place to
place is what influences dialect formation, and taking roads into
account estimates that difficulty better than straight-line distance
does.
The tables that present geographic and travel correlation,
\ref{cor-1-1000} -- \ref{travel-cor-5-full}, mark significant
correlations with a star for $p < 0.05$, two stars for $p < 0.01$ and
three stars for $p < 0.001$. However, these correlations are only
trustworthy when the underlying distances are significant. Significant
correlations from significant distances (as cross-referenced from
tables \ref{sig-1-1000} -- \ref{sig-5-full}) are marked by italics.
Besides this, correlation between combinations of measure and feature
set can show how closely related they are: in other words, how
similarly they view the underlying data, which is the same for
all. This is analyzed in section
\ref{results-chapter-inter-measure-correlation}.
The reasoning is similar to that behind correlation with geography:
the assumption is that geography is a factor underlying dialect
formation, so a distance measure that captures some aspect of the
language, which we hope is dialect, is also indirectly measuring
geography. Therefore, correlation with geography should occur.
Third, correlation with corpus size is not predicted and is probably
an undesired defect in sampling or normalization. Correlation with
corpus size is presented in tables \ref{size-cor-1-1000} --
\ref{size-cor-5-full}.
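The correlation reported in these tables is, for each parameter setting, a single coefficient over the 435 site pairs: the vector of dialect distances against the matching vector of geographic (or travel) distances. A minimal sketch, assuming Pearson's $r$ as the coefficient; the vectors shown are hypothetical:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length vectors, e.g. the
    435 pairwise dialect distances and the matching geographic or
    travel distances."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sqrt(sum((x - mx) ** 2 for x in xs))
    vy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

# Hypothetical pair distances: dialect distance vs. kilometers.
dialect = [0.12, 0.30, 0.25, 0.18, 0.40, 0.33]
geo = [40, 120, 95, 60, 160, 130]
print(round(pearson(dialect, geo), 2))
```

Note that the entries of a distance matrix are not independent observations, which is why the significance of such correlations needs its own test rather than the textbook $t$-test for $r$.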
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&-0.01 & 0.03 & 0.02 & -0.02 & 0.08\\
Trigram&0.17 & 0.17 & 0.10 & 0.19 & 0.13\\
Leaf-Head&-0.06 & 0.03 & 0.00 & -0.07 & 0.05\\
Phrase-Structure Rules&0.01 & \textit{0.18*} & 0.16 & 0.01 & 0.12\\
Phrase-Structure with Grandparents&0.03 & \textit{0.25*} & 0.21* & 0.03 & 0.12\\
Unigram&\textit{0.18*} & 0.17 & \textit{0.29**} & \textit{0.30**} & \textit{0.18*}\\
Dependencies, MaltParser trained by Timbl&-0.07 & 0.02 & -0.00 & -0.08 & 0.05\\
Arc-Head&-0.07 & 0.06 & -0.06 & -0.09 & 0.00\\
All Features Combined&-0.02 & 0.03 & 0.01 & -0.02 & 0.07\\
\end{tabular}
\caption{Geographic correlation for sample size 1000, 1 normalization iteration}
\label{cor-1-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&0.02 & 0.09 & 0.11 & -0.00 & 0.09\\
Trigram&\textit{0.27*} & \textit{0.26*} & \textit{0.30**} & 0.21* & 0.08\\
Leaf-Head&-0.03 & 0.12 & 0.14 & -0.06 & 0.02\\
Phrase-Structure Rules&0.13 & \textit{0.36**} & 0.30** & 0.11 & \textit{0.20*}\\
Phrase-Structure with Grandparents&0.15 & 0.41** & 0.36** & 0.14 & 0.19*\\
Unigram&\textit{0.20*} & \textit{0.20*} & \textit{0.33**} & \textit{0.33**} & \textit{0.22*}\\
Dependencies, MaltParser trained by Timbl&-0.02 & 0.14 & 0.16 & -0.05 & 0.02\\
Arc-Head&-0.06 & 0.13 & -0.01 & -0.12 & -0.03\\
All Features Combined&0.03 & 0.11 & 0.16 & -0.00 & 0.04\\
\end{tabular}
\caption{Geographic correlation for complete sites, 1 normalization iteration}
\label{cor-1-full}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&0.14 & 0.14 & 0.16 & 0.15 & 0.08\\
Trigram&\textit{0.22*} & 0.17 & \textit{0.22*} & \textit{0.22*} & 0.16\\
Leaf-Head&0.10 & 0.11 & 0.15 & 0.12 & 0.10\\
Phrase-Structure Rules&0.14 & 0.10 & 0.14 & 0.15 & 0.06\\
Phrase-Structure with Grandparents&0.16 & 0.14 & 0.14 & 0.15 & 0.05\\
Unigram&0.12 & 0.11 & 0.14 & 0.13 & 0.17\\
Dependencies, MaltParser trained by Timbl&0.09 & 0.12 & 0.16 & 0.11 & 0.11\\
Arc-Head&0.08 & 0.10 & 0.14 & 0.10 & 0.09\\
All Features Combined&0.19 & 0.16 & \textit{0.20*} & \textit{0.21*} & 0.11\\
\end{tabular}
\caption{Geographic correlation for sample size 1000, 5
normalizations}
\label{cor-5-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&-0.14 & -0.16 & -0.15 & -0.15 & -0.08\\
Trigram&-0.09 & -0.07 & -0.09 & -0.09 & -0.09\\
Leaf-Head&-0.22 & -0.21 & -0.18 & -0.22 & -0.10\\
Phrase-Structure Rules&-0.19 & -0.14 & -0.11 & -0.20 & -0.01\\
Phrase-Structure with Grandparents&-0.17 & -0.11 & -0.09 & -0.18 & -0.02\\
Unigram&-0.10 & -0.06 & -0.07 & -0.08 & 0.14\\
Dependencies, MaltParser trained by Timbl&-0.19 & -0.18 & -0.18 & -0.19 & -0.10\\
Arc-Head&-0.21 & -0.18 & -0.18 & -0.21 & -0.10\\
All Features Combined&-0.18 & -0.18 & -0.16 & -0.18 & -0.09\\
\end{tabular}
\caption{Geographic correlation for complete sites, 5 normalizations}
\label{cor-5-full}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&-0.03 & 0.02 & 0.01 & -0.04 & 0.07\\
Trigram&0.20 & 0.19 & 0.11 & \textit{0.23*} & 0.14\\
Leaf-Head&-0.07 & 0.01 & -0.01 & -0.08 & 0.05\\
Phrase-Structure Rules&0.01 & \textit{0.18*} & 0.17 & 0.00 & 0.14\\
Phrase-Structure with Grandparents&0.03 & \textit{0.26*} & 0.22* & 0.03 & 0.15\\
Unigram&\textit{0.20*} & \textit{0.19*} & \textit{0.30**} & \textit{0.31**} & \textit{0.21*}\\
Dependencies, MaltParser trained by Timbl&-0.08 & 0.02 & -0.01 & -0.09 & 0.05\\
Arc-Head&-0.08 & 0.05 & -0.06 & -0.10 & 0.00\\
All Features Combined&-0.03 & 0.03 & 0.01 & -0.03 & 0.06\\
\end{tabular}
\caption{Travel correlation for sample size 1000, 1 normalization iteration}
\label{travel-cor-1-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&0.02 & 0.08 & 0.11 & 0.00 & 0.08\\
Trigram&\textit{0.31*} & \textit{0.28*} & \textit{0.32**} & 0.26* & 0.09\\
Leaf-Head&-0.02 & 0.12 & 0.13 & -0.05 & 0.01\\
Phrase-Structure Rules&0.15 & \textit{0.37**} & 0.32** & 0.13 & \textit{0.22*}\\
Phrase-Structure with Grandparents&0.17 & 0.43** & 0.38** & 0.16 & 0.22*\\
Unigram&\textit{0.22*} & \textit{0.22*} & \textit{0.33**} & \textit{0.34**} & \textit{0.24*}\\
Dependencies, MaltParser trained by Timbl&-0.01 & 0.14 & 0.17 & -0.04 & 0.02\\
Arc-Head&-0.06 & 0.12 & -0.02 & -0.12 & -0.03\\
All Features Combined&0.04 & 0.10 & 0.16 & 0.01 & 0.04\\
\end{tabular}
\caption{Travel correlation for complete sites, 1 normalization iteration}
\label{travel-cor-1-full}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&0.17 & 0.19* & 0.17* & 0.18 & 0.07\\
Trigram&\textit{0.24*} & \textit{0.20*} & \textit{0.25*} & \textit{0.26*} & 0.16\\
Leaf-Head&0.14 & 0.16 & 0.17 & 0.15 & 0.10\\
Phrase-Structure Rules&0.17 & 0.14 & 0.16* & 0.18 & 0.06\\
Phrase-Structure with Grandparents&0.19 & \textit{0.18*} & 0.17* & 0.19 & 0.06\\
Unigram&0.15 & 0.13 & \textit{0.17*} & 0.16 & \textit{0.20*}\\
Dependencies, MaltParser trained by Timbl&0.12 & 0.16 & 0.18 & 0.14 & 0.11\\
Arc-Head&0.09 & 0.13 & 0.14 & 0.11 & 0.08\\
All Features Combined&\textit{0.23*} & \textit{0.20*} & \textit{0.22*} & \textit{0.24*} & 0.11\\
\end{tabular}
\caption{Travel correlation for sample size 1000, 5 normalizations}
\label{travel-cor-5-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&-0.13 & -0.13 & -0.10 & -0.13 & -0.04\\
Trigram&-0.06 & -0.04 & -0.05 & -0.06 & -0.05\\
Leaf-Head&-0.20 & -0.17 & -0.13 & -0.19 & -0.06\\
Phrase-Structure Rules&-0.15 & -0.08 & -0.05 & -0.15 & 0.04\\
Phrase-Structure with Grandparents&-0.12 & -0.05 & -0.03 & -0.13 & 0.03\\
Unigram&-0.07 & -0.03 & -0.04 & -0.05 & \textit{0.18*}\\
Dependencies, MaltParser trained by Timbl&-0.18 & -0.15 & -0.12 & -0.18 & -0.05\\
Arc-Head&-0.20 & -0.17 & -0.14 & -0.19 & -0.06\\
All Features Combined&-0.16 & -0.14 & -0.11 & -0.15 & -0.05\\
\end{tabular}
\caption{Travel correlation for complete sites, 5 normalizations}
\label{travel-cor-5-full}
\end{table}
From tables \ref{cor-1-1000} -- \ref{travel-cor-5-full} we see that
parameter settings that correlate significantly do so at around 0.2 to
0.3, with a high of 0.37 for phrase-structure-rule features measured
by $R^2$ with 1 normalization iteration and comparison of full
sites. The significant correlations are concentrated mostly in the
trigram, unigram and combined feature sets.
\subsection{Analysis}
As with the number of significant distances, trigrams and unigrams are
the most likely to correlate with geographic and travel distance; the
combined feature set also correlates under the 5-normalization
parameter settings.
% As before, a possible explanation is that unigrams are
% simpler, so the type count is a higher than for other measures. With
% more rounds of normalization, more correlations shift over to
% trigrams.
Note that in tables \ref{cor-1-1000} -- \ref{travel-cor-5-full}, the
significant correlations are marked with an asterisk, but only the
italicized correlations are based on at least 95\% significant
distances. This means, for example, that most of the significant
correlations based on phrase-structure rules are not valid. It is
worth noting, however, that the valid and significant correlations
based on phrase-structure rules give the highest correlations: 0.37
for $R^2$ with full-site comparisons and 1 round of normalization.
Adding more data and more normalization is interesting in that it
expands the correlating parameter settings beyond those that include
unigram features. This may be an instance of a noise/quality tradeoff:
these additions appear to extract more detail from the data, at the
cost of additional interference from noisy data.
% Goes here: Fevered speculation about why travel correlation is *better* with
% the methods that correlate *less*, for 1-full at least.
% OK never mind this isn't true.
\subsection{Inter-measure Correlation}
\label{results-chapter-inter-measure-correlation}
Correlation between measures shows that they produce similar results,
and suggests that they use similar information to do so. For example,
cosine dissimilarity correlates least with the other measures, meaning
that its results are least like theirs and implying that it uses the
information in the input features differently. Since the performance
of the summed, non-cosine measures is a little better at this site
size, practical use of this distance method should probably start with
them. In other computational-linguistics applications, cosine distance
is typically used with larger corpora, so it may provide better
results with larger corpora, such as corpora based on entire provinces
of Sweden rather than the individual villages used in this
dissertation.
The average correlation between different measures is given in table
\ref{self-correlation-measures}. The correlations are averaged over
all combinations of feature set with 1000-sentence samples, with
non-significant correlations removed before averaging.
\begin{table}
\begin{tabular}{r|cccc}
& $R^2$ & $KL$ & $JS$ & cos \\ \hline
$R$ & 0.85 & 0.85 & 0.98 & 0.39\\
$R^2$&& 0.90 & 0.83 & 0.57\\
$KL$ &&& 0.88 & 0.67\\
$JS$ &&&& 0.44
\end{tabular}
\caption{Average inter-measure correlation}
\label{self-correlation-measures}
\end{table}
The inter-measure correlation is essentially a summary of the results
from the significance testing and correlations. $R$ and Jensen-Shannon
produce nearly identical results, and also correlate highly. Cosine
dissimilarity is quite different from the other measures, though the
correlation is still higher than with travel distance. This is
expected insofar as the cosine operation at the heart of cosine
dissimilarity differs from the sums of absolute values or logarithms
used by the other measures.
\subsection{Correlation with Corpus Size}
As previously stated, correlation with corpus size is not predicted and is probably
an undesired defect in sampling or normalization. Correlation with
corpus size is presented in tables \ref{size-cor-1-1000} --
\ref{size-cor-5-full}.
Corpus size for a pair of sites can be measured in two ways: as the
sum of the two sites' sizes or as the difference between them. Here
the sum is used: a larger sum means more tokens. If there is a
correlation with size, it must arise because higher token counts are
not properly normalized. In other words, two large sites have more
tokens, leading to higher type counts, which leads directly to higher
distances; smaller sites lead to lower distances.
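The check behind these tables can be sketched as follows. This is a hypothetical Python sketch (site names and the data layout are invented for illustration): each pair of sites contributes a summed token count and a measured distance, and the two series are correlated.

```python
import math
from itertools import combinations

def size_correlation(sizes, dist):
    """Pearson correlation of summed corpus size against dialect distance.
    sizes: {site: token count}; dist: {(site_a, site_b): distance}."""
    xs, ys = [], []
    for a, b in combinations(sorted(sizes), 2):
        xs.append(sizes[a] + sizes[b])   # summed corpus size for the pair
        ys.append(dist[(a, b)])          # measured dialect distance
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A significant positive correlation from such a check is what signals a normalization leak.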
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&-0.38 & -0.26 & -0.37 & -0.40 & -0.37\\
Trigram&0.12 & -0.12 & -0.16 & 0.14 & -0.18\\
Leaf-Head&-0.39 & -0.26 & -0.35 & -0.43 & -0.39\\
Phrase-Structure Rules&0.06 & 0.15 & 0.00 & 0.03 & -0.10\\
Phrase-Structure with Grandparents&0.08 & 0.19 & 0.07 & 0.04 & -0.09\\
Unigram&-0.08 & -0.14 & -0.09 & -0.09 & -0.10\\
Dependencies, MaltParser trained by Timbl&-0.35 & -0.23 & -0.28 & -0.37 & -0.37\\
Arc-Head&-0.44 & -0.26 & -0.40 & -0.48 & -0.34\\
All Features Combined&-0.37 & -0.26 & -0.38 & -0.42 & -0.40\\
\end{tabular}
\caption{Size correlation for sample size 1000, 1 normalization}
\label{size-cor-1-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&-0.19 & -0.15 & -0.16 & -0.24 & -0.36\\
Trigram&\textit{0.30*} & 0.08 & 0.19 & 0.08 & -0.39\\
Leaf-Head&-0.17 & -0.06 & -0.08 & -0.26 & -0.41\\
Phrase-Structure Rules&0.52** & \textit{0.40**} & 0.30* & 0.47** & -0.21\\
Phrase-Structure with Grandparents&0.54** & 0.43** & 0.37** & 0.50** & -0.22\\
Unigram&-0.09 & -0.13 & -0.11 & -0.13 & -0.13\\
Dependencies, MaltParser trained by Timbl&-0.08 & 0.02 & 0.09 & -0.14 & -0.39\\
Arc-Head&-0.32 & -0.16 & -0.26 & -0.40 & -0.35\\
All Features Combined&-0.15 & -0.11 & -0.10 & -0.25 & -0.42\\
\end{tabular}
\caption{Size correlation for complete sites, 1 normalization}
\label{size-cor-1-full}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&\textit{0.35*} & 0.36** & 0.06 & 0.27 & -0.32\\
Trigram&\textit{0.75**} & \textit{0.63**} & \textit{0.46**} & \textit{0.68**} & -0.24\\
Leaf-Head&\textit{0.46**} & \textit{0.44**} & 0.14 & \textit{0.38**} & -0.33\\
Phrase-Structure Rules&\textit{0.85**} & \textit{0.59**} & 0.36** & \textit{0.85**} & -0.34\\
Phrase-Structure with Grandparents&\textit{0.88**} & \textit{0.66**} & 0.40** & \textit{0.88**} & -0.36\\
Unigram&0.38** & 0.35** & 0.14 & 0.19 & -0.04\\
Dependencies, MaltParser trained by Timbl&\textit{0.44**} & \textit{0.41**} & 0.16 & \textit{0.39*} & -0.30\\
Arc-Head&0.20 & 0.28* & -0.00 & 0.09 & -0.28\\
All Features Combined&\textit{0.58**} & \textit{0.48**} & 0.21 & \textit{0.47**} & -0.31\\
\end{tabular}
\caption{Size correlation for sample size 1000, 5 normalizations}
\label{size-cor-5-1000}
\end{table}
\begin{table}
\begin{tabular}{l|rrrrr}
& $R$ & $R^2$ & KL & JS & cos \\ \hline
Leaf-Ancestor&-0.55 & -0.38 & -0.26 & -0.53 & -0.17\\
Trigram&-0.29 & -0.27 & -0.19 & -0.26 & -0.14\\
Leaf-Head&-0.61 & -0.43 & -0.27 & -0.58 & -0.18\\
Phrase-Structure Rules&-0.21 & -0.08 & -0.04 & -0.22 & -0.14\\
Phrase-Structure with Grandparents&-0.24 & -0.08 & -0.03 & -0.26 & -0.14\\
Unigram&-0.38 & -0.25 & -0.30 & -0.32 & -0.08\\
Dependencies, MaltParser trained by Timbl&-0.52 & -0.33 & -0.20 & -0.51 & -0.15\\
Arc-Head&-0.59 & -0.45 & -0.33 & -0.54 & -0.20\\
All Features Combined&-0.61 & -0.44 & -0.26 & -0.55 & -0.18\\
\end{tabular}
\caption{Size correlation for complete sites, 5 normalizations}
\label{size-cor-5-full}
\end{table}
In tables \ref{size-cor-1-1000} and \ref{size-cor-1-full}, the
1-normalized correlations, only two correlations are
significant. However, in table \ref{size-cor-5-1000}, the 5-normalized
correlations with 1000-sentence sampling, a large number of
correlations are significant. Specifically, the highest-performing
measures, $R$, $R^2$, and Jensen-Shannon divergence, correlate
significantly with size for nearly all feature sets. Since this
correlation is not predicted, these distances may be invalid. However,
another piece of evidence makes this conclusion uncertain: geographic
distance also correlates with corpus size at a rate of 0.31, $p <
0.01$, and travel distance correlates at 0.32, $p < 0.01$. These
correlations are also unexpected, since there is no reason to expect
distance to predict corpus size or vice versa. However, they show that
the size correlation of dialect distance may be at least partly
explained by the unexpected correlation with geographic and travel
distance. Therefore, 5-normalized results are presented throughout the
rest of the results.
\subsubsection{Analysis}
The correlation of corpus size and dialect distance is a problem. It
is not predicted as a side effect of the way dialect distance is
measured. The fact that travel distance also correlates with corpus
size at a rate of 0.32 confuses the issue further. Is corpus size the
determining variable? Or is there an unknown variable influencing all
three? One possibility is ``interviewer boundaries'', common in
corpora collected by multiple people \cite{nerbonne03}. Perhaps a
single interviewer improved with practice and collected longer interviews as
the interview collection progressed. Or perhaps cultural differences
between the interviewer and interviewees caused some participants in
one area to talk more than in another area.
Although the size correlation of the dialect distances may be
explained by the correlation with geographic/travel distance, it is
still somewhat worrying. The dialect distances' size correlations
exceed the correlation of corpus size with geographic/travel distance
by enough that 5-normalized distances might not be reliable.
However, if 5-normalization introduces a dependency on corpus size,
then the distances from full-corpus comparisons should correlate even
more highly. This is not the case.
% It appears that multiple rounds of
% normalization inadvertently re-introduce a dependency on size.
% TODO: This probably IS a bug in that only Freq norm can be
% iterated. Ratio norm should probably be in a separate loop like so:
% #ifdef RATIO_NORM
% for(sample::iterator i = ab.begin(); i!=ab.end(); i++) {
% i->second.first *= 2 * types / tokens;
% i->second.second *= 2 * types / tokens;
% }
% #endif
% TODO: I also should write this up when I have time
Alternatively, it is possible that the fixed-size sampling method does
not properly eliminate size differences between interview sites.
Future work should develop a method for normalizing a comparison
between two full sites; it should avoid sampling but still take the
relative number of sentences into account.
\section{Clusters}
\label{section-clusters}
Cluster dendrograms provide a visualization of which sites the
distance measures find most similar. They are formed in a bottom-up
manner, repeatedly merging the two most similar groups at each step
until only one group remains. The resulting dendrogram usually has
obvious sub-trees which can be treated as clusters. By grouping sites
into clusters, cluster dendrograms allow a closer comparison to
dialectology than correlation does: the clusters can be compared
directly to the regions proposed by syntactic dialectology.
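The bottom-up procedure can be sketched as follows. This is an illustrative stand-in using average linkage in Python; the figures in this section were produced with standard hierarchical clustering software, so the details differ.

```python
from itertools import combinations

def agglomerate(sites, dist):
    """Bottom-up clustering sketch. sites: site names;
    dist: {(a, b): distance} keyed on sorted pairs."""
    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def linkage(c1, c2):
        # average distance between all cross-cluster site pairs
        return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(s,) for s in sites]   # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        # repeatedly merge the two most similar groups
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges                      # the merge order defines the tree
```

The sequence of merges defines the dendrogram: early merges become the small, tight sub-trees, and late merges the top-level splits.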
The first two dendrograms in this section hold feature set, measure,
and sample size constant at trigrams, Jensen-Shannon, and
1000-sentence samples, respectively. Then they vary the amount of
normalization: figure \ref{cluster-1-js-trigram} has 1 normalization
round, while figure \ref{cluster-5-js-trigram} has 5. These two examples
were chosen because of their high numbers of significant
distances and correlation with travel distance; the highest
correlation of 5-normalized distances with travel distance, 0.26, is
with the Jensen-Shannon measure and trigram features in figure
\ref{cluster-5-js-trigram}.
The third figure, figure \ref{cluster-1-r_sq-psg}, gives the
dendrogram for the parameter settings with the highest travel-distance
correlation among 1-normalized distances: 0.37, given by the $R^2$
measure over phrase-structure-rule features, comparing full sites.
% Within the same settings for sampling and number of normalization
% iterations, the clusters based on sentence-length normalization alone are fairly
% similar, regardless of measure and feature set. Changing the sampling
% settings or the number of normalizations substantial reconfiguration.
% For example, the clusters produced by $R$ (figure
% \ref{cluster-1-r-trigram}) and Jensen-Shannon divergence are fairly
% similar (figure \ref{cluster-1-js-trigram}). Both are based on trigram
% features with sentence-length normalization only. Those dendrograms
% differ from their 5-normalized equivalents, figures
% \ref{cluster-5-r-trigram} and \ref{cluster-5-js-trigram}.
% \begin{figure}
% \includegraphics[width=0.9\textwidth]{dist-1-1000-r-trigram-ratio-clusterward}
% \caption{Dendrogram With $R$
% measure and trigram features, 1 normalization, 1000 samples}
% \label{cluster-1-r-trigram}
% \end{figure}
% TODO: Remove R.app's captions in favour of mine.
% TODO: Remove R.app's x-scale (y-scale) too
\begin{figure}
\includegraphics[width=0.9\textwidth]{dist-1-1000-js-trigram-ratio-clusterward}
\caption{Dendrogram With Jensen-Shannon
measure and trigram features, 1 normalization, 1000 samples}
\label{cluster-1-js-trigram}
\end{figure}
\begin{figure}
\includegraphics[width=0.9\textwidth]{dist-5-1000-js-trigram-ratio-clusterward}
\caption{Dendrogram With Jensen-Shannon
measure and trigram features, 5 normalizations, 1000 samples}
\label{cluster-5-js-trigram}
\end{figure}
\begin{figure}
\includegraphics[width=0.9\textwidth]{dist-1-full-r_sq-psg-ratio-clusterward}
\caption{Dendrogram With $R^2$ measure and phrase-structure-rule features,
1 normalization, complete sites}
\label{cluster-1-r_sq-psg}
\end{figure}
% \begin{figure}
% \includegraphics[width=0.9\textwidth]{dist-5-1000-r-trigram-ratio-clusterward}
% \caption{Dendrogram With $R$ measure and trigram features, 5 normalizations, 1000 samples}
% \label{cluster-5-r-trigram}
% \end{figure}
Unlike the significances, cosine similarity's dendrograms are fairly
similar to those of the other measures. See for example figure
\ref{cluster-5-cos-trigram}, which uses the cosine measure, trigram
features, and 5 iterations of normalization.
However, it is difficult to judge the amount of agreement between
these individual dendrograms, and the figures are given mostly as
examples rather than for in-depth comparison. Instead of manually
comparing each one to the dialect regions of Sweden, a better option
is to aggregate them automatically into a single dendrogram that
retains only the clusters on which they agree: a consensus tree.
\begin{figure}
\includegraphics[width=0.9\textwidth]{dist-5-1000-cos-trigram-ratio-clusterward}
\caption{Dendrogram with cosine measure and trigram features, 5
normalizations}
\label{cluster-5-cos-trigram}
\end{figure}
\subsection{Consensus Trees}
\label{section-consensus}
Consensus trees combine the results of cluster dendrograms, retaining
only clusters that occur in the majority of dendrograms. When
dendrograms have high agreement, the resulting consensus tree will
retain most of the detail. When dendrograms have low agreement, the
resulting consensus tree will be fairly flat. This avoids the
dendrograms' problem of instability, where small changes in distances
cause large re-arrangements in the tree. Only dendrograms whose input
distances were at least 95\% significant were used. That is, a
measure/feature set combination had to be non-bold in tables
\ref{sig-1-1000} to \ref{sig-5-full} to be included. The consensus
tree for full-site comparisons and 5 rounds of normalization is not
given because there is only one dendrogram that qualifies.
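The majority-rule criterion itself can be sketched simply. In this hypothetical Python sketch, each input dendrogram is represented as a flat list of site sets (its clusters), and only clusters appearing in more than half of the inputs survive.

```python
from collections import Counter

def majority_clusters(dendrograms):
    """Majority-rule consensus sketch.
    dendrograms: list of clusterings, each a list of site sets."""
    counts = Counter()
    for clustering in dendrograms:
        for cluster in clustering:
            counts[frozenset(cluster)] += 1   # sets hashed by membership
    half = len(dendrograms) / 2
    # keep only clusters occurring in a majority of the dendrograms
    return [set(c) for c, n in counts.items() if n > half]
```

High agreement among the inputs leaves many clusters intact; low agreement flattens the result, exactly the behavior described above.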
It is worth noting that more dendrograms were used to build the
consensus tree of figure \ref{consensus-5-1000} than were used in
figures \ref{consensus-1-1000} and \ref{consensus-1-full}. Despite
this, figure \ref{consensus-5-1000} retains much more detail,
indicating that its constituent dendrograms, based on 5 rounds of
normalization, agree more than those with only 1 round of
normalization.
The consensus trees are also grouped into clusters, which are then
mapped in figures \ref{map-consensus-1-1000} --
\ref{map-consensus-5-1000}. The outline map of Sweden was provided by
Therese Leinonen and is the same as those used in
\namecite{leinonen08}. The L04 package from the University of
Groningen was used to map the consensus trees onto the map of Sweden;
the multi-dimensional scaling maps and composite cluster maps also
used L04.
% TODO: CITE this, I think it's a Pieter Klieweg paper
\begin{figure}
\includegraphics[scale=0.7]{consensus-1-1000}
% \Tree[. {Villberga\\Viby\\Vaxtorp\\Torso\\Tors\aa{}s\\StAnna\\Sproge\\Sorunda\\Skinnskatteberg\\Segerstad\\Ossjo\\Orust\\Norra Rorum\\Loderup\\Leksand\\K\"ola\\Jamshog\\Indal\\Frillesas\\Fole\\Faro\\Bredsatra\\Boda\\Bara\\Asby\\Arsunda\\Anundsjo\\Ankarsrum} [. {Floby\\Bengtsfors} ] ]
\caption{Consensus Tree for 1000-samples and 1 normalization}
\label{consensus-1-1000}
\end{figure}
\begin{figure}
\includegraphics[scale=0.7]{consensus-1-full}
% \Tree[. {Villberga\\Viby\\Torso\\Tors\aa{}s\\Sorunda\\Segerstad\\Ossjo\\Orust\\Norra Rorum\\Loderup\\Leksand\\K\"ola\\Indal\\Fole\\Boda\\Bara\\Asby\\Arsunda\\Anundsjo\\Ankarsrum}
% [. {Vaxtorp\\Skinnskatteberg} ]
% [. {StAnna\\Frillesas} ]
% [. {Sproge\\Faro} ]
% [. {Jamshog\\Bredsatra} ]
% [. {Floby\\Bengtsfors} ] ]
\caption{Consensus Tree for full site comparison and 1 normalization}
\label{consensus-1-full}
\end{figure}
\begin{figure}
\includegraphics[scale=0.7]{consensus-5-1000}
% \Tree[. {Villberga\\Viby\\Vaxtorp\\Torso\\StAnna\\Sproge\\Sorunda\\Skinnskatteberg\\Segerstad\\Orust\\Norra Rorum\\Leksand\\K\"ola\\Indal\\Frillesas\\Fole\\Floby\\Faro\\Boda\\Bengtsfors\\Bara\\Asby\\Arsunda\\Anundsjo\\Ankarsrum} [. {Loderup\\Bredsatra} ] [. {Tors\aa{}s\\Ossjo\\Jamshog} ] ]
\caption{Consensus Tree for 1000-samples and 5 normalizations}
\label{consensus-5-1000}
\end{figure}
\begin{figure}
\includegraphics[scale=0.85]{Sverigekarta-Landskap-consensus-1-1000}
\caption{Consensus Tree for 1000-samples and 1 normalization, Mapped}
\label{map-consensus-1-1000}
\end{figure}
\begin{figure}
\includegraphics[scale=0.85]{Sverigekarta-Landskap-consensus-1-full}
\caption{Consensus Tree for full site comparison and 1 normalization, Mapped}
\label{map-consensus-1-full}
\end{figure}
\begin{figure}
\includegraphics[scale=0.85]{Sverigekarta-Landskap-consensus-5-1000}
\caption{Consensus Tree for 1000-samples and 5 normalizations, Mapped}
\label{map-consensus-5-1000}
\end{figure}
% It would still be cool to eliminate only the non-significant distances
% and re-run the clusters. (I can't remember if that's easily possible
% with R though, it may only be a feature of MDS.)
\subsubsection{Analysis}
It is dangerous to interpret the cluster dendrograms too closely on
their own; the instability of a single dendrogram means that small
clusters cannot be analyzed reliably. For example, in figure
\ref{cluster-5-js-trigram}, a two-way split between the sites at the
top and bottom of the page is obvious, and another split within the
top cluster is easy to argue for, but outliers like Anundsj\"o and
\AA{}rsunda are likely to shift from group to group in other
dendrograms.
It is safer to analyze the consensus trees: the smoothing effect of
taking the majority rule for each cluster removes spurious detail and
shows where the optimal cutoff for splitting clusters lies. The three
consensus trees in figures \ref{consensus-1-1000} --
\ref{consensus-5-1000} vary in their amount of detail, but the trees
with more clusters do not contradict the clusters of the flatter
trees.
For 1000-sentence samples and 1 round of normalization, there is one
cluster: Floby and Bengtsfors. Full-site comparison finds
another cluster: J\"amshog, \"Ossj\"o and Tors\aa{}s. Finally,
1000-sentence samples and 5 rounds of normalization finds another
cluster consisting of L\"oderup and Breds\"atra. It also finds
a large two-way split between the sites and adds Sproge to the first
cluster with Floby and Bengtsfors. To aid further analysis, the
clusters are assigned colors, which are detailed in figures
\ref{blue-cluster} -- \ref{orange-cluster}.
\begin{figure}
\begin{itemize}
\item Floby
\item Bengtsfors
\item Sproge (for 1000-sample, 5-normalization)
\end{itemize}
\caption{Blue Cluster}
\label{blue-cluster}
\end{figure}
\begin{figure}
\begin{itemize}
\item J\"amsh\"og
\item Tors\aa{}s
\item \"Ossj\"o
\end{itemize}
\caption{Red Cluster}
\label{red-cluster}
\end{figure}
\begin{figure}
\begin{itemize}
\item Breds\"atra
\item L\"oderup
\end{itemize}
\caption{Yellow Cluster}
\label{yellow-cluster}
\end{figure}
\begin{figure}
\begin{itemize}
\item Leksand
\item Indal
\item Segerstad
\item Floby
\item Bengtsfors
\item Sproge
\item Skinnskatteberg
\item Orust
\item V\aa{}xtorp
\item F\aa{}r\"o
\item Asby
\item \AA{}rsunda
\item Anundsj\"o
\item Ankarsrum
\item Fole
\end{itemize}
\caption{Cyan Cluster}
\label{cyan-cluster}
\end{figure}
\begin{figure}
\begin{itemize}
\item Viby
\item Bara