\chapter{Methods}
\label{methods-chapter}
This chapter contains four sections. The first discusses related work
in dialectometry since its development in the middle of the 20th century,
starting with \namecite{seguy73} and continuing with
\namecite{goebl06}. The second section discusses the previous work on
statistical methods for syntactic dialectometry, as well as the
feature sets and distance measures developed in this dissertation. The
third and fourth sections cover the application of the work to Swedish
dialects. Specifically, the third section deals with input analysis:
which Swedish corpora were used and which annotators served as a basis for
extracting features. The fourth section covers output analysis,
detailing the methods used to process the distances between interview
sites so that they can be compared to the dialectology literature.
\section{Related Work}
\subsection{S\'eguy}
Measurement of linguistic similarity has always been a part of
linguistics. However, until \namecite{seguy73} dubbed a new set of
approaches `dialectometry', these methods lagged behind the rest of
linguistics in formality. S\'eguy's quantitative analysis of Gascogne
French, while not aided by computer, was the predecessor of more
powerful statistical methods that essentially require computers. It
also established the field's general dependence on well-crafted
dialect surveys that divide incoming data along traditional linguistic
boundaries: phonology, morphology, syntax, etc.
This makes both collection and analysis easier, although it requires
more work to combine separate analyses to produce a complete picture of dialect
variation.
The project to build the Atlas Linguistique et Ethnographique de la
Gascogne, which S\'eguy directed, collected data in a dialect survey
of Gascogne which asked speakers questions informed by different areas
of linguistics. For example, the pronunciation of `dog' ({\it chien})
was collected to measure phonological variation. It had two common
variants and many other rare ones: [k\~an], [k\~a], as well as [ka],
[ko], [kano], among others. These variants were, for the most part,
% or hat "chapeau": SapEu, kapEt, kapEu (SapE, SapEl, kapEl
known by linguists ahead of time, but their exact geographical
distribution was not.
The atlases, as eventually published, contained not only annotated
maps, but some analyses as well. These analyses were what S\'eguy named
dialectometry. Dialectometry differs from previous attempts to find
dialect boundaries in the way it combines information from the
dialect survey. Previously, dialectologists found isogloss
boundaries for individual items. A dialect boundary was generated when
enough individual isogloss boundaries coincided. However, for any real
corpus, there is so
much individual variation that only major dialect boundaries can
be captured this way.
S\'eguy reversed the process. He first combined survey data to get
a numeric score between each site. Then he posited dialect boundaries
where large distances resulted between sites. The difference is
important, because a single numeric score is easier to
analyze than hundreds of individual boundaries.
Much more subtle dialect boundaries are visible this way; where before
one saw only a jumble of conflicting boundary lines, now one sees
smaller, but consistent, numerical differences separating regions.
Dialectometry enables classification of gradient dialect boundaries,
since weak and strong boundaries can now be distinguished; previously,
weak boundaries were too uncertain to identify reliably.
However, S\'eguy's method of combination is simple both
linguistically and mathematically. When comparing two sites, any
difference in a response is counted as 1. Only identical
responses count as a distance of 0. Words are not analyzed
phonologically, nor are responses weighted by their relative amount
of variation. Finally, only geographically adjacent sites are
compared. This is a reasonable restriction, but later studies were
able to lift it because of the availability of greater computational
power. Work following S\'eguy's improves on both of these aspects. In
particular, Hans Goebl developed dialectometry models that are more
mathematically sophisticated while retaining the survey-style small
feature set.
\subsection{Goebl}
Hans Goebl emerged as a leader in the field of dialectometry,
formalizing its aims and methods. His primary contribution was the
development of various methods to combine individual distances into
global distances and global distances into global clusters. These
methods were more sophisticated mathematically than previous
dialectometry and operated on any features extracted from the data. His
analyses have primarily used the Atlas Linguistique de la France.
\namecite{goebl06} provides a summary of his work. Most relevant
are the measures Relative Identity Value and Weighted
Identity Value. They are general methods that are the basis for nearly
all subsequent fine-grained dialectometrical analyses. They have three
important properties. First, they are independent of the source
data. They can operate over any linguistic data for which they are
given a feature set, such as the one proposed by \namecite{gersic71} for
phonology. Second, they can compare data even for items that do not
have identical feature sets, unlike Ger\v{s}i\'c's measure $d$, for example,
which cannot compare consonants and vowels. Third, they can compare
data sets that are missing some entries. This improves on S\'eguy's
analysis by providing a principled way to handle missing survey
responses.
Relative Identity Value, when comparing any two items, counts the
number of features which share the same value and then discounts
(lowers) the importance of the result by the number of unshared
features. The result is a single percentage that indicates relative
similarity. This percentage, when measured between all pairs of sites
in a corpus, can be scaled to produce a dissimilarity. Note that the
presentation below splits Goebl's original equations into more
manageable pieces; the high-level equation for Relative Identity Value
is:
\begin{equation}
\frac{\textrm{identical}_{jk}} {\textrm{identical}_{jk} + \textrm{unidentical}_{jk}}
\label{riv}
\end{equation}
\noindent{}for two items $j$ and $k$ being compared. In this case,
\textit{identical} is
\begin{equation}
\textrm{identical}_{jk} = |f \in \textrm{\~N}_{jk} : f_j = f_k|
\end{equation}
where $\textrm{\~N}_{jk}$ is the set of features shared by $j$ and
$k$. In other words, of the total universe of features $N$, both $j$
and $k$ must contain the feature for it to be included in
$\textrm{\~N}_{jk}$. So if a feature occurs only in $j$ but not in
$k$, it will be included in $N$, but not in $\textrm{\~N}_{jk}$. This
ensures that the comparison $f_j = f_k$ is always valid, where $f_j$
and $f_k$ are the value of some feature $f$ for $j$ and $k$
respectively. \textit{unidentical} is defined similarly, except that
it counts over all features $N$, not just the shared features
$\textrm{\~N}_{jk}$. Here, features that occur in only $j$ or only $k$
contribute toward \textit{unidentical}'s total.
\begin{equation}
\textrm{unidentical}_{jk} = |f \in \textrm{N} : f_j \neq f_k|
\end{equation}
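
As a concrete illustration, the following Python sketch computes
Relative Identity Value for two items represented as dictionaries
mapping feature names to feature values. This is only an illustration
of equation \ref{riv}; the items and feature names are invented for
the example, and nothing here is taken from Goebl's implementation.

\begin{verbatim}
def relative_identity_value(j, k):
    """Relative Identity Value for two items given as dicts
    mapping feature names to feature values."""
    shared = set(j) & set(k)         # N~_jk: features present in both
    universe = set(j) | set(k)       # N: every feature seen in either item
    identical = sum(1 for f in shared if j[f] == k[f])
    # Features present in only one item count as unidentical.
    unidentical = sum(1 for f in universe
                      if f not in shared or j[f] != k[f])
    return identical / (identical + unidentical)

# Invented phonological items: value of each feature per item.
item_j = {"voice": "-", "place": "alveolar", "nasal": "-"}
item_k = {"voice": "+", "place": "alveolar"}
print(relative_identity_value(item_j, item_k))  # 1 / (1 + 2) = 0.33...
\end{verbatim}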
Weighted Identity Value (WIV) is a refinement of Relative Identity
Value. This measure defines some differences as more
important than others. In particular, feature values that only occur
in a few items give more information than feature values that appear
in a large number of items. \quotecite{wiersma09} normalization,
covered at the end of this chapter, reuses this idea for feature
ranking.
The reasoning behind this idea is fairly simple. Goebl
is interested in feature values that occur in only a few items. If a
feature has some value that is shared by all of the items, then all
items belong to the same group. This feature value provides {\it no}
useful information for distinguishing the items. The situation
improves if all but one item share the same value for a feature; at
least there are now two groups, although the larger group is still not
very informative. The most information is available if each item
being studied has a different value for a feature; the items fall
trivially into singleton groups, one per item.
Equation \ref{wiv-ident} implements this idea by discounting
the \textit{identical} count from equation \ref{riv} by
the amount of information that feature value conveys. The
amount of information, as discussed above, is based on the number of
items that share a particular value for a feature. If all items share
the same value for some feature, then \textit{identical} will be discounted all the
way to zero--the feature conveys no useful information.
Weighted Identity Value's equation for \textit{identical} is
therefore
\begin{equation}
\textrm{identical} = \sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_{j}}{(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.
\label{wiv-ident}
\end{equation}
\noindent{}The complete definition of Weighted Identity Value is
\begin{equation} \sum_i \frac{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.}
{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right. + |f \in \textrm{N} : f_j \neq f_k|}
\label{wiv-full}
\end{equation}
\noindent{}where $\textrm{agree}f_{j}$ is the number of items that
agree with item $j$ on feature $f$ and $Ni$ is the total number of
items ($w$ is the weight, discussed below). Because of the piecewise
definition of \textit{identical}, this number is always at least $1$
because $f_k$ already agrees with $f_j$. This equation takes the
count of shared features and weights them by the size of the sharing
group. The features that are shared with a large number of other
items get a larger fraction of the normal count subtracted. WIV is
similar to entropy from information theory, which forms the basis of
the Kullback-Leibler and Jensen-Shannon divergences described later
in this chapter \cite{lin91}. The difference is that WIV subtracts
values from 1 to make common features less important, while entropy
takes the logarithm. The result is similar, but the two divergences
are more theoretically principled in that they refer directly to information theory.
% TODO: CITE Shannon. Seriously, how did I avoid it?
For example, let $j$ and $k$ be sets of productions for the
underlying English segment /s/. The allophones of /s/ vary mostly on the feature
\textit{voice}. Seeing an unvoiced [s] for /s/ is less ``surprising'' than
seeing a voiced [z], so the discounting process should
reflect this. For example, assume that an English corpus contains 2000
underlying /s/ segments. If 500 of them are realized as [z], the
discounting for \textit{voice} will be as follows:
\begin{equation}
\begin{array}{c}
\textrm{identical}_{/s/\to[z]} = 1 - 500/2000 = 1 - 0.25 = 0.75 \\
\textrm{identical}_{/s/\to[s]} = 1 - 1500/2000 = 1 - 0.75 = 0.25
\end{array}
\label{wiv-voice}
\end{equation}
Each time /s/ surfaces as [s], it only receives 1/4 of a point
toward the agreement score when it matches another [s]. When /s/
surfaces as [z], it receives three times as much for matching
another [z]: 3/4 points towards the agreement score. If the
alternation is even more weighted toward faithfulness, the ratio
changes even more; if /s/ surfaces as [z] only 1/10 of the time,
then [z] receives 9 times more value for matching than [s] does.
The final value, $w$, which is what gives the name ``weighted
identity value'' to this measure, provides a way to control how much
is discounted. A high $w$ will subtract more from uninteresting
groups, so that \textit{voice} might be worth less than
\textit{place} for /t/ because /t/'s allophones vary more over
\textit{place}. In equation \ref{wiv-voice}, $w$ is left at 1 to
facilitate the presentation, but typically it is used like an ad-hoc
equivalent of information gain: the linguist can give more weight to
features that are believed to be salient.
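
To make the discounting concrete, the following Python sketch
reproduces the per-match credit of equation \ref{wiv-voice}, with the
weight $w$ left at 1 as in that example. It is only an illustration of
the discounting step, not Goebl's implementation.

\begin{verbatim}
def weighted_agreement(value, value_counts, total):
    """Credit earned when two items share a value for a feature,
    discounted by how common that value is (w = 1)."""
    return 1.0 - value_counts[value] / total

# The /s/ example: 2000 underlying /s/, 1500 realized as [s], 500 as [z].
counts = {"[s]": 1500, "[z]": 500}
print(weighted_agreement("[z]", counts, 2000))  # 0.75
print(weighted_agreement("[s]", counts, 2000))  # 0.25
\end{verbatim}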
\section{Dialectometry}
\label{methods-chapter-dialectometry-section}
It is at this point that the two types of analysis, phonological and
syntactic, diverge. Although Goebl's techniques are general enough to
operate over any set of features that can be extracted, better results
can be obtained by specializing the general measures above to take
advantage of properties of the input. Specifically, the application
of computational linguistics to dialectometry beginning in the 1990s
introduced methods from other fields. These methods, while generally
giving more accurate results quickly, are tied to the type of data on
which they operate.
Currently, the dominant phonological distance measure is Levenshtein
distance. This distance is essentially the count of differing
segments, although various refinements have been tried, such as
inclusion of distinctive features or phonetic
correlates. \namecite{heeringa04} gives an excellent analysis of the
applications and variations of Levenshtein distance. He investigated
varying levels of detail and differing feature sets. Interestingly,
although he extracted features from phonetic correlates, phonological
(distinctive) features, segments, and orthographic characters, the
more complex features failed to give any significant improvement over
simple segments. In addition, while Levenshtein
distance provides much information as a classifier, it is limited
because it must have a word-aligned corpus for comparison. A number of
statistical methods have been proposed that remove this requirement
such as \namecite{hinrichs07} and \namecite{sanders09}, but none have
been as successful on existing dialect resources, which are small and
are already word-aligned. New resources are not easy to develop
because the statistical methods still rely on a phonetic transcription
process.
\subsection{Syntactic Distance}
Recently, computational dialectometry has expanded to analysis of
syntax as well. The first work in this area was \quotecite{nerbonne06}
analysis of Finnish L2 learners of English, followed by
\quotecite{sanders07} analysis of British dialect areas. As explained
in chapters \ref{background-chapter} and \ref{questions-chapter},
syntax distance must be approached quite differently than phonological
distance. Syntactic corpora can be built quickly by automatically
annotating raw text, so it is easier to build a large syntactic corpus
than a phonological one; there is not yet a comparable method for
automatic phonological annotation. However, automatic annotators, while
faster, cannot compete with human annotators in quality of annotation.
This trade-off between annotation methods leads to the principal
difference between present phonological and syntactic corpora:
phonology data is word-aligned, keeping varying segments relatively
close, while syntax data is not sentence-aligned, meaning that
variation is distributed throughout the corpus. This difference leads
syntactic approaches naturally to statistical measures over large
amounts of data rather than more sensitive measures that operate on
small corpora.
\namecite{nerbonne06} were the first to use the syntactic distance
measure described below. They analyzed a corpus of Finnish L2 speakers
of English, divided by age. The first age group consisted of speakers
who learned English after childhood and the second of speakers who
learned English as children. Nerbonne \& Wiersma found a significant
difference between the two age groups. The features that were
unexpected in English contributed most to the difference; these were
associated primarily with the older age group. For example, some important
features for the older age group involved determiners, which English
has but Finnish does not. The features showed underuse of determiners,
as well as overuse, probably due to hypercorrection. Interestingly,
some of these features occur in the younger age group, but not as
often. Nerbonne \& Wiersma analyzed this pattern as interference from
Finnish; the younger age group learned English more completely with
less interference from Finnish.
My subsequent work in \cite{sanders07} and \cite{sanders08b}
expanded on the Finnish experiment in two ways. First, it introduced
leaf-ancestor paths as an alternative feature type. Second, it tested
the distance method on a larger set of sites: the Government Office
Regions of England, as well as Scotland and Wales, for a total of
11 sites. Each was smaller than the Finnish L2 age groups, so the
permutation test parameters had to be adjusted for some feature
combinations.
The distances between regions were clustered using hierarchical
agglomerative clustering, as described in section
\ref{cluster-analysis}. The resulting tree showed a North/South
distinction with some unexpected differences from previously
hypothesized dialect boundaries; for example, the Northwest region
clustered with the Southwest region. This contrasted with the
clustered phonological distances also produced in
\namecite{sanders08b}. In that experiment, there was no significant
correlation between the inter-region phonological distances and
syntactic distances.
There are several possible reasons for this lack of correlation. The
two distance measures may find different dialect boundaries based on
differences between syntax and phonology. Dialect boundaries may have
shifted during the 40 years between the collection of the SED and the
collection of the ICE-GB. One or both methods may be measuring the
wrong thing. In this dissertation, although the focus remains on results
of computational syntax distance as compared to traditional syntactic
dialectology, the discussion compares recent phonological
dialectometry results on Swedish to the results obtained here.
\subsubsection{Nerbonne and Wiersma}
\label{nerbonne06}
Due to the lack of alignment between the larger corpora available for
syntactic analysis, a statistical comparison of differences is more
appropriate than the simple symbolic approach possible with the
word-aligned corpora used in phonology. This statistical approach
means that a syntactic distance measure will have to use counting as
its basis.
\namecite{nerbonne06}'s method models syntax by part-of-speech (POS)
trigrams and uses differences between trigram type counts in a
permutation test of significance. The heart of the measure is simple:
the difference in type counts between the combined features of two
sites. \namecite{kessler01} originally proposed this measure, the
{\sc Recurrence} metric ($R$):
\begin{equation}
R = \sum_i |c_{ai} - c_{bi}|
\label{rmeasure}
\end{equation}
\noindent{}Given two sites $a$ and $b$, $c_a$ and $c_b$ are the
feature counts. $i$ ranges over all features, so $c_{ai}$ and $c_{bi}$ are the
counts of sites $a$ and $b$ for feature $i$. $R$ is designed to
represent the amount of variation exhibited by the two sites while
the contribution of individual features remains transparent to aid later
analysis. Unfortunately, $R$ does not indicate whether its results are
significant; a permutation test, described in section
\ref{permutationtest}, is needed for that.
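
A minimal Python sketch of equation \ref{rmeasure}, computing $R$ from
two dictionaries of feature counts (the feature names and counts are
invented for the example):

\begin{verbatim}
def recurrence_distance(counts_a, counts_b):
    """Kessler's R: summed absolute differences of per-feature counts."""
    features = set(counts_a) | set(counts_b)
    return sum(abs(counts_a.get(f, 0) - counts_b.get(f, 0))
               for f in features)

site_a = {"S-NP-N": 20, "S-VP-PP-N": 10}
site_b = {"S-NP-N": 15, "S-VP-PP-N": 15}
print(recurrence_distance(site_a, site_b))  # |20-15| + |10-15| = 10
\end{verbatim}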
\subsubsection{Dialectometry in British English}
The methods used in this dissertation are an evolution of those in my
previous work on British English: \cite{sanders07} and
\cite{sanders08b}. There, I compared phonological and syntactic
dialectometry as described above. The process is similar to Wiersma's
work in \cite{nerbonne06} and \cite{wiersma09}, but with
variants of both feature set and distance measure.
The input is 30 interview sites (described in section
\ref{syntactically-annotated-corpus}). The sentences in each site have
their features extracted (the features are described in section
\ref{syntactic-features}). Optionally, only 1000 sentences per site
are sampled with replacement; unlike the British interviews in my
previous work, the sites here are fairly similar in size, so this
sampling is only required for comparison with previous work. Then the features are
counted, producing a mapping of feature types to token counts.
At this point, two sites are compared based on these feature
counts. The feature counts are first normalized to account for
variation in corpus size (described in
the next section). Then they are converted to ratios, meaning
that the counts are scaled relative to the other site. For example,
counts of 10 and 30 would produce the ratio 1 to 3, as would the
counts 100 and 300. Finally, the distance (described above in
\ref{nerbonne06}) is calculated 10 times and the result is averaged.
The sites are sampled by sentence rather than by feature because the
intent is to capture syntax, where the composite unit is the
sentence. Similarly, phonology's composite unit is the word---most
processes operate within the word on individual segments; some
processes operate between words but they are fewer. Therefore, the
assumption that words are independent will lose some information but
not the majority. In the same way, the basic unit of syntax is the
sentence; processes operate on the words in the sentence, but
inter-sentence processes are fewer. Because of this, the sites are
sampled by sentence, combining the sentences of all speakers from an
interview site.
This dissertation skips the per-speaker sampling of
\quotecite{wiersma09} work on Finnish L2 speakers. I assume that,
since discovery of dialect features is the goal of this research, the
sentences of speakers from the same village are independent of the
speaker, at least with respect to dialect features. Although the
motivation is partly theoretical, there is also a difference between
the Swediasyn dialect corpus, with 2--4 speakers for each of 30 sites,
and Wiersma's L2 corpus, with dozens of speakers but only two
groups. Sampling per-speaker would not be feasible for the Swediasyn
because there aren't enough speakers per village.
\subsubsection{Normalization}
\label{normalization}
The two sites being compared can differ in size, even if the
samples contain the same number of sentences; if one site contains many
long sentences and the other contains many short ones, raw counts
will favor the features extracted from the long sentences simply
because each sentence yields more features. Additionally, the counts
are converted to ratios to ignore the effect of
frequency---in effect, this ranks features only by how much they differ
between the two sites, ignoring the question of how often they occur
relative to the other features extracted from the two sites. That is,
a high ratio for a rare feature that happens only ten times in both
sites is just as important as a high ratio for a common feature that
happens thousands of times.
The first normalization normalizes the counts for each feature within
the pair of sites $a$ and $b$. The purpose is to correct for
differences in sentence length: a site with longer sentences yields
relatively more feature tokens than a site with many short
sentences. Each feature count $i$ in a vector, for example
$a$, is converted to a frequency $f_i$ \[f_i=\frac{i}{N} \] where
$N$ is the length of $a$. For two sites $a$ and $b$ this produces
two frequency vectors, $f_{a}$ and $f_{b}$. Then the original counts in
$a$ and $b$ are redistributed according to the frequencies in
$f_a$ and $f_b$:
\[a'_{i} = \frac{f_{ai}(a_{i}+b_{i})}{f_{ai}+f_{bi}},
b'_{i} = \frac{f_{bi}(a_{i}+b_{i})}{f_{ai}+f_{bi}}\]
This redistributes the total of a pair from $a$ and $b$ based on
their relative frequencies. In other words, the total for each feature
remains the same:
\[ a_{i} + b_{i} = a'_{i} + b'_{i} \]
but the values of $a'_{i}$ and $b'_{i}$ are scaled by their frequency
within their respective vectors.
For example, assume that the two sites have 10 sentences each, with a
site $a$ with only 40 words and another, $b$, with 100 words. This
results in $N_a = 40$ and $N_b = 100$. Assume also that there is a
feature $i$ that occurs in both: $a_{i} = 8$ and $b_{i} = 10$. This
means that the relative frequencies are $f_{ai} = 8/40 = 0.2$ and
$f_{bi} = 10/100 = 0.1$. The first normalization will redistribute the
total count ($10 + 8 = 18$) according to relative frequencies. So
\[a'_i = \frac{0.2(18)}{0.2+0.1} = 3.6 / 0.3 = 12\] and
\[b'_i = \frac{0.1(18)}{0.2+0.1} = 1.8 / 0.3 = 6\] Now that 8 has
been scaled to 12 and 10 to 6, the fact that site $b$ has more words
has been normalized. This reflects the intuition that something that
occurs 8 of 40 times is more important than something that occurs 10
of 100 times.
% this is the (*2n / N) bit
The second normalization normalizes all values in both permutations
with respect to each other. This is simple: find the average number of
times each feature appears, then divide each scaled count by it. This
produces numbers whose average is 1.0 and which express how many
times more frequent than the average each feature is. The average
feature count is $N / 2n$, where $N$ is the number of feature
occurrences and $n$ is the number of feature
types in the combined sites. Division by two is necessary because $N$
sums occurrences over both sites, while each normalized count $a'_i$ or
$b'_i$ comes from a single site. Each entry in the ratio vector now becomes
\[ r_{ai} = \frac{2na'_i}{N}, r_{bi} = \frac{2nb'_i}{N}\]
For example, given the previous example numbers, this second
normalization first finds the average. Assuming 5 unique features for
$a$'s 40 total features and 30 for $b$'s total 100 features gives
\[n = 5 + 30 = 35\] and
\[N = 40 + 100 = 140\] Therefore, the average feature has $140 / 2(35)
= 2$ occurrences in $a$ and $b$ respectively. Dividing $a'_i = 12$ and
$b'_i = 6$ by this average gives $r_{ai} = 6$ and $r_{bi} = 3$. In
other words, $r_{ai}$ occurs 6 times more than the average feature.
Together, these normalizations control for the effect of variation in
sentence length (the first normalization), corpus size (the second
normalization), and relative overuse (the second
normalization). Furthermore, the normalizations can be iterated, with
the normalized output further re-normalized. This exaggerates the
differentiating effect of the normalization, which allows distance
measures to be more sensitive to feature count variations.
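
The two normalizations can be summarized in a short Python sketch.
This is an illustrative reimplementation of the formulas above, not
the code used for the experiments; it assumes that $N$, the number of
feature occurrences, equals the summed site sizes, as in the worked
example.

\begin{verbatim}
def normalize_pair(a, b, n_a, n_b, n_types):
    """First normalization: redistribute each feature's combined count
    between sites a and b according to their relative frequencies.
    Second normalization: divide by the average count per feature type.
    a, b map feature names to counts; n_a, n_b are the site sizes;
    n_types is the number of feature types in the combined sites."""
    a_prime, b_prime = {}, {}
    for f in set(a) | set(b):
        f_a = a.get(f, 0) / n_a          # relative frequency in a
        f_b = b.get(f, 0) / n_b          # relative frequency in b
        total = a.get(f, 0) + b.get(f, 0)
        a_prime[f] = f_a * total / (f_a + f_b)
        b_prime[f] = f_b * total / (f_a + f_b)
    average = (n_a + n_b) / (2 * n_types)    # N / 2n
    r_a = {f: v / average for f, v in a_prime.items()}
    r_b = {f: v / average for f, v in b_prime.items()}
    return r_a, r_b

# Worked example from the text: a_i = 8 of 40, b_i = 10 of 100,
# 5 + 30 = 35 feature types in the combined sites.
r_a, r_b = normalize_pair({"i": 8}, {"i": 10}, 40, 100, 35)
print(r_a["i"], r_b["i"])  # 6.0 3.0
\end{verbatim}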
\subsection{Syntax Features}
\label{syntactic-features}
In order to answer question 1, whether the distance measure agrees
with dialectology, a distance measure such as $R$ needs features that
capture the dialect syntax of the interview corpus given as
input. Following \namecite{nerbonne06}, I start with parts of speech,
then add the leaf-ancestor paths from my work on the ICE-GB
\cite{sanders07}, and finally add leaf-head paths and
phrase-structure rules, as well as variants on these features. These
feature sets each depend on a different type of automatic annotation,
which is described in section \ref{parsers}.
\namecite{nerbonne06} argue that POS trigrams can accurately represent
at least the important parts of syntax, similar to the way chunk
parsing can capture the most important information about a
sentence. If this is true, POS trigrams are a good starting point for
a language model; they are simple and easy to obtain in a number of
ways. They can either be generated by a tagger as Nerbonne
and Wiersma did, or taken from the leaves of the trees of a
syntactically annotated corpus as I did with the
International Corpus of English \cite{sanders07}.
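
For reference, extracting POS trigrams from a tagged sentence is
straightforward; a minimal Python sketch (the tag names are invented
for the example):

\begin{verbatim}
def pos_trigrams(tags):
    """Return the list of part-of-speech trigrams for one sentence."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

print(pos_trigrams(["Det", "N", "V", "Prep", "Det", "N"]))
# [('Det', 'N', 'V'), ('N', 'V', 'Prep'),
#  ('V', 'Prep', 'Det'), ('Prep', 'Det', 'N')]
\end{verbatim}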
Of course, bigrams are a possible feature type since they are so
similar to trigrams. I do not use them here for two reasons. First,
previous work uses trigrams, so trigrams are needed in order to remain
comparable. Second, the only advantage bigrams offer over trigrams is
reduced sparseness and noise, and neither feature sparseness nor noise
is a problem for trigrams when used with the distance measures
developed here, as will be seen in the results in chapter
\ref{results-chapter}.
On the other hand, if syntax is in fact a phenomenon that involves
hidden structure above the visible words of the sentence, a feature
set should be constructed to capture that
structure. \quotecite{sampson00} leaf-ancestor paths provide one way
to do this: for each leaf in the parse tree, leaf-ancestor paths
produce the path from that leaf back to the root. Generation is simple
as long as every sibling is unique. For example, the parse tree
\Tree[.S [.NP [.Det the ] [.N dog ] ] [.VP [.V barks ] ] ]
creates the following leaf-ancestor paths:
\begin{itemize}
\item S-NP-Det-the
\item S-NP-N-dog
\item S-VP-V-barks
\end{itemize}
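
The following Python sketch generates leaf-ancestor paths from a
constituency tree represented as nested (label, children) tuples with
bare strings as leaves. The representation is assumed for the
illustration, and the sketch omits the bracket disambiguation
described next.

\begin{verbatim}
def leaf_ancestor_paths(tree):
    """One root-to-leaf label path per word; no bracket disambiguation."""
    paths = []
    def walk(node, prefix):
        if isinstance(node, str):              # a leaf: emit its path
            paths.append("-".join(prefix + [node]))
        else:
            label, children = node
            for child in children:
                walk(child, prefix + [label])
    walk(tree, [])
    return paths

tree = ("S", [("NP", [("Det", ["the"]), ("N", ["dog"])]),
              ("VP", [("V", ["barks"])])])
print(leaf_ancestor_paths(tree))
# ['S-NP-Det-the', 'S-NP-N-dog', 'S-VP-V-barks']
\end{verbatim}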
There is one path for each word, and the root appears in each of
them. However, there can be ambiguities if some node happens to have
identically labeled siblings; in that case, brackets must be inserted
in the paths to disambiguate the siblings. Sampson gives the example
of the two trees
\Tree[.A [.B p q ] [.B r s ] ]
and
\Tree[.A [.B p q r s ] ]
which would both produce
\begin{itemize}
\item A-B-p
\item A-B-q
\item A-B-r
\item A-B-s
\end{itemize}
There is no way to tell from the paths which leaves belong to which
B node in the first tree, and there is no way to tell the paths of
the two trees apart despite their different structure. To avoid this
ambiguity, Sampson uses a bracketing system; brackets are inserted
at appropriate points to produce
\begin{itemize}
\item $[$A-B-p
\item A-B]-q
\item A-[B-r
\item A]-B-s
\end{itemize}
and
\begin{itemize}
\item $[$A-B-p
\item A-B-q
\item A-B-r
\item A]-B-s
\end{itemize}
Left and right brackets are inserted, at most one of each per path. A
left bracket is inserted in a path whose leaf is a leftmost sibling,
and a right bracket in a path whose leaf is a rightmost sibling. The
bracket is inserted at the highest node for which the leaf is leftmost
or rightmost. To see how this works, it is worth deriving the
bracketing of the previous two trees in detail.
In the first tree, with two B
siblings, the first path is A-B-p. Since $p$ is a leftmost child,
a left bracket must be inserted, at the root in this case. The
resulting path is [A-B-p. The next leaf, $q$, is rightmost, so a right
bracket must be inserted. The highest node for which it is rightmost
is B, because the rightmost leaf of A is $s$. The resulting path is
A-B]-q. Contrast this with the path for $q$ in the second tree; here $q$
is not rightmost, so no bracket is inserted and the resulting path is
A-B-q. $r$ is in almost the same position as $q$, but reversed: it is
leftmost, and the second B is the highest node for which it is
leftmost, producing A-[B-r. Finally, since $s$ is the rightmost leaf of
the entire sentence, the right bracket appears after A: A]-B-s.
At this point, the alert reader will have
noticed that both a left bracket and right bracket can be inserted for
a leaf with no siblings since it is both leftmost and rightmost. That is,
a path with two brackets on the same node could be produced: A-[B]-c. Because
of this redundancy, single children are
excluded by the bracket markup algorithm. There is still
no ambiguity between two single leaves and a single node with two
leaves because only the second case will receive brackets.
% See for yourself:
% \[\xymatrix{
% &\textrm{A} \ar@{-}[dl] \ar@{-}[dr] &\\
% \textrm{B} \ar@{-}[d] &&\textrm{B} \ar@{-}[d] \\
% \textrm{p} && \textrm{q} \\
% }
% \]
% \[\xymatrix{
% &\textrm{A} \ar@{-}[d] &\\
% &\textrm{B} \ar@{-}[dl] \ar@{-}[dr] & \\
% \textrm{p} && \textrm{q} \\
% }
% \]
% \cite{sampson00} also gives a method for comparing paths to obtain an
% individual path-to-path distance, but this is not necessary for the
% permutation test, which treats paths as opaque symbols.
Sampson originally developed leaf-ancestor paths as an improved
measure of similarity between gold-standard and machine-parsed trees,
to be used in evaluating parsers. The underlying idea of a collection
of features that capture distance between trees transfers quite nicely
to this application. I replaced POS trigrams with leaf-ancestor paths
for the ICE corpus and found improved results on smaller sites than
Nerbonne and Wiersma had tested \cite{sanders07}. The additional
precision that leaf-ancestor paths provide appears to aid in attaining
significant results.
\subsubsection{Leaf-Head Paths}
\label{leaf-head-paths}
For dependency parses, it is easy to create a variant of leaf-ancestor
paths called ``leaf-head paths''. Like leaf-ancestor paths, each word
in the sentence is associated with a single leaf-head path. The
difference is that the path is from the leaf to the head of the sentence via the
intermediate heads. For example, the same sentence, ``The dog barks'',
produces the following leaf-head paths, given the dependency parse in
figure \ref{example-dep-parse}:
\begin{figure}
\[\xymatrix{
& & root \\
Det \ar@/^/[r]^{DT} & N \ar@/^/[r]^{SS} & V \ar@{.>}[u] \\
The & dog & barks
}
\]
\caption{Dependency parse for ``The dog barks.''}
\label{example-dep-parse}
\end{figure}
\begin{itemize}
\item root-V-N-Det-the
\item root-V-N-dog
\item root-V-barks
\end{itemize}
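
A leaf-head path can be read off a dependency parse by following head
links from each word up to the root. The Python sketch below assumes a
simple representation of the parse as (word, POS, head) triples with
1-indexed heads and 0 for the root; it is an illustration, not the
extraction code used in the experiments.

\begin{verbatim}
def leaf_head_paths(parse):
    """parse: list of (word, pos, head) triples, 1-indexed heads,
    0 meaning the root. Returns one path per word."""
    paths = []
    for word, pos, head in parse:
        labels = [word, pos]             # the leaf and its own POS tag
        current = head
        while current != 0:              # climb through intermediate heads
            labels.append(parse[current - 1][1])
            current = parse[current - 1][2]
        labels.append("root")
        paths.append("-".join(reversed(labels)))
    return paths

# "The dog barks": 'the' depends on 'dog', 'dog' on 'barks'.
parse = [("the", "Det", 2), ("dog", "N", 3), ("barks", "V", 0)]
print(leaf_head_paths(parse))
# ['root-V-N-Det-the', 'root-V-N-dog', 'root-V-barks']
\end{verbatim}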
The biggest difference between leaf-ancestor paths and leaf-head paths
is the relative length of the paths: long
leaf-ancestor paths indicate deep nesting of structure, while short
ones indicate flatter structure. Length is a
weaker indicator of deep structure for leaf-head
paths; for example, the verb in a nested clause has a much shorter
leaf-head path than leaf-ancestor path, but its dependents have
comparable lengths between the two types of paths. Instead, length of
path measures centrality to the sentence; longer leaf-head paths
indicate less important words.
Leaf-head paths represent a compromise between leaf-ancestor paths and
trigrams. Like trigrams, they capture lexical context, but the context
is based on head dependencies, so long-distance context is
possible. Like leaf-ancestor paths, they capture information about the
nested structure of the sentence, although not as completely or
explicitly.
\subsection{Alternate Feature Sets}
\label{alternate-feature-sets}
This section describes additional feature sets beyond the three main
ones already described above: trigrams, leaf-ancestor paths, and
leaf-head paths. Most are variants of these three.
\subsubsection{Part-of-Speech Unigrams}
Part-of-speech unigrams are single parts of speech. Unlike POS
trigrams, they do not capture context or order, only distributional
differences. In this dissertation, they serve as a baseline since they
are not expected to capture syntactic variation as much as the other
feature sets.
\subsubsection{Phrase Structure Rules}
Phrase structure rules are extracted from the same parses as
leaf-ancestor paths, but instead of capturing a series of parent-child
relations, they capture single-level parent-child-sibling
relations. For example, given the tree in figure
\ref{psg-example-tree}, the extracted rules are given in
figure \ref{psg-example}.
\begin{figure}
\Tree[.S [.NP [.Det the ] [.N dog ] ] [.VP [.V barks ] ] ]
\caption{Example Tree}
\label{psg-example-tree}
\end{figure}
\begin{figure}
\begin{tabular}{ccc}
\Tree[.S NP VP ] & \Tree[.NP Det N ] & \Tree[.VP V ] \\
\end{tabular}
\caption{Phrase-Structure Rules Extracted}
\label{psg-example}
\end{figure}
Phrase structure rules are most similar to leaf-ancestor paths in
emphasizing the hidden structure of constituency parse trees. Unlike
leaf-ancestor paths, they capture some context to the left and
right. They also cover only one level in the tree, whereas
leaf-ancestor paths traverse it from leaf to root. Phrase structure
rules may be useful in sentences where context is
important, but they also depend on having accurate parses even at the
top of the tree, which is difficult for automatic parsers to achieve.
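
Extraction of phrase structure rules from the same tree representation
used in the leaf-ancestor path sketch above could look like the
following; this is again only an illustration.

\begin{verbatim}
def phrase_structure_rules(tree):
    """One parent -> children rule per internal node whose children are
    labeled nodes; POS-over-word nodes are not expanded further."""
    rules = []
    def walk(node):
        label, children = node
        if all(isinstance(c, tuple) for c in children):
            rules.append((label, tuple(c[0] for c in children)))
            for child in children:
                walk(child)
    walk(tree)
    return rules

tree = ("S", [("NP", [("Det", ["the"]), ("N", ["dog"])]),
              ("VP", [("V", ["barks"])])])
print(phrase_structure_rules(tree))
# [('S', ('NP', 'VP')), ('NP', ('Det', 'N')), ('VP', ('V',))]
\end{verbatim}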
\subsubsection{Grandparent Phrase Structure Rules}
Grandparent phrase structure rules are a variant of phrase structure
rules that include the grandparent as well. Given the tree in figure
\ref{psg-example-tree}, the extracted features are given in
figure \ref{grand-psg-example}.
\begin{figure}
\begin{tabular}{ccc}
\Tree[.ROOT [.S NP VP ] ] & \Tree[.S [.NP Det N ] ] &
\Tree[.S [.VP V ] ] \\
\end{tabular}
\caption{Grandparent Phrase-Structure Rules Extracted}
\label{grand-psg-example}
\end{figure}
Grandparent phrase structure rules add some of the vertical information present in
leaf-ancestor paths, hopefully without introducing data sparseness
problems, while retaining the advantage over
leaf-ancestor paths of capturing left and right context.
\subsubsection{Arc-Head Paths}
As described in section \ref{leaf-head-paths}, the usual labels for
leaf-head paths are the leaves of the tree: `root-V-N-Det-the' is the
first leaf-head path for ``The dog barks'', which has the parts of
speech ``Det N V''. However, one can
also use the arc labels of the dependency parse to create arc-head
paths. These paths have the same shape as their corresponding
leaf-head paths, but use the labels of the dependency arcs between
words instead of the parts of speech of the words themselves.
The sentence for the leaf-head example is given in figure
\ref{example-dep-parse}, and the resulting arc-head paths are
\begin{itemize}
\item root-SS-DT-the
\item root-SS-dog
\item root-barks
\end{itemize}
% TODO: To the future work section, add:
% 1. Extract deps from CFG parses from Berkeley
% 2. Label dep features with both arc and POS tags interleaved in the
% proper order.
% 3. Better tag set. (duh, probably already have this one)
% 4. Non-linear feature set combination.
\subsubsection{Tags from Berkeley Parser}
The Berkeley parser \cite{petrov06}, as described in section
\ref{parsers}, can either tag incoming sentences with its own part of
speech tagger, using the same splitting process as the rest of the
parser, or with parts of speech specified externally. In this case,
the external part-of-speech tagger is T`n'T \cite{brants00}. Although
the Berkeley-generated POS tags are not as accurate, it is useful to
see how they change the overall results: accurate parts of speech seem
to be necessary for generating good features, so this variant shows how
much the results degrade when given lower-quality parts of speech. The
Berkeley-generated POS tags are used to generate both trigrams and, by
feeding them into MaltParser, leaf-head paths.
To get the Berkeley parser to generate parts of speech, it is given
the interview sites directly, skipping the tagging by T`n'T. Using
the same method it uses to parse the higher structure of the
sentence, it also tags words with parts of speech. After parsing,
these parts of speech are extracted from the leaves of the parse
trees. First, the parts of speech are used to create trigram
features. Second, the parts of speech are given to MaltParser for
dependency parsing. This produces dependency parses based on parts of
speech from the Berkeley parser. The result is trigrams and leaf-head
paths based on Berkeley-generated POS tags instead of T`n'T-generated
POS tags.
\subsubsection{Dependencies from alternate MaltParser training}
Since MaltParser uses Nivre's oracle-based dependency parsing
algorithm, the default oracle, based on support vector machines, can
be replaced with Timbl, the Tilburg Memory-Based Learner. It is
possible that a memory-based learner improves parsing
because support vector machines depend on large
training corpora to provide good results. In contrast, a memory-based
learner can obtain good results on limited training if the training
happens to be representative and the right combination of parameters
can be found for Timbl.
This is, however, somewhat complicated since Timbl is quite sensitive
to parameter changes and usually requires specific tuning for
particular tasks. To find the best
parameters, I used a manual search across a number of the major
distance measures provided by Timbl, as well as fallback combinations
from more complicated distance measures to less complicated ones.
Each combination was evaluated with ten-fold cross-validation on
Talbanken. The best combination was Jeffrey divergence with 5 nearest
neighbors, no feature weighting, inverse distance neighbor
weighting, and fallback to the Overlap metric for fewer than two
neighbors. Jeffrey divergence is a symmetric variant of
Kullback-Leibler divergence, also described in section
\ref{kl-divergence}. These parameter settings were used as a basis for
parsing and generation of leaf-head paths.
% \subsubsection{Within-clause Dependency/Leaf-ancestor paths}
% I haven't done this. This is interview data and there are not many
% nested clauses---probably less than 1 in 3 and I don't think it would make much
% difference. Besides, it would be difficult to specify a set of
% criteria for cutting off within-a-clause---simply removing everything
% between the root and the first S would miss some nested clauses.
% All right, so maybe you really could just use parts of speech to
% tell. Even if it worked, I still don't think there would be much
% difference because very few of the important features without
% within-clause cutoff have multiple clauses--most are simple and
% non-nested. Oh, right. Simplifying to ignore clause nesting would
% increase the power of phenomena that happen regardless of nesting
% level but don't occur enough to be visible otherwise.
\subsection{Combining Feature Sets}
\label{combine-feature-sets}
Combining feature sets gives the classifier more information about a
site, pooling the information that each individual feature set
captures. This dissertation uses a simple linear combination. In other
words, all features are counted together with equal
weight. This is easy and should allow the feature ranker
to find a greater variety of features that capture the
same underlying syntactic information.
% I don't think I'll do this (I probably won't even do the Volk thing
% since it involves a lot of parameterisation in icecore.h.
% Well, I *could* instead replicate each feature by its weight
% before combining, but this would probably make classification super
% slow because of the huge number of features.)
% ----
% Combinations of feature types will be ranked by
% averaging the number of significant distances that the constituent
% feature types produce.
\subsection{Alternate Distance Measures}
\label{alternate-distance-measures}
There are several reasons to test distance measures besides $R$. A
priori, $R$ is fairly simple, so more complicated variations on it may
provide better sensitivity, though possibly at the cost of greater
sensitivity to noise. Variations also explore the measure space better
in case $R$ is not significant for some combination of corpus and
feature set.
Post hoc, the combination of distance measure and feature set produces
interesting patterns of statistical significance that are not trivially
obvious. These patterns were not expected, but they may provide insight
into the measure/feature combinations, which helps resolve Hypotheses 1
and 2.
% Another possibility is a return to Goebl's Weighted Identity Value;
% this classifier is similar in some ways to $R$, but has not been
% tested with large corpora, to my knowledge at least.
\subsubsection{Kullback-Leibler divergence}
\label{kl-divergence}
Kullback-Leibler divergence, or relative entropy, is described in
\namecite{manningschutze}. Relative entropy is similar to $R$ but
more widely used in computational linguistics. The name relative
entropy implies an intuitive interpretation: it is the number of bits
of entropy incurred when compressing a site $b$ with the optimal
compression scheme for a second site $a$. Unless the two sites
are identical, the relative entropy $KL(a||b)$ is non-zero because
$a$'s optimal compression scheme will over-compress $b$'s
features that are more common in $a$ than in $b$, whereas it will
under-compress features that are less common in $a$ than in $b$.
For example, assume that site $a$ has two features with type counts
\{S-NP-N : 20, S-VP-PP-N : 10\}. An optimal compression scheme for $a$
would give S-NP-N a shorter code than S-VP-PP-N because it occurs
twice as often. However, if this compression scheme is used on a
site $b$ with the feature counts \{S-NP-N : 15, S-VP-PP-N : 15\},
efficiency will be worse; S-NP-N and S-VP-PP-N occur the same number
of times in $b$, so the smaller compressed size of S-NP-N will be used
less often than expected, while the larger compressed size of
S-VP-PP-N will be used more. This difference can be measured precisely
for each feature:
\[ a_i \log\frac{a_i}{b_i} \]
where $a_i$ is type count of the $i$th feature in $a$ and $b_i$
is the type count of the $i$th feature in $b$. This measures the
number of bits lost, or entropy, for each feature $i$. Like $R$'s
differences, the per-feature entropy can be summed to find the total
entropy. In the example above, the entropy for S-NP-N is $20
\log\frac{20}{15} = 5.75$.
However, Kullback-Leibler divergence as defined is asymmetric: it
measures the divergence of the features of site $b$ from the
features of site $a$. A dissimilarity is required for
dialectology, which means that the divergence must additionally be
symmetric. A divergence can be made symmetric by calculating it twice:
the divergence from $a$ to $b$ added to the one from $b$ to
$a$. The complete formula is given in equation \ref{klmeasure} and
the complete example is worked in equation \ref{klexample}.
\begin{equation}
KL(a||b) = \sum_i {a_i \log\frac{a_i}{b_i} + b_i \log\frac{b_i}{a_i}}
\label{klmeasure}
\end{equation}
\begin{equation}
(20 \log\frac{20}{15} + 15 \log\frac{15}{20}) + (10
\log\frac{10}{15} + 15 \log\frac{15}{10}) = (5.75 - 4.32) + (-4.05 +
6.08) = 3.46
\label{klexample}
\end{equation}
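
A short Python sketch of the symmetric divergence in equation
\ref{klmeasure}, using natural logarithms as in the worked example.
It is an illustration only; as discussed below, features with a zero
count in either site are simply skipped.

\begin{verbatim}
from math import log

def symmetric_kl(counts_a, counts_b):
    """Symmetrized Kullback-Leibler divergence over feature counts;
    features with a zero count in either site are skipped."""
    total = 0.0
    for f in set(counts_a) | set(counts_b):
        a, b = counts_a.get(f, 0), counts_b.get(f, 0)
        if a > 0 and b > 0:
            total += a * log(a / b) + b * log(b / a)
    return total

site_a = {"S-NP-N": 20, "S-VP-PP-N": 10}
site_b = {"S-NP-N": 15, "S-VP-PP-N": 15}
print(round(symmetric_kl(site_a, site_b), 2))
# 3.47 (equation klexample gives 3.46 by rounding intermediate terms)
\end{verbatim}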
\subsubsection{Jensen-Shannon divergence}
Several variants of relative entropy exist that lift various restrictions from the input
distributions. One is Jensen-Shannon divergence \cite{lin91}, which
was designed as a dissimilarity from the start. It uses the same
denominator for both directions: the average of the two
frequencies. That means that each feature's entropy is found using
the following formula:
\[ a_i \log\frac{a_i}{(a_i + b_i) / 2} + b_i \log\frac{b_i}{(a_i + b_i) / 2} \]
There is a common subexpression in this value, $(a_i + b_i) / 2$: the
average of the two feature counts. If we let $\bar{c_i} = (a_i + b_i) / 2$
and rewrite the formula to take advantage of this simplification, we get
equation \ref{jsmeasure}.
\begin{equation}
JS = \sum_i {a_i \log\frac{a_i}{\bar{c_i}} + b_i \log\frac{b_i}{\bar{c_i}}}
\label{jsmeasure}
\end{equation}
Unlike Kullback-Leibler divergence, Jensen-Shannon divergence does not
require that features exist in both sites being compared in order to
be counted. KL divergence cannot count unique features, in fact,
because if either $a_i$ or $b_i$ is zero, then it will divide by zero
at some point. The current implementation of KL divergence simply
skips zero values, which means it ignores features unique to a
particular site. Jensen-Shannon divergence avoids this problem because
it divides by $\bar{c_i}$, the average of the feature counts.
Because KL divergence ignores features unique to one site, it should
be less susceptible to noise than JS divergence. This can be useful in
the presence of unreliable annotators. However, KL divergence will
also produce less detail in the case that unique features are useful.
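
A corresponding Python sketch of equation \ref{jsmeasure} shows how
features with a zero count in one site still contribute; the feature
counts are invented for the example.

\begin{verbatim}
from math import log

def jensen_shannon(counts_a, counts_b):
    """Jensen-Shannon-style divergence over feature counts: each count
    is compared against the average of the two sites' counts."""
    total = 0.0
    for f in set(counts_a) | set(counts_b):
        a, b = counts_a.get(f, 0), counts_b.get(f, 0)
        c_bar = (a + b) / 2
        if a > 0:
            total += a * log(a / c_bar)
        if b > 0:
            total += b * log(b / c_bar)
    return total

# S-NP-Pro is unique to site a but still contributes to the total.
site_a = {"S-NP-N": 20, "S-VP-PP-N": 10, "S-NP-Pro": 5}
site_b = {"S-NP-N": 15, "S-VP-PP-N": 15}
print(round(jensen_shannon(site_a, site_b), 2))  # 4.33
\end{verbatim}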
\subsubsection{Cosine similarity}
Cosine similarity is used in many parts of computational linguistics
and related areas such as information extraction and data
mining. \namecite{nerbonne06} use it as a reference point for
comparison to previous work in these areas. Cosine similarity measures
the similarity between two high-dimensional points in space. Each
feature is modeled as a dimension, and the type count from each
site is plotted as a point on that dimension. In equation \ref{cosmeasure}, vectors $a$
and $b$ are multiplied, then divided by the product of their