\documentclass[11pt]{article}
% \usepackage{setspace}
\usepackage[all]{xy}
\usepackage{robbib}
\author{Nathan Sanders, Indiana University \\ \tt{[email protected]}}
\title{Syntax Distance for Dialectometry}
\begin{document}
% \doublespacing
\maketitle
% TODO: rewrite specific references to parsers to either back off to
% genericities or mention Berkeley parser instead of Collins parser.
% TODO: TnT has a 'mark unknown words' option.
\section{Introduction}
This dissertation will examine syntax distance in dialectometry using
computational methods. It continues my previous
work \cite{sanders07}, \cite{sanders08b} and earlier work by
\namecite{nerbonne06}, who proposed the first computational measure of syntax
distance. Dialectometry has existed as a field since
\namecite{seguy73} and is a subfield of dialectology
\cite{chambers98}. Recently, computational methods have come to
dominate dialectometry, but they are narrower in focus than
previous work: most have explored phonological distance only, while
earlier methods integrated phonological, lexical, and syntactic data.
Dialectology is the study of linguistic variation. % in space / over
% distance / other variables.
Its goal is to characterize the linguistic features that
separate two language varieties. Dialectometry is a subfield of
dialectology that uses mathematically sophisticated methods to extract
and combine linguistic features. In recent years it has
been associated with computational linguistic work, most of which
has focused on phonology, starting with
\namecite{kessler95}, followed by \namecite{nerbonne97} and
\namecite{nerbonne01}. \namecite{heeringa04} provides a comprehensive
review of phonological distance in dialectometry as well as some new
methods.
In dialectometry, a distance measure can be defined in two parts:
first, a method of decomposing the linguistic data into minimal,
linguistically meaningful features, and second, a method of combining
the features in a mathematically and linguistically sound way. Figure
\ref{abstract-distance-measure-model} gives an overview of how the
model works. Input consists of two corpora; each item in each corpus
is decomposed into a set of features extracted by $f(s)$. The
resulting corpora are then compared by $d(S,T)$, which combines the
corpora into a single number: the distance.
\begin{figure}
\[\xymatrix@C=1pc{
\textrm{Corpus} \ar@{>}[d]|{f(s)} &
S = s_0,s_1,\ldots
\ar@{>}[d] % \ar@<2ex>[d] \ar@<-2ex>[d]
&&
T = t_0,t_1,\ldots
\ar@{>}[d] % \ar@<2ex>[d] \ar@<-2ex>[d]
\\
*\txt{Decomposition} \ar@{>}[d]|{d(S,T)} &
*{\begin{array}{c}
\left[ + f_0, +f_1 \ldots \right], \\
\left[ - f_0, +f_1 \ldots \right], \\
\ldots \\ \end{array}}
\ar@{>}[dr]
&&
*{\begin{array}{c}
\left[ + f_0, -f_1 \ldots \right], \\
\left[ + f_0, -f_1 \ldots \right], \\
\ldots \\ \end{array}}
\ar@{>}[dl] \\
\textrm{Combination} &
& \textrm{Distance} & \\
} \]
\caption{Abstract Distance Measure Model : $d \circ f$}
\label{abstract-distance-measure-model}
\end{figure}
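As a compact restatement of the model in Figure
\ref{abstract-distance-measure-model}, the distance between two corpora
can be written as a single expression. The name $\mathit{dist}$ and the
set braces marking the decomposed corpora are notation assumed here for
illustration only; $f$ and $d$ are as in the text.
\begin{equation}
\mathit{dist}(S,T) = d\bigl(\{f(s) : s \in S\},\; \{f(t) : t \in T\}\bigr)
\end{equation}
Decomposition is applied item by item, while combination sees each
decomposed corpus as a whole.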
Dialectometry has focused on phonological distance measures, while
syntactic measures have remained undeveloped. The most important
reason for this focus is that it is easier to define a distance
measure on phonology. In phonology, words decompose to segments and,
if necessary, segments further decompose to phonological
features. This decomposition is straightforward and based on
\namecite{chomsky68}. For combination, string alignment, or Levenshtein
distance \cite{lev65}, is a well-understood algorithm used for
measuring changes between any two sequences of characters taken from a
common alphabet. Levenshtein distance is simple mathematically, and
has the additional advantage that its intermediate data structures are
easy to interpret linguistically.
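
The algorithm can be stated as a short recurrence. What follows is the
standard unit-cost formulation rather than anything specific to the
dialectometry literature, and the names $D$, $a$, $b$, and
$\textrm{sub}$ are chosen here for illustration. For strings
$a = a_1 \ldots a_m$ and $b = b_1 \ldots b_n$, let $D(i,j)$ be the
distance between the first $i$ symbols of $a$ and the first $j$ symbols
of $b$, with $D(i,0) = i$ and $D(0,j) = j$. Then
\begin{equation}
D(i,j) = \min \left\{
\begin{array}{l}
D(i-1,j) + 1 \\
D(i,j-1) + 1 \\
D(i-1,j-1) + \textrm{sub}(a_i,b_j)
\end{array} \right.
\end{equation}
where $\textrm{sub}(a_i,b_j)$ is 0 if $a_i = b_j$ and 1 otherwise. The
Levenshtein distance is $D(m,n)$. For example, the segment strings [ka]
and [kano] differ by two insertions, so their distance is 2. The table
of $D(i,j)$ values is the intermediate data structure that can be read
as an alignment of the two strings.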
A secondary reason for dialectometry's focus on phonology is that it
is inherited from dialectology's focus on phonology.
% (TODO:Cite?).
This might be solely due to the history of dialectology as a field, but it is
likely that more phonological than syntactic differences exist between
dialects, due to historically greater standardization
of syntax via the written form of language. Phonological
dialect features are less likely to be stigmatized and suppressed by a
standard dialect than syntactic ones.
% (TODO:Cite, probably
% Trudgill and Chambers something like '98, maybe where they talk about
% what aspects of dialects are noticed and stigmatized).
Whatever the reason, much less dialectology work on syntax is
available for comparison with new dialectometry results.
\subsection{Problems}
Because of the preceding two reasons, syntax is a relatively
undeveloped area in dialectometry. Currently, the literature lacks a
generally accepted syntax measure. Unfortunately, approaching the
problem by copying phonology is not a good solution; there are real
differences between syntax and phonology that mean phonological
approaches do not apply. For example, there are fewer differences to be
found in syntax, and they occur more sparsely.
% (TODO: Back this up either with reasoning or citation).
Dialectology, however, has traditionally worked with fairly small
corpora. This suffices for phonology, because
good features are easy to extract and there are many
consistent differences between corpora. For syntax, though, it is rarely possible
% (TODO: Weasel a bit)
to identify reliable features in small corpora.
Two approaches have been proposed to remedy this. The
first, proposed by \namecite{spruit08} for analyzing the Syntactic
Atlas of the Dutch Dialects \cite{barbiers05}, is to continue using
small dialectology corpora and manually extract features so that only
the most salient features are used. Then a sophisticated method of
combination such as Goebl's Weighted Identity Value (WIV), described
below and by \namecite{goebl06}, can be used to produce a
distance. WIV is more complex mathematically than Levenshtein
distance, and operates on any linguistic item. However, manual feature
extraction is not feasible in knowledge-poor or time-constrained
environments. It is also subject to bias from the
dialectologist: a manually chosen feature must already be known to
separate two dialects, so the best-known differences are the ones most
likely to be selected.
Manual extraction also ignores the specific properties of the syntax distance
problem. The second approach, pursued here, extracts features
automatically, since it is easy to define features for syntactic structure. This
proposal covers part-of-speech trigrams, leaf-ancestor paths, and
dependency paths over nodes, but many variations on these features are
possible, such as lexical trigrams, lexicalized leaf-ancestor paths,
or dependency paths over dependency arc labels. Methods from other
syntactic work in computational linguistics could apply too: supertags
\cite{joshi94}, convolution kernels \cite{collins01} or any number of
simpler features such as tree height, number of nodes, or number of
words. The problem is not finding a feature set. The problem is
finding a good feature set. Small corpora hamper this search by making
statistical significance difficult to achieve, especially since
syntactic dialect differences are expected to be less frequent than
phonological ones. Fortunately, syntactic corpora are typically larger
than phonological corpora because the annotation work is easier; much
of the syntactic annotation can be generated automatically and then
corrected manually.
Even with a feature set defined, a distance measure still requires a
method of combining features. One such method, a simple statistical
measure called $R$, has been proposed by \namecite{nerbonne06} based
on work by \namecite{kessler01}. At present, however, $R$ has not been
adequately shown to detect dialect differences. A small body of work
suggests that it does, but as yet there has not been a satisfying
correlation of its results with phonology or, as with phonological
distance, with existing results from the dialectology literature on
syntax.
Nerbonne \& Wiersma's first paper used $R$ for syntax distance
together with a test for statistical significance \cite{nerbonne06}.
Their experiment compared two generations of
Finnish L2 speakers of English, with part-of-speech trigrams as input features.
They found that the two generations were significantly
different, although they had to normalize the trigram counts to
account for differences in sentence length and complexity. However,
showing that two generations of speakers are significantly different with respect
to $R$ does not necessarily imply that the same will be true for other
types of language varieties. Specifically, for this dissertation, the
success of $R$ on generational differences does not imply success on
dialect differences.
I addressed this problem \cite{sanders08b} by measuring $R$ between
the nine Government Office Regions of England, using the International
Corpus of English Great Britain \cite{nelson02}. Speakers were classified by
birthplace. I also introduced Sampson's leaf-ancestor paths as
features \cite{sampson00}. I found statistically
significant differences between most corpora, using both trigrams and
leaf-ancestor paths as features. However, $R$'s distances were not
significantly correlated with Levenshtein distances. Nor did I
show any qualitative similarities between known syntactic dialect
features and the high-ranked features used by $R$ in producing its
distance. As a result, it is not clear whether the significant $R$ distances
reflect the same dialect divisions as dialectometric phonological distance
or the features identified by dialectologists.
% NOTE: 2-d stuff is not the primary problem, since we can't compare
% trees to trees anyway. The primary problem is comparing two corpora
% full of differing sentences. A secondary problem arises to make sure
% that the 2-d-extracted features aren't skewed one way or another. I
% guess I need to come up with a general justification for the
% normalizing and smoothing code from Nerbonne & Wiersma
% Additional problems: phonology is 1-dimensional, with one obvious way
% to decompose words into segments and segments into features. Syntax is
% 2-dimensional, so the decomposition must take several more factors
% into account so that the features it produces are
% useful and comparable to each other. And those features are \ldots
% % TODO Henrik Rosenkvist seems to
% % be the main guy interested in syntactic analysis of dialect distance
% Overview : Goal, Variables, Method
% Contribution
% Literature Review
% : (including theoretical background)
% Draw hypotheses from earlier studies
% Method
% :
% Experiment section as 'Corpus' section
% Goal: To extend existing measurement methods. To measure them
% better. To measure them on more complete data.
\section{Previous Work}
\subsection{S\'eguy}
Measurement of linguistic similarity has always been a part of
linguistics. However, until \namecite{seguy73} dubbed a new set of
approaches `dialectometry', these methods lagged behind the rest of
linguistics in formality. S\'eguy's quantitative analysis
of Gascogne French, while not aided by computer, was the predecessor
of more powerful statistical methods that essentially require a
computer. It also established the field's general dependence on
well-crafted dialect surveys that divide incoming data along
traditional linguistic boundaries: phonology, morphology, syntax, etc.
This division makes both collection and analysis easier, although it requires
more work to combine the separate analyses into a complete picture of dialect
variation.
The project to build the Atlas Linguistique et Ethnographique de la
Gascogne, which S\'eguy directed, collected data in a dialect survey
of Gascogne which asked speakers questions informed by different areas
of linguistics. For example, the pronunciation of `dog' ({\it chien})
was collected to measure phonological variation. It had two common
variants and many other rare ones: [k\~an], [k\~a], as well as [ka],
[ko], [kano], among others. These variants were, for the most part,
% or hat "chapeau": SapEu, kapEt, kapEu (SapE, SapEl, kapEl
known by linguists ahead of time, but their exact geographical
distribution was not.
The atlases, as eventually published, contained not only annotated
maps, but some analyses as well. These analyses were what S\'eguy named
dialectometry. Dialectometry differs from previous attempts to find
dialect boundaries in the way it combines information from the
dialect survey. Previously, dialectologists found isogloss
boundaries for individual items. A dialect boundary was generated when
enough individual isogloss boundaries coincided. However, for any real
corpus, there is so
much individual variation that only major dialect boundaries can
be captured this way.
S\'eguy reversed the process. He first combined survey data to get
a numeric score between each site. Then he posited dialect boundaries
where large distances resulted between sites. The difference is
important, because a single numeric score is easier to
analyze than hundreds of individual boundaries.
Much more subtle dialect boundaries are visible this way; where before
one saw only a jumble of conflicting boundary lines, now one sees
smaller, but consistent, numerical differences separating regions. Dialectometry
enables classification of gradient dialect boundaries, since now one
can distinguish weak and strong boundaries. Previously, weak
boundaries were too uncertain to identify.
However, S\'eguy's method of combination is simple both
linguistically and mathematically. When comparing two sites, any
difference in a response is counted as 1. Only identical
responses count as a distance of 0. Words are not analyzed
phonologically, nor are responses weighted by their relative amount
of variation. Finally, only geographically adjacent sites are
compared. This is a reasonable restriction, but later studies were
able to lift it because of the availability of greater computational
power. Work following S\'eguy's improves on both aspects. In
particular, Hans Goebl developed dialectometry models that are
more mathematically sophisticated.
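
S\'eguy's combination step can be sketched as a single sum. The
notation is assumed here for illustration and is not S\'eguy's own:
$I$ is the set of survey items and $r_a(i)$ is the response recorded at
site $a$ for item $i$.
\begin{equation}
d(a,b) = \sum_{i \in I} \left\{
\begin{array}{ll}
0 & \textrm{if } r_a(i) = r_b(i) \\
1 & \textrm{if } r_a(i) \neq r_b(i)
\end{array} \right.
\end{equation}
Under this scheme a site reporting [k\~a] for `dog' is exactly as far
from a site reporting [k\~an] as from one reporting [kano]: each
mismatch contributes 1, however similar the two forms are
phonologically.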
\subsection{Goebl}
Hans Goebl emerged as a leader in the field of dialectometry,
formalizing the aims and methods of dialectometry. His primary
contribution was development of various methods to combine individual
distances into global distances and global distances into global clusters. These
methods were more sophisticated mathematically than previous
dialectometry and operated on any features extracted from the data. His
analyses have used primarily the Atlas Linguistique de la France.
\namecite{goebl06} provides a summary of his work. Most relevant for
this paper are the measures Relative Identity Value and Weighted
Identity Value. They are general methods that are the basis for nearly
all subsequent fine-grained dialectometrical analyses. They have three
important properties. First, they are independent of the source
data. They can operate over any linguistic data for which they are
given a feature set, such as the one proposed by \namecite{gersic71} for
phonology. Second, they can compare data even for items that do not
have identical feature sets, unlike Ger\v{s}i\'c's $d$,
which cannot compare consonants with vowels. Third, they can compare
data sets that are missing some entries. This improves on S\'eguy's
analysis by providing a principled way to handle missing survey
responses.
Relative Identity Value, when comparing any two items, counts the
number of features which share the same value and then discounts
(lowers) the importance of the result by the number of unshared
features. The result is a single percentage that indicates
relative similarity. Calculating this value between all pairs
of items in two regions produces a matrix which can be used for
clustering or other purposes. Note that the presentation below splits
Goebl's original equations into more manageable pieces; the high-level
equation for Relative Identity Value is
\begin{equation}
\frac{\textrm{identical}_{jk}} {\textrm{identical}_{jk} + \textrm{unidentical}_{jk}}
\label{riv}
\end{equation}
for two items $j$ and $k$ being compared. Here
\textit{identical} is
\begin{equation}
\textrm{identical}_{jk} = |\{f \in \textrm{\~N}_{jk} : f_j = f_k\}|
\end{equation}
where $\textrm{\~N}_{jk}$ is the set of features shared by $j$ and
$k$ and $f_j$ and $f_k$ are the values of some feature $f$ for $j$ and
$k$ respectively. \textit{unidentical} is defined similarly, except
that it counts over all features $\textrm{N}$, not just the shared features
$\textrm{\~N}_{jk}$:
\begin{equation}
\textrm{unidentical}_{jk} = |\{f \in \textrm{N} : f_j \neq f_k\}|
\end{equation}
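As a worked illustration with invented counts (not taken from Goebl's
data), suppose two items $j$ and $k$ are compared on 8 features, 6 of
which have the same value and 2 of which differ. Taking the denominator
of Relative Identity Value to be the total of agreeing and disagreeing
features gives
\begin{equation}
\frac{6}{6 + 2} = 0.75
\end{equation}
so the two items are 75\% similar. Two items that agree on every
feature score 1, and two that agree on none score 0.
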
Weighted Identity Value is a refinement of Relative Identity
Value. This measure defines some differences as more
important than others. In particular, feature values that only occur
in a few items give more information than feature values that appear
in a large number of items. This
idea shows up later in the normalization of syntax distance given by
\namecite{nerbonne06}.
The mathematical reasoning behind this idea is fairly simple. Goebl
is interested in feature values that occur in only a few items. If a
feature has some value that is shared by all of the items, then all
items belong to the same group. This feature value provides {\it no}
useful information for distinguishing the items. The situation
improves if all but one item share the same value for a feature; at
least there are now two groups, although the larger group is still not
very informative. The most information is available if each item
being studied has a different value for a feature; the items fall
trivially into singleton groups, one per item.
Equation \ref{wiv-ident} implements this idea by discounting
the \textit{identical} count from equation \ref{riv} by
the amount of information that feature value conveys. The
amount of information, as discussed above, is based on the number of
items that share a particular value for a feature. If all items share
the same value for some feature, then \textit{identical} will be discounted all the
way to zero--the feature conveys no useful information.
Weighted Identity Value's equation for \textit{identical} is
therefore
\begin{equation}
\textrm{identical}_{jk} = \sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_{j}}{(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.
\label{wiv-ident}
\end{equation}
\noindent{}The complete definition of Weighted Identity Value is
\begin{equation} \sum_i \frac{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right.}
{\sum_f \left\{
\begin{array}{ll}
0 & \textrm{if } f_j \neq f_k \\
1 - \frac{\textrm{agree}f_j} {(Ni)w} & \textrm{if } f_j = f_k
\end{array} \right. + |\{f \in \textrm{N} : f_j \neq f_k\}|}
\label{wiv-full}
\end{equation}
\noindent{}where $\textrm{agree}f_{j}$ is the number of items that agree
with item $j$ on feature $f$ and $Ni$ is the total number of
items ($w$ is the weight, discussed below). Because of the
piecewise definition of \textit{identical}, $\textrm{agree}f_{j}$ is always at
least $1$ in the second case, since $k$ already agrees with $j$.
This equation takes the count of shared features and weights
them by the size of the sharing group. The features that are shared
with a large number of other items get a larger fraction of the normal
count subtracted.
For example, let $j$ and $k$ be sets of productions for the
underlying English segment /s/. The allophones of /s/ vary mostly on the feature
\textit{voice}. Seeing an unvoiced [s] for /s/ is less ``surprising'' than
seeing a voiced [z], so the discounting process should
reflect this. Suppose that an English corpus contains 2000
underlying /s/ segments. If 500 of them are realized as [z], the
discounting for \textit{voice} will be as follows:
\begin{equation}
\begin{array}{c}
\textrm{identical}_{/s/\to[z]} = 1 - 500/2000 = 1 - 0.25 = 0.75 \\
\textrm{identical}_{/s/\to[s]} = 1 - 1500/2000 = 1 - 0.75 = 0.25
\end{array}
\label{wiv-voice}
\end{equation}
Each time /s/ surfaces as [s], it only receives 1/4 of a point
toward the agreement score when it matches another [s]. When /s/
surfaces as [z], it receives three times as much for matching
another [z]: 3/4 points towards the agreement score. If the
alternation is even more weighted toward faithfulness, the ratio
changes even more; if /s/ surfaces as [z] only 1/10 of the time,
then [z] receives 9 times more value for matching than [s] does.
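
The more skewed case just mentioned can be worked through in the same
way. The corpus size of 2000 is carried over from the previous example
as an assumption, and $w$ is again left at 1. With [z] in 1/10 of the
realizations, 200 tokens surface as [z] and 1800 as [s]:
\begin{equation}
\begin{array}{c}
\textrm{identical}_{/s/\to[z]} = 1 - 200/2000 = 1 - 0.1 = 0.9 \\
\textrm{identical}_{/s/\to[s]} = 1 - 1800/2000 = 1 - 0.9 = 0.1
\end{array}
\end{equation}
The ratio of the two credits is $0.9 / 0.1 = 9$, which is the factor of
9 stated above.
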
The final value, $w$, which is what gives the name ``weighted
identity value'' to this measure, provides a way to control how much
is discounted. Because $w$ divides the discount in equation
\ref{wiv-ident}, a low $w$ subtracts more from uninteresting
groups and a high $w$ subtracts less. Under strong discounting,
\textit{voice} might be worth less than
\textit{place} for /t/ because /t/'s allophones vary more over
\textit{place}. In equation \ref{wiv-voice}, $w$ is left at 1 to
facilitate the presentation.
\section{Statistical Methods} % Computational? Mathematical?
It is at this point that the two types of analysis, phonological and
syntactic, diverge. Although Goebl's techniques are general enough to
operate over any set of features that can be extracted, better results
can be obtained by specializing the general measures above to take
advantage of properties of the input. Specifically, the application
of computational linguistics to dialectometry beginning in the 1990s
introduced methods from other fields. These methods, while generally
giving more accurate results quickly, are tied to the type of data on
which they operate.
% NEW
Currently, the dominant phonological distance measure is Levenshtein
distance. This distance is essentially the count of differing
segments, although various refinements have been tried, such as
inclusion of distinctive features or phonetic
correlates. \namecite{heeringa04} gives an excellent analysis of the
applications and variations of Levenshtein distance. While Levenshtein
distance provides much information as a classifier, it is limited
because it requires a word-aligned corpus for comparison. A number of
statistical methods have been proposed that remove this requirement,
such as those of \namecite{hinrichs07} and \namecite{sanders09}, but none has
been as successful on existing dialect resources, which are small and
are already word-aligned. New resources are not easy to develop
because the statistical methods still rely on a phonetic transcription
process.
% end NEW
% \begin{enumerate}
% \item I should really check around to see if there is any new work out
% there. Surely there is. Course John is free to do whatever works and
% Wybo may have graduated or something. So there might not be any more
% work on it.
% \item Explain leaf-ancestor paths, trigrams, dependency `paths' (to be
% invented).
% \end{enumerate}
\subsection{Syntactic Distance}
Recently, computational dialectometry has expanded to analysis of
syntax as well. The first work in this area was \quotecite{nerbonne06}
analysis of Finnish L2 learners of English, followed by
\quotecite{sanders07} analysis of British dialect areas. Syntax
distance must be approached quite differently than phonological
distance. Syntactic data is extractable from raw text, so it is much
easier to build a syntactic corpus. But this implies an associated
drop in manual linguistic processing of the data. As a result, the
principal difference between present phonological and syntactic
corpora is that phonology data is word-aligned, while syntax data is
not sentence-aligned. Automatically constructed syntactic corpora
lead naturally to statistical measures over large amounts of data
rather than more sensitive measures that operate on small corpora.
\subsubsection{Nerbonne and Wiersma}
\label{nerbonne06}
Due to the lack of alignment between the
larger corpora available for syntactic analysis, a statistical
comparison of differences is more appropriate than the simple
symbolic approach possible with the word-aligned corpora used in
phonology. This statistical approach means that a syntactic distance
measure will have to use counting as its basis.
\namecite{nerbonne06} proposed an early method for measuring syntactic
distance. It models syntax by part-of-speech (POS) trigrams and uses
differences between trigram type counts in a permutation test of
significance. This method was extended by \namecite{sanders07}, who
used \quotecite{sampson00} leaf-ancestor paths as an alternate basis
for building the model.
The heart of the measure is simple: the sum of the absolute differences
in type counts over the combined types of two corpora. \namecite{kessler01}
originally proposed this measure, the {\sc Recurrence}
metric ($R$):
\begin{equation}
R = \sum_i |c_{ai} - c_{bi}|
\label{rmeasure}
\end{equation}
\noindent{}Given two corpora $a$ and $b$, $c_a$ and $c_b$ are the type
counts. $i$ ranges over all types, so $c_{ai}$ and $c_{bi}$ are the
type counts of corpora $a$ and $b$ for type $i$. $R$ is designed to
represent the amount of variation exhibited by the two corpora while
the contribution of individual types remains transparent to aid later
analysis.
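
A small worked example may make the computation concrete. The three
trigram types and their counts are invented for illustration and do not
come from any corpus discussed here.
\begin{center}
\begin{tabular}{lccc}
Type $i$ & $c_{ai}$ & $c_{bi}$ & $|c_{ai} - c_{bi}|$ \\
\hline
Det-N-V & 3 & 1 & 2 \\
N-V-Adv & 0 & 2 & 2 \\
V-Det-N & 5 & 5 & 0 \\
\end{tabular}
\end{center}
\begin{equation}
R = 2 + 2 + 0 = 4
\end{equation}
Because each type's contribution is kept separate until the final sum,
the types can be ranked by $|c_{ai} - c_{bi}|$ to see which ones drive
the distance.
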
To account for differences in corpus size, sampling with replacement is
used. In addition, the samples are normalized to account for
differences in sentence length and complexity. Unfortunately, even normalized, the
measure does not indicate whether its results are significant; a
permutation test is needed for that.
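
The logic of such a test can be summarized in one formula. This is the
generic form of a permutation test, not necessarily the exact procedure
of \namecite{nerbonne06}; the symbols $M$, $R^{*}_{m}$, and $\hat{p}$
are introduced here for illustration. The sentences of the two corpora
are pooled and randomly re-divided $M$ times, and $R^{*}_{m}$ is the
value of $R$ computed on the $m$th random division:
\begin{equation}
\hat{p} = \frac{|\{m : R^{*}_{m} \geq R\}|}{M}
\end{equation}
A small $\hat{p}$ means that random divisions of the pooled data rarely
produce a distance as large as the one observed between the real
corpora.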
% Other ideas include training a
% model on one area and comparing the entropy (compression) of other
% areas. At this point it's unclear whether this would provide a
% comparable measure, however.
\subsubsection{Language models}
\label{syntactic-features}
\namecite{nerbonne06} argue that POS trigrams can accurately represent
at least the important parts of syntax, similar to the way chunk
parsing can capture the most important information about a
sentence. If this is true, POS trigrams are a good starting point for
a language model; they are simple and easy to obtain in a number of
ways. They can either be generated by a tagger as Nerbonne
and Wiersma did, or taken from the leaves of the trees of a
syntactically annotated corpus as \namecite{sanders07} did with the
International Corpus of English.
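
To make the trigram model concrete: assuming no padding at sentence
boundaries, a tagged sentence of $n$ words yields $n-2$ overlapping
trigrams. The sentence and tag names below are hypothetical and are not
drawn from either corpus.
\begin{center}
\begin{tabular}{ll}
Tagged sentence & the/Det dog/N barks/V loudly/Adv \\
POS trigrams & Det-N-V, N-V-Adv \\
\end{tabular}
\end{center}
Each distinct trigram is a type, and the number of times it occurs in a
corpus is its type count.
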
On the other hand, it might be better to represent the upper structure
of trees, assuming that syntax is in fact a phenomenon that extends beyond the
lexical. \quotecite{sampson00} leaf-ancestor paths provide one way to
do this: for each leaf in the tree, leaf-ancestor paths produce the
path from that leaf back to the root. Generation is simple as long as
every sibling is unique. For example, the parse tree
\[\xymatrix{
&&\textrm{S} \ar@{-}[dl] \ar@{-}[dr] &&\\
&\textrm{NP} \ar@{-}[d] \ar@{-}[dl] &&\textrm{VP} \ar@{-}[d]\\
\textrm{Det} \ar@{-}[d] & \textrm{N} \ar@{-}[d] && \textrm{V} \ar@{-}[d] \\
\textrm{the}& \textrm{dog} && \textrm{barks}\\}
\]
creates the following leaf-ancestor paths:
\begin{itemize}
\item S-NP-Det-the
\item S-NP-N-dog
\item S-VP-V-barks
\end{itemize}
For identical siblings, brackets must be inserted in the path to
disambiguate the first sibling from the second. The process is
described in \namecite{sampson00} or \namecite{sanders07};
in any case identical siblings are somewhat rare.
Sampson originally developed leaf-ancestor paths as an improved
measure of similarity between gold-standard and machine-parsed trees,
to be used in evaluating parsers. The underlying idea of a collection of
features that capture distance between trees transfers quite nicely to
this application. \namecite{sanders07} replaced POS trigrams with
leaf-ancestor paths for the ICE corpus and found improved results on
smaller corpora than Nerbonne and Wiersma had tested. The additional
precision that leaf-ancestor paths provide appears to aid in attaining
significant results.
% Another idea is supertags rather than leaf-ancestor paths. This is
% quite similar but might work better.
For dependency annotations, it is easy to adapt leaf-ancestor paths to
leaf-head paths. Here, each leaf is associated with a leaf-head path,
the path from the leaf to the head of the sentence via the
intermediate heads. For example, the same sentence, ``The dog barks'',
produces the following leaf-head paths:
\begin{itemize}
\item root-V-N-Det-the
\item root-V-N-dog
\item root-V-barks
\end{itemize}
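A leaf-head path can be written down in a similar recursive style. As
a sketch, with notation assumed here rather than taken from the cited
work, let $w_i$ be the $i$th word, $t_i$ its part-of-speech tag, and
$h(i)$ the index of its head, with $h(i) = 0$ when the head is the
root. Then
\[
\pi(0) = \textrm{root}, \qquad
\pi(i) = \pi(h(i))\textrm{-}t_i, \qquad
\textrm{path}(i) = \pi(i)\textrm{-}w_i .
\]
For ``dog'', $h$ points to ``barks'', whose head is the root, so
$\pi(\textrm{dog}) = \textrm{root-V-N}$ and the path is root-V-N-dog,
as in the list above.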
The biggest difference is in the relative length of the paths: long
leaf-ancestor paths indicate deep nesting of structure. Length is a
weaker indicator of deep structure for leaf-head
paths; sometimes a difference in length indicates only a difference in
centrality to the sentence. % or something, this is still kind of
% wrong
\[\xymatrix{
& & \textrm{root} \\
\textrm{Det} \ar@/^/[r] & \textrm{N} \ar@/^/[r] & \textrm{V} \ar@{.>}[u] \\
\textrm{the} & \textrm{dog} & \textrm{barks}
}
\]
\subsection{Previous Experiments}
\namecite{nerbonne06} were the first to use the syntactic distance
measure described above. They analyzed two corpora, both of Finnish
L2 speakers of English. The first corpus was gathered from speakers
who learned English after childhood and the second was gathered from
speakers who learned English as children. Nerbonne \& Wiersma found a
significant difference between the two corpora. The trigrams that
contributed most to the difference were those from the older learners'
corpus that are unexpected in English. For example, the trigram
COP-ADJ-N/COM is not common in English because a noun phrase following
a copula typically begins with a determiner. Other trigrams indicate
hypercorrection on the part of the older learners; they also appear in
the younger learners' corpus, but less often. Nerbonne \& Wiersma
analyzed these patterns as interference from Finnish: the speakers who
learned English as children learned it more completely, with less
interference from Finnish.
Subsequent work by \namecite{sanders07} and \namecite{sanders08b}
expanded on this experiment in two ways. First, it introduced
leaf-ancestor paths as an alternative feature type. Second, it tested
the distance method on a larger set of corpora: the Government Office
Regions of England, together with Scotland and Wales, for a total of
11 corpora. Each was smaller than Nerbonne \& Wiersma's L2 corpora, so the
permutation test parameters had to be adjusted for some feature
combinations.
The distances between regions were clustered using hierarchical
agglomerative clustering, as described in section~\ref{cluster-analysis}.
The resulting tree showed a North/South
distinction with some unexpected differences from previously
hypothesized dialect boundaries; for example, the
Northwest region clustered with the Southwest region. This contrasted
with the clustered phonological distances also produced in
\namecite{sanders08b}. In that experiment,
there was no significant correlation between the inter-region
phonological distances and syntactic distances.
There are several possible reasons for this lack of correlation. The
two distance measures may find different dialect boundaries based on
differences between syntax and phonology. Dialect boundaries may have
shifted during the 40 years between the collection of the SED and the
collection of the ICE-GB. One or both methods may be measuring the
wrong thing. However, I will not investigate the relation between
phonology and syntax in this dissertation. The focus will remain on results
of computational syntax distance as compared to traditional syntactic
dialectology.
\section{Hypotheses}
% TODO: Rewrite and merge the following question/hypothesis paragraph pairs
% H1 - organization is all wrong still
The state of syntax measures in dialectometry described above leaves
several research questions unresolved. The most important for this
proposal is whether $R$ is a good measure of syntax
distance. Specifically, are the ambiguous results of previous
research due to a shortcoming of $R$, to differences between the
phonological and syntactic corpora, or to real differences between
phonological and syntactic dialect boundaries?
To investigate this, I propose Hypothesis 1: the features found by
dialectologists will agree with the highly ranked features used by $R$
for classification. I will test Hypothesis 1 by comparing $R$'s
results to the syntactic dialectology literature on Swedish. In
addition, Hypothesis 1B states that the regions of Sweden accepted by
dialectology will be reproduced by $R$. For example, my
previous research on British English \cite{sanders08b} reproduced the
well-known division between Northern and Southern England. The research
proposed here will eliminate the corpus variability that produced
the confounding factors mentioned above, so more precise
results, such as specific identifying features, should be detectable as well.
However, if $R$ is found to be an inadequate measure of syntax distance, this
dissertation will propose and evaluate alternative syntactic distance
measures. Specifically, $R$ is one way to combine features that are
created by decomposing sentences. It treats features as atomic, and
does not manipulate them in syntax-specific ways. As such, $R$ is
not fundamentally different from Goebl's WIV. This may not be a problem if the
decomposition methods used to generate features adequately capture
dialect differences in independent, atomic features. If dialect
differences cannot be captured by independent, atomic features, then a
more syntax-specific method of combining features will be needed
instead. Alternatively, a more complex statistical measure may be
useful, taking the basic idea of $R$ and increasing its
sensitivity. For example, Kullback-Leibler divergence, like $R$,
provides a dissimilarity that is intuitively similar to distance.
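For reference, the Kullback-Leibler divergence between two
distributions $P$ and $Q$ over the same feature types $i$ is
\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} .
\]
It is zero only when $P = Q$, but it is not symmetric and is undefined
when some $Q(i) = 0$ while $P(i) > 0$. Using it as a distance between
regions would therefore presumably require smoothing and a symmetrized
variant such as $D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P)$;
that choice is an assumption here, not something fixed by this proposal.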
%H2 - Dad didn't understand that this is other features to be fed into
%R not replacement of R entire.
A secondary question, relevant once a useful syntax distance measure
is established, is what input features cause $R$ to produce the best
results. Previous work has shown that leaf-ancestor paths provide a
small advantage over part-of-speech trigrams, presumably by capturing
syntactic structure higher in the parse tree. Additional possible
feature sets include variations on the previously investigated
trigrams and leaf-ancestor paths, along with various kinds of backoff,
for example, to bigrams or coarser node tags. Features from dependency
parses may be useful, too, in capturing non-local dependencies that
can be captured neither by trigrams nor by leaf-ancestor paths.
Therefore, I propose Hypothesis 2: better input features
for $R$ will produce more accurate syntax
distances. These features can be discovered by comparing performance
of a number of different feature sets on a fixed corpus. In addition,
combinations of successful features will produce even better
performance.
% This sentence is either redundant or should appear earlier.
The quality of a set of features can be
measured by its sensitivity---the number of significant distances it
finds---and the similarity of the highly ranked features $R$ produces
to those found by dialectologists.
% H3
A third question is whether $R$ agrees with phonological distance
measures like Levenshtein distance. Unlike agreement with traditional
dialectology, there is no {\it a priori} reason to expect agreement
between phonology and syntax in delineating dialect
boundaries. However, agreement with phonological distance would be
further evidence for the suitability of $R$ as a syntactic distance
measure.
Therefore, I propose Hypothesis 3: a phonology corpus and syntax corpus
constructed from the same data will provide better correlation between
phonology and syntax distance measures than a phonology corpus and
syntax corpus drawn from different data sources. I will test this
hypothesis by comparing results to my previous work on British English
phonological and syntactic corpora; there, no significant correlation
was found between the regions extracted from the two corpora drawn
from different data. If significant correlation is found by using the
same set of data for both corpora, it indicates that phonology and
syntax boundaries do coincide but that the agreement is weak enough to
be lost when using corpora collected from different populations.
\section{Methods}
To investigate the first hypothesis, I need a dialect corpus that can
be syntactically annotated (\ref{syntactically-annotated-corpus}); if
it is not already annotated, it must be possible to annotate it
automatically so I can avoid time-consuming manual annotation.
Automatic annotation will require a syntactically annotated
training corpus (\ref{syntactically-annotated-training}) and a parser
(\ref{parsers-proposal}). A distance measure must be defined for the regions
within the dialect corpus (\ref{nerbonne06}), syntactic features must
be extracted for the distance measures (\ref{syntactic-features}), and
the results tested for significance (\ref{permutationtest}) and
clustered (\ref{cluster-analysis}) to determine which dialect regions
are found by the corpus. Finally, the most highly ranked features used
to produce the dialect distances must be enumerated
(\ref{feature-ranking-proposal}).
To investigate the second hypothesis, I need a method to combine
different types of features (\ref{combine-feature-sets}) and back off
sparse features (\ref{feature-backoff}). I also need a way to generate
new features that include more information about context
(\ref{alternate-feature-sets}).
If the distance measure $R$ does not produce significant distances
with any combination of features, I will experiment with different distance
measures. There are quite a few possibilities;
Kullback-Leibler divergence is one example (\ref{kl-divergence}).
To investigate the third hypothesis, I need a phonological corpus and a method
for calculating phonological dialect distance, then a method to compare
phonological clusters with syntactic clusters. See my qualifying paper
\cite{sanders08b} for details.
\subsection{SweDiaSyn} % This is not a good subsection for the new organization
\label{syntactically-annotated-corpus}
The first hypothesis requires a dialect corpus that can
be syntactically annotated.
The dialect corpus used in this dissertation will be SweDiaSyn, the
Swedish part of ScanDiaSyn.
% (CITE SweDiaSyn and ScanDiaSyn,
% except that they don't seem to have any references)
% Here is a citation for ScanDiaSyn if I could track it down and
% translate it
% Vangsnes, Øystein A. 2007. ScanDiaSyn: Prosjektparaplyen Nordisk dialektsyntaks. In T. Arboe (ed.), Nordisk dialektologi og sociolingvistik, Peter Skautrup Centeret for Jysk Dialektforskning, Århus Universitet. 54-72.
SweDiaSyn is a transcription of SweDia 2000 \cite{bruce99}, which was
collected between 1998 and 2000 from 97 locations in Sweden and 10 in
Finland. Each location has 12 interviewees, three in each of four
groups (older male, older female, younger male and younger female),
each recorded in a 30-minute interview.
However, the SweDiaSyn transcriptions do not yet include all of SweDia
2000; the completed transcriptions currently focus on older
speakers.
Currently there are 36,713 sentences of transcribed speech
from 49 sites, an average of 749 sentences per site.
However, the sites range from 110 to 1780 sentences because some sites
have fewer complete transcriptions than others. In order to detect
significant differences, the sites may need to be grouped by county,
traditional province or EU region; previous work on British English
used EU Government Office Regions with at least 850 sentences per
region. For example, grouping the Swedish corpora into the 25 provinces
boosts the average sentences per province to 1254, excluding provinces
with no transcriptions.
% TODO: Probably switch the second sentence to be first? It's the more
% important but might completely depend on details in the first.
In SweDiaSyn, there are two types of transcription:
standard Swedish orthography, with glosses for words
not in standard Swedish, and a phonetic transcription for dialects
that differ greatly from standard Swedish. For this dissertation,
the orthographic/gloss transcription will be used so that lexical
items will be comparable across dialects.
\subsection{Talbanken}
\label{syntactically-annotated-training}
Because the first hypothesis requires a syntactically annotated
corpus, and because SweDiaSyn consists of untagged lexical items,
Talbanken05, a syntactically annotated corpus, will be used to train a
POS tagger and parsers, which will then annotate SweDiaSyn. Talbanken05
is a treebank of written and transcribed spoken Swedish, roughly
300,000 words in size. It is an updated version of Talbanken76
\cite{nivre06}. Talbanken76's trees are annotated following a custom
scheme called MAMBA; Talbanken05 adds phrase structure annotation and
dependency annotation in the standard annotation formats TIGER-XML
and Malt-XML. In addition to syntactic annotation, Talbanken is
lexically annotated for morphology and part of speech.
% TODO: Should I keep this? It depends on how much detail I want.
% Talbanken's sources are X and Y and Z. It attempts to provide a
% valid sample of the Swedish language, both spoken and written. The
% spoken section is transcribed from conversation, interviews and
% debates, and the written section is taken from high school essays and
% professional prose (TODO:I could probably cite
% Jan Einarsson. 1976. Talbankens skriftspraakskonkordans. Lund
% University: Department of Scandinavian Languages (and
% talspraakskonkordans) IF I could legitimately claim that I got the
% information from there\ldots{} but of course I got it from
% spraakbanken.gu.se/om/eng/index.html actually.
\subsection{Parsing}
\label{parsers-proposal}
%% Um, this seems useful, but I'm not sure why I put it here...
%% TODO: Figure out what this is for and use it somewhere?
% In order to investigate hypothesis 1, I will need to produce features
% to give to the classifier. These features should reflect the syntax of
% the speech of the interviewees. Following Nerbonne and Wiersma 2006, I
% will start with parts of speech, then add the leaf-ancestor paths that
% I tried on the ICE-GB, and finally add dependency-ancestor paths that
% are new. Probably one sentence more each on tagging, dependency and
% constituency parsing.
% (NOTE: Insert paragraphs on tagging and dependency and constituency
% parsing before Talbanken discussion)
% :
In order to extract the features used to build the language models
described in the previous sections, SweDiaSyn will need to be POS
tagged and parsed. For this dissertation, both constituency
and dependency features will be provided to the classifier.
The Trigrams'n'Tags (TnT) tagger \cite{brants00} will be used for tagging, with
the POS annotations from Talbanken05 used as training data.
After POS tagging, the Talbanken sentences will be cleaned in order to
be usable for training the parsers.
Cleaning Talbanken's constituency annotations consists of removing
discontinuities of various types, especially disfluencies and
restarts, which may be reparable by a simple top-level strategy. If
more complicated uncrossing is needed, a strategy similar to the split
constituents proposed by \namecite{boyd07} may be needed.
For constituency parsing, the Berkeley parser \cite{petrov08} will be
trained on standard Swedish, again from Talbanken05. The Berkeley
parser has shown good performance on languages other than English,
which is not common for constituency parsers.
% TODO: CITE The paper that shows this. Also EXPLAIN it.
For dependency parsing, MaltParser will be used with the existing
Swedish model trained on Talbanken05 by Hall, Nilsson and
Nivre. MaltParser is an inductive dependency parser that uses a
machine learning algorithm to guide the parser at choice points
\cite{nivre06b}. Dependency parsing will proceed similarly to
constituency parsing; the dependency structures of Talbanken05 will be
cleaned and normalized, then used to train a parser.
% TODO: Find out how much crossing occurs in Swedish corpora, and how
% much of it is from interruptions and self-corrections.
\subsection{Permutation test}
\label{permutationtest}
The first hypothesis requires that the distances produced by a
distance measure be checked for significance; it is possible that
there may not be enough data for two regions to adequately distinguish
them from each other. A permutation test detects whether two corpora are
significantly different on the basis of the $R$ measure
described in section \ref{nerbonne06}. The test first calculates $R$
between samples of the two corpora. Then the corpora are mixed
together and $R$ is calculated between two samples drawn from the
mixed corpus. If the two corpora are different, $R$ should be larger
between the samples of the original corpora than $R$ from the mixed
corpus: any real differences will be randomly redistributed by the
mixing process, lowering the mixed $R$. Repeating this comparison
enough times will show if the difference is significant. Twenty
repetitions are the minimum needed to detect significance at $p < 0.05$;
however, in the experiments, I will repeat the test 100
times, enough to detect significance at $p < 0.01$.
To see how this works, assume that $R$ detects real
differences between the two British regions London
and Scotland such that $R(\textrm{London},\textrm{Scotland}) =
100$. The permutation test then mixes London and Scotland to
create LSMixed and splits it into two pieces. Since the real
differences are now mixed between the two shuffled corpora, we
would expect $R(\textrm{LSMixed}_1, \textrm{LSMixed}_2) < 100$.
This should be true at least 95\% of the time for the distance $100$
to be significant.
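The decision rule can be stated as a formula. As a sketch, with
notation introduced here for illustration, let $R_{\mathrm{obs}}$ be
the value of $R$ between the original corpora and
$R^{(1)},\ldots,R^{(N)}$ the values from $N$ mixed splits. The
estimated $p$-value is the fraction of mixed splits that reach the
observed value:
\[
\hat{p} = \frac{1}{N} \, \#\{\, i : R^{(i)} \geq R_{\mathrm{obs}} \,\} .
\]
Because $\hat{p}$ moves in steps of $1/N$, $N = 20$ gives a resolution
of $0.05$ and $N = 100$ a resolution of $0.01$. In the London and
Scotland example, if hypothetically $3$ of $100$ mixed splits gave
$R \geq 100$, then $\hat{p} = 0.03$ and the distance would be
significant at the $0.05$ level.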
%% I don't think normalization is important enough to mention if I
%% have to add all the sections from the H2/H3.
% \subsection{Normalization}
% Afterward, the distance must be normalized to account for two things:
% the length of sentences in the corpus and the amount of variety in the
% corpus. If sentence length differs too much between corpora, there
% will be consistently lower token counts in one corpus, which would
% cause a spuriously large $R$. In addition, if one corpus has less
% variety than the other, it will have inflated type counts, because
% more tokens will be allocated to fewer types. To avoid
% this, all tokens are scaled by the average number of types per token
% across both corpora: $2n/N$ where $n$ is the type count and $N$ is
% the token count. The factor $2$ is necessary because the scaling
% occurs based on the token counts of the two corpora combined.
% this next subsection might need to be changed or deleted
\subsection{Cluster Analysis and Correlation}
\label{cluster-analysis}
The first hypothesis requires a clustering method to allow
inter-region distances to be compared more easily. The dendrogram that
binary hierarchical clustering produces allows easy visual comparison
of the most similar regions.
Correlation is also useful to find out how similar the two methods'
predictions are. Because of the connected nature of the inter-region
distances, Mantel's test is necessary to ensure that the correlation
is significant. Mantel's test is a permutation test, much like the
permutation test described for $R$. One distance result set is
permuted repeatedly and at each step correlated with the other
set. The original correlation is significant if the permuted
correlation is lower than the original correlation more than 95\% of
the time.
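As a sketch of the computation, with notation assumed here for
illustration, let $A$ and $B$ be the two $n \times n$ inter-region
distance matrices and let $r(A,B)$ be the Pearson correlation of their
$n(n-1)/2$ above-diagonal entries. For a permutation $\sigma$ of the
regions, define $B^{\sigma}_{ij} = B_{\sigma(i)\sigma(j)}$, so that
rows and columns are permuted together. With $N$ random permutations
$\sigma_1,\ldots,\sigma_N$,
\[
\hat{p} = \frac{1}{N} \, \#\{\, k : r(A, B^{\sigma_k}) \geq r(A,B) \,\} .
\]
Permuting whole regions rather than individual entries is what
respects the dependence among distances that share a region. For the
$11$ British regions of the earlier work, each matrix contributes
$11 \cdot 10 / 2 = 55$ distances to the correlation.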
\subsection{Feature Ranking}
\label{feature-ranking-proposal}
% TODO: THIS is hypothesis 1B and I left it out!
Feature ranking is needed for the first hypothesis so that the results
of $R$ can be compared qualitatively to the Swedish dialectology
literature; $R$'s most important features should be similar to those
discussed most by dialectologists when comparing regions. Feature
ranking for $R$ is quite simple for one-to-one region comparisons;
each feature's normalized weight is equal to its importance in
determining the distance between the two regions. The most important
features between two sets of regions can be obtained by averaging the
importance of each feature between all (first-set, second-set) region
pairs. This more
% (There is a nice equation lurking in here
% somewhere that I may want to avoid nonetheless.)
complicated technique is needed to relate the results from
the computational distance measures with the features that
dialectologists discuss relative to areas of Sweden larger than
individual provinces or counties.
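The all-pairs average can be written out as follows. This is only a
sketch of the idea in the paragraph above, with notation assumed for
illustration: let $w_f(a,b)$ be the normalized weight of feature $f$
in the distance between regions $a$ and $b$, and let $A$ and $B$ be
the two sets of regions. Then the importance of $f$ between the sets is
\[
I_f(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} w_f(a,b) ,
\]
and features are ranked by decreasing $I_f(A,B)$. When $A$ and $B$
each contain a single region, this reduces to the one-to-one ranking
by $w_f(a,b)$.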
%Note: All this is speculative. I have no code for this and I'm pretty
%sure the all-pairs average solution is not quite right
\subsection{Combining Feature Sets}
\label{combine-feature-sets}
% maybe use 'kind of feature' instead of 'type of feature'. it's fuzzier
In order to investigate hypothesis 2, I will need a method for
combining features of different types. Here, the obvious approach
of combining feature types linearly should suffice. Within a single
type of feature, such as POS tag or leaf-ancestor path,
there is already redundant information about lexical items and tree
structure, so combining two types does not introduce a new kind of
redundancy that would need to be taken into account.
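One way to make ``linearly'' concrete, offered as an assumption rather
than a settled design, is a weighted sum of the per-type distances.
Writing $R_t(a,b)$ for the distance between regions $a$ and $b$
computed from feature type $t$ alone, and $\lambda_t \geq 0$ for a
hypothetical weight on that type,
\[
R_{\mathrm{comb}}(a,b) = \sum_{t} \lambda_t \, R_t(a,b) .
\]
Setting every $\lambda_t = 1$ simply adds the per-type distances;
unequal weights would only be needed if one type turned out to
dominate the sum.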
% TODO: Last sentence still sucks
% TODO: Notes from last presentation:
% TnT has a 'mark unknown words' option.
% Pass lexical items to Berkeley parser (or make sure both training and test are
% the same)
% check whether Berkeley parser has POS-tagging option
\subsection{Feature Backoff}
\label{feature-backoff}
% If I can't find a certain frequency of trigrams then backoff to
% bigram
% Thorsten Brants - deleted interpolation
% Martin Volk inter-type backoff (if X type information isn't
% available, then use Y type instead) (2000, 2001 or 2002)
% Scaling Up ...; Exploiting the WWW ...; Combining Unsupervised ...;
In order to investigate hypothesis 2, I will need a framework for
backing off sparse features. For backoff within a single type of
feature, I will use deleted interpolation \cite{jurafskymartin}. The
trigram, bigram and unigram counts used for training will
come from Talbanken.
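In the trigram case, deleted interpolation estimates the probability
of a tag $t_3$ following $t_1\, t_2$ as a weighted mixture of the
unigram, bigram and trigram maximum likelihood estimates $\hat{P}$:
\[
P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2)
  + \lambda_3 \hat{P}(t_3 \mid t_1, t_2), \qquad
\lambda_1 + \lambda_2 + \lambda_3 = 1 .
\]
The weights $\lambda_i$ are set from the training counts by removing
each trigram in turn and crediting whichever order of model predicts
it best. The formula is stated for tag trigrams; applying it to other
feature types would need an analogous chain of coarser features, which
is an assumption beyond what is specified here.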
For backoff between types of features, I will use ranked combinations