Commit accacc4

* hash_lemmas added to provide a lookup of Mechura's lemmatization list.
1 parent a9a235b commit accacc4

File tree

8 files changed

+336 −83 lines changed

DESCRIPTION

+1 −1

@@ -7,7 +7,7 @@ Maintainer: Tyler Rinker <[email protected]>
 Description: A collection of lexical hash tables, dictionaries, and
     word lists.
 Depends: R (>= 3.2.2)
-Date: 2017-01-28
+Date: 2017-02-12
 License: MIT + file LICENSE
 LazyData: TRUE
 Roxygen: list(wrap = FALSE)

NEWS

+1

@@ -24,6 +24,7 @@ NEW FEATURES
 
 * `hash_grady_pos` added to provide a lookup of Grady's parts of speech for words.
 
+* `hash_lemmas` added to provide a lookup of Mechura's lemmatization list.
 
 lexicon 0.1.0
 ----------------------------------------------------------------

NEWS.md

+1

@@ -24,6 +24,7 @@ lexicon 0.1.1 -
 
 * `hash_grady_pos` added to provide a lookup of Grady's parts of speech for words.
 
+* `hash_lemmas` added to provide a lookup of Mechura's lemmatization list.
 
 lexicon 0.1.0
 ----------------------------------------------------------------

R/hash_lemmas.R

+20

@@ -0,0 +1,20 @@
+#' Lemmatization List
+#'
+#' A dataset based on M\u{11B}chura's (2016) English lemmatization list. This
+#' data set can be useful for join-style lemma replacement of inflected token
+#' forms with their root lemmas. While this is not a true morphological
+#' analysis, this style of lemma replacement is fast and typically still robust.
+#'
+#' @details
+#' \itemize{
+#'   \item token. An inflected token with affixes
+#'   \item lemma. A base form
+#' }
+#'
+#' @docType data
+#' @keywords datasets
+#' @name hash_lemmas
+#' @usage data(hash_lemmas)
+#' @format A data frame with 41,533 rows and 2 variables
+#' @references M\u{11B}chura, M. B. (2016). \emph{Lemmatization list: English (en)} [Data file]. Retrieved from \url{http://www.lexiconista.com}
+NULL
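The roxygen text above describes "join-style" lemma replacement: each inflected token is looked up in the token/lemma table and swapped for its base form when a match exists. A minimal base-R sketch of that idea, using a tiny hypothetical lemma table in place of the full 41,533-row `hash_lemmas`:

```r
# Join-style lemma replacement, sketched with a tiny hypothetical lemma
# table (the real hash_lemmas ships 41,533 token/lemma pairs).
lemmas <- data.frame(
  token = c("running", "ran", "geese", "better"),
  lemma = c("run",     "run", "goose", "good"),
  stringsAsFactors = FALSE
)

tokens <- c("the", "geese", "were", "running")

# Look each token up in the table; keep the token itself when no lemma exists.
idx <- match(tokens, lemmas$token)
replaced <- ifelse(is.na(idx), tokens, lemmas$lemma[idx])
replaced
#> [1] "the"   "goose" "were"  "run"
```

As the docs note, this is not morphological analysis: an out-of-vocabulary inflection simply passes through unchanged, which is what makes the approach fast and predictable.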

README.md

+86 −82

@@ -31,34 +31,34 @@ word lists. The data prefixes help to categorize the data types:
 <table>
 <thead>
 <tr class="header">
-<th align="left">Prefix</th>
-<th align="left">Meaning</th>
+<th>Prefix</th>
+<th>Meaning</th>
 </tr>
 </thead>
 <tbody>
 <tr class="odd">
-<td align="left"><code>key_</code></td>
-<td align="left">A <code>data.frame</code> with a lookup and return value</td>
+<td><code>key_</code></td>
+<td>A <code>data.frame</code> with a lookup and return value</td>
 </tr>
 <tr class="even">
-<td align="left"><code>hash_</code></td>
-<td align="left">A keyed <code>data.table</code> hash table</td>
+<td><code>hash_</code></td>
+<td>A keyed <code>data.table</code> hash table</td>
 </tr>
 <tr class="odd">
-<td align="left"><code>freq_</code></td>
-<td align="left">A <code>data.table</code> of terms with frequencies</td>
+<td><code>freq_</code></td>
+<td>A <code>data.table</code> of terms with frequencies</td>
 </tr>
 <tr class="even">
-<td align="left"><code>pos_</code></td>
-<td align="left">A part of speech <code>vector</code></td>
+<td><code>pos_</code></td>
+<td>A part of speech <code>vector</code></td>
 </tr>
 <tr class="odd">
-<td align="left"><code>pos_df_</code></td>
-<td align="left">A part of speech <code>data.frame</code></td>
+<td><code>pos_df_</code></td>
+<td>A part of speech <code>data.frame</code></td>
 </tr>
 <tr class="even">
-<td align="left"><code>sw_</code></td>
-<td align="left">A stopword <code>vector</code></td>
+<td><code>sw_</code></td>
+<td>A stopword <code>vector</code></td>
 </tr>
 </tbody>
 </table>
@@ -73,142 +73,146 @@ Data
 </colgroup>
 <thead>
 <tr class="header">
-<th align="left">Data</th>
-<th align="left">Description</th>
+<th>Data</th>
+<th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr class="odd">
-<td align="left"><p><code>common_names</code></p></td>
-<td align="left"><p>First Names (U.S.)</p></td>
+<td><p><code>common_names</code></p></td>
+<td><p>First Names (U.S.)</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>discourse_markers_alemany</code></p></td>
-<td align="left"><p>Alemany's Discourse Markers</p></td>
+<td><p><code>discourse_markers_alemany</code></p></td>
+<td><p>Alemany's Discourse Markers</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>dodds_sentiment</code></p></td>
-<td align="left"><p>Language Assessment by Mechanical Turk Sentiment Words</p></td>
+<td><p><code>dodds_sentiment</code></p></td>
+<td><p>Language Assessment by Mechanical Turk Sentiment Words</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>freq_first_names</code></p></td>
-<td align="left"><p>Frequent U.S. First Names</p></td>
+<td><p><code>freq_first_names</code></p></td>
+<td><p>Frequent U.S. First Names</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>freq_last_names</code></p></td>
-<td align="left"><p>Frequent U.S. Last Names</p></td>
+<td><p><code>freq_last_names</code></p></td>
+<td><p>Frequent U.S. Last Names</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>function_words</code></p></td>
-<td align="left"><p>Function Words</p></td>
+<td><p><code>function_words</code></p></td>
+<td><p>Function Words</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>grady_augmented</code></p></td>
-<td align="left"><p>Augmented List of Grady Ward's English Words and Mark Kantrowitz's Names List</p></td>
+<td><p><code>grady_augmented</code></p></td>
+<td><p>Augmented List of Grady Ward's English Words and Mark Kantrowitz's Names List</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>hash_emoticons</code></p></td>
-<td align="left"><p>Emoticons</p></td>
+<td><p><code>hash_emoticons</code></p></td>
+<td><p>Emoticons</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>hash_grady_pos</code></p></td>
-<td align="left"><p>Grady Ward's Moby Parts of Speech</p></td>
+<td><p><code>hash_grady_pos</code></p></td>
+<td><p>Grady Ward's Moby Parts of Speech</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>hash_power</code></p></td>
-<td align="left"><p>Power Lookup Key</p></td>
+<td><p><code>hash_lemmas</code></p></td>
+<td><p>Lemmatization List</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>hash_sentiment</code></p></td>
-<td align="left"><p>Polarity Lookup Key</p></td>
+<td><p><code>hash_power</code></p></td>
+<td><p>Power Lookup Key</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>hash_sentiment_nrc</code></p></td>
-<td align="left"><p>NRC Sentiment Lookup Key</p></td>
+<td><p><code>hash_sentiment</code></p></td>
+<td><p>Polarity Lookup Key</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>hash_sentiword</code></p></td>
-<td align="left"><p>Augmented Sentiword</p></td>
+<td><p><code>hash_sentiment_nrc</code></p></td>
+<td><p>NRC Sentiment Lookup Key</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>hash_strength</code></p></td>
-<td align="left"><p>Strength Lookup Key</p></td>
+<td><p><code>hash_sentiword</code></p></td>
+<td><p>Augmented Sentiword</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>hash_syllable</code></p></td>
-<td align="left"><p>Syllable Counts</p></td>
+<td><p><code>hash_strength</code></p></td>
+<td><p>Strength Lookup Key</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>hash_valence_shifters</code></p></td>
-<td align="left"><p>Valence Shifters</p></td>
+<td><p><code>hash_syllable</code></p></td>
+<td><p>Syllable Counts</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>key_abbreviation</code></p></td>
-<td align="left"><p>Common Abbreviations</p></td>
+<td><p><code>hash_valence_shifters</code></p></td>
+<td><p>Valence Shifters</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>key_contractions</code></p></td>
-<td align="left"><p>Contraction Conversions</p></td>
+<td><p><code>key_abbreviation</code></p></td>
+<td><p>Common Abbreviations</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>key_grade</code></p></td>
-<td align="left"><p>Grades Hash</p></td>
+<td><p><code>key_contractions</code></p></td>
+<td><p>Contraction Conversions</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>key_rating</code></p></td>
-<td align="left"><p>Ratings Data Set</p></td>
+<td><p><code>key_grade</code></p></td>
+<td><p>Grades Hash</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>nrc_emotions</code></p></td>
-<td align="left"><p>NRC Emotions</p></td>
+<td><p><code>key_rating</code></p></td>
+<td><p>Ratings Data Set</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>pos_action_verb</code></p></td>
-<td align="left"><p>Action Word List</p></td>
+<td><p><code>nrc_emotions</code></p></td>
+<td><p>NRC Emotions</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>pos_adverb</code></p></td>
-<td align="left"><p>Adverb Word List</p></td>
+<td><p><code>pos_action_verb</code></p></td>
+<td><p>Action Word List</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>pos_df_pronouns</code></p></td>
-<td align="left"><p>Pronouns</p></td>
+<td><p><code>pos_adverb</code></p></td>
+<td><p>Adverb Word List</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>pos_interjections</code></p></td>
-<td align="left"><p>Interjections</p></td>
+<td><p><code>pos_df_pronouns</code></p></td>
+<td><p>Pronouns</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>pos_preposition</code></p></td>
-<td align="left"><p>Preposition Words</p></td>
+<td><p><code>pos_interjections</code></p></td>
+<td><p>Interjections</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>sw_buckley_salton</code></p></td>
-<td align="left"><p>Buckley &amp; Salton Stopword List</p></td>
+<td><p><code>pos_preposition</code></p></td>
+<td><p>Preposition Words</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>sw_dolch</code></p></td>
-<td align="left"><p>Leveled Dolch List of 220 Common Words</p></td>
+<td><p><code>sw_buckley_salton</code></p></td>
+<td><p>Buckley &amp; Salton Stopword List</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>sw_fry_100</code></p></td>
-<td align="left"><p>Fry's 100 Most Commonly Used English Words</p></td>
+<td><p><code>sw_dolch</code></p></td>
+<td><p>Leveled Dolch List of 220 Common Words</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>sw_fry_1000</code></p></td>
-<td align="left"><p>Fry's 1000 Most Commonly Used English Words</p></td>
+<td><p><code>sw_fry_100</code></p></td>
+<td><p>Fry's 100 Most Commonly Used English Words</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>sw_fry_200</code></p></td>
-<td align="left"><p>Fry's 200 Most Commonly Used English Words</p></td>
+<td><p><code>sw_fry_1000</code></p></td>
+<td><p>Fry's 1000 Most Commonly Used English Words</p></td>
 </tr>
 <tr class="even">
-<td align="left"><p><code>sw_fry_25</code></p></td>
-<td align="left"><p>Fry's 25 Most Commonly Used English Words</p></td>
+<td><p><code>sw_fry_200</code></p></td>
+<td><p>Fry's 200 Most Commonly Used English Words</p></td>
 </tr>
 <tr class="odd">
-<td align="left"><p><code>sw_onix</code></p></td>
-<td align="left"><p>Onix Text Retrieval Toolkit Stopword List 1</p></td>
+<td><p><code>sw_fry_25</code></p></td>
+<td><p>Fry's 25 Most Commonly Used English Words</p></td>
+</tr>
+<tr class="even">
+<td><p><code>sw_onix</code></p></td>
+<td><p>Onix Text Retrieval Toolkit Stopword List 1</p></td>
 </tr>
 </tbody>
 </table>
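The README's prefix table above notes that `hash_` objects, including the new `hash_lemmas`, are keyed `data.table`s, i.e. tables that support fast key-based retrieval. A rough dependency-free sketch of the lookup idea, using a base-R hashed environment rather than `data.table` (the mechanism differs, but the interface concept is the same):

```r
# Illustrative hash-style token -> lemma lookup with a base-R environment.
# (The package's hash_ objects are keyed data.tables; this only mirrors
# the idea of O(1)-style retrieval by key.)
h <- new.env(hash = TRUE)
assign("geese", "goose", envir = h)
assign("running", "run", envir = h)

result <- get("geese", envir = h)
result
#> [1] "goose"
```

In practice a keyed `data.table` additionally supports vectorized joins over many tokens at once, which is what makes the join-style replacement described in `R/hash_lemmas.R` efficient.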

data/hash_lemmas.rda

288 KB
Binary file not shown.
