
Commit a9a235b

committed
hash_grady_pos added to provide a lookup of Grady's parts of speech for words.
1 parent e476a2b commit a9a235b

File tree

11 files changed, +216 -49 lines changed

DESCRIPTION

Lines changed: 3 additions & 3 deletions
@@ -1,13 +1,13 @@
 Package: lexicon
-Title: Lexicons
-Version: 0.1.0
+Title: Lexicons for Text Analysis
+Version: 0.1.1
 Authors@R: c(person("Tyler", "Rinker", email =
     "[email protected]", role = c("aut", "cre")))
 Maintainer: Tyler Rinker <[email protected]>
 Description: A collection of lexical hash tables, dictionaries, and
     word lists.
 Depends: R (>= 3.2.2)
-Date: 2017-01-15
+Date: 2017-01-28
 License: MIT + file LICENSE
 LazyData: TRUE
 Roxygen: list(wrap = FALSE)

NEWS

Lines changed: 10 additions & 8 deletions
@@ -17,24 +17,26 @@ And constructed with the following guidelines:
 * Bug fixes and misc changes bumps the patch
 
 
-lexicon 0.1.0 -
+lexicon 0.1.1 -
 ----------------------------------------------------------------
 
-BUG FIXES
-
 NEW FEATURES
 
-* The `ratings` and `grades` keys from *sentimentr* have been moved to the
-*lexicon* package and renamed to `key_rating` and `key_grade`.
+* `hash_grady_pos` added to provide a lookup of Grady's parts of speech for words.
+
 
-MINOR FEATURES
+lexicon 0.1.0
+----------------------------------------------------------------
+
+NEW FEATURES
+
+* The `ratings` and `grades` keys from **sentimentr** have been moved to the
+**lexicon** package and renamed to `key_rating` and `key_grade`.
 
 IMPROVEMENTS
 
 * Added the positive terms 'spot on', 'on time', & 'on point' to `hash_sentiment`.
 
-CHANGES
-
 
 lexicon 0.0.1
 ----------------------------------------------------------------

NEWS.md

Lines changed: 10 additions & 8 deletions
@@ -17,24 +17,26 @@ And constructed with the following guidelines:
 * Bug fixes and misc changes bumps the patch
 
 
-lexicon 0.1.0 -
+lexicon 0.1.1 -
 ----------------------------------------------------------------
 
-**BUG FIXES**
-
 **NEW FEATURES**
 
-* The `ratings` and `grades` keys from *sentimentr* have been moved to the
-*lexicon* package and renamed to `key_rating` and `key_grade`.
+* `hash_grady_pos` added to provide a lookup of Grady's parts of speech for words.
+
 
-**MINOR FEATURES**
+lexicon 0.1.0
+----------------------------------------------------------------
+
+**NEW FEATURES**
+
+* The `ratings` and `grades` keys from **sentimentr** have been moved to the
+**lexicon** package and renamed to `key_rating` and `key_grade`.
 
 **IMPROVEMENTS**
 
 * Added the positive terms 'spot on', 'on time', & 'on point' to `hash_sentiment`.
 
-**CHANGES**
-
 
 lexicon 0.0.1
 ----------------------------------------------------------------

R/hash_grady_pos.R

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+#' Grady Ward's Moby Parts of Speech
+#'
+#' A dataset containing a hash lookup of Grady Ward's parts of speech from the
+#' Moby project. Words with non-ASCII characters have been removed.
+#'
+#' @details
+#' \itemize{
+#' \item word. The word.
+#' \item pos. The part of speech; one of: \code{Adjective}, \code{Adverb}, \code{Conjunction}, \code{Definite Article}, \code{Interjection}, \code{Noun}, \code{Noun Phrase}, \code{Plural}, \code{Preposition}, \code{Pronoun}, \code{Verb (intransitive)}, \code{Verb (transitive)}, or \code{Verb (usu participle)}. Note that the first part of speech listed for a word is its primary use; all other uses are secondary.
+#' \item n_pos. The number of parts of speech associated with a word. Useful for filtering.
+#' \item space. logical. If \code{TRUE} the word contains a space. Useful for filtering.
+#' \item primary. logical. If \code{TRUE} the row is the word's primary part of speech.
+#' }
+#'
+#' @docType data
+#' @keywords datasets
+#' @name hash_grady_pos
+#' @usage data(hash_grady_pos)
+#' @format A data frame with 250,892 rows and 5 variables
+#' @source \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @references Moby Thesaurus List by Grady Ward: \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @examples
+#' \dontrun{
+#' library(data.table)
+#'
+#' hash_grady_pos['dog']
+#' hash_grady_pos[primary == TRUE, ]
+#' hash_grady_pos[primary == TRUE & space == FALSE, ]
+#' }
+NULL
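
A minimal usage sketch of the new lookup, assuming the lexicon and data.table packages are installed; it only exercises the columns documented in the roxygen block above:

    library(lexicon)
    library(data.table)

    data(hash_grady_pos)

    ## keyed data.table lookup: every part of speech recorded for 'dog',
    ## primary use listed first
    hash_grady_pos['dog']

    ## reduce to one row per word (its primary part of speech)
    primary_only <- hash_grady_pos[primary == TRUE, ]

    ## single-token entries only: drop multi-word Noun Phrases and the like
    hash_grady_pos[space == FALSE, ]
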

R/lexicon-package.R

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-#' Lexicons
+#' Lexicons for Text Analysis
 #'
 #' A collection of lexical hash tables, dictionaries, and word lists.
 #' @docType package

README.md

Lines changed: 28 additions & 24 deletions
@@ -8,7 +8,7 @@ developed.](http://www.repostatus.org/badges/0.1.0/active.svg)](http://www.repos
 [![Build
 Status](https://travis-ci.org/trinker/lexicon.svg?branch=master)](https://travis-ci.org/trinker/lexicon)
 [![](http://cranlogs.r-pkg.org/badges/lexicon)](https://cran.r-project.org/package=lexicon)
-<a href="https://img.shields.io/badge/Version-0.1.0-orange.svg"><img src="https://img.shields.io/badge/Version-0.1.0-orange.svg" alt="Version"/></a>
+<a href="https://img.shields.io/badge/Version-0.1.1-orange.svg"><img src="https://img.shields.io/badge/Version-0.1.1-orange.svg" alt="Version"/></a>
 </p>
 <img src="inst/lexicon_logo/r_lexicon.png" width="135" alt="lexicon Logo">
 
@@ -111,98 +111,102 @@ Data
 <td align="left"><p>Emoticons</p></td>
 </tr>
 <tr class="odd">
+<td align="left"><p><code>hash_grady_pos</code></p></td>
+<td align="left"><p>Grady Ward's Moby Parts of Speech</p></td>
+</tr>
+<tr class="even">
 <td align="left"><p><code>hash_power</code></p></td>
 <td align="left"><p>Power Lookup Key</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>hash_sentiment</code></p></td>
 <td align="left"><p>Polarity Lookup Key</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>hash_sentiment_nrc</code></p></td>
 <td align="left"><p>NRC Sentiment Lookup Key</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>hash_sentiword</code></p></td>
 <td align="left"><p>Augmented Sentiword</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>hash_strength</code></p></td>
 <td align="left"><p>Strength Lookup Key</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>hash_syllable</code></p></td>
 <td align="left"><p>Syllable Counts</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>hash_valence_shifters</code></p></td>
 <td align="left"><p>Valence Shifters</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>key_abbreviation</code></p></td>
 <td align="left"><p>Common Abbreviations</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>key_contractions</code></p></td>
 <td align="left"><p>Contraction Conversions</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>key_grade</code></p></td>
 <td align="left"><p>Grades Hash</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>key_rating</code></p></td>
 <td align="left"><p>Ratings Data Set</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>nrc_emotions</code></p></td>
 <td align="left"><p>NRC Emotions</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>pos_action_verb</code></p></td>
 <td align="left"><p>Action Word List</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>pos_adverb</code></p></td>
 <td align="left"><p>Adverb Word List</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>pos_df_pronouns</code></p></td>
 <td align="left"><p>Pronouns</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>pos_interjections</code></p></td>
 <td align="left"><p>Interjections</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>pos_preposition</code></p></td>
 <td align="left"><p>Preposition Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_buckley_salton</code></p></td>
 <td align="left"><p>Buckley &amp; Salton Stopword List</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>sw_dolch</code></p></td>
 <td align="left"><p>Leveled Dolch List of 220 Common Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_fry_100</code></p></td>
 <td align="left"><p>Fry's 100 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>sw_fry_1000</code></p></td>
 <td align="left"><p>Fry's 1000 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_fry_200</code></p></td>
 <td align="left"><p>Fry's 200 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>sw_fry_25</code></p></td>
 <td align="left"><p>Fry's 25 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_onix</code></p></td>
 <td align="left"><p>Onix Text Retrieval Toolkit Stopword List 1</p></td>
 </tr>

data/hash_grady_pos.rda

1.83 MB
Binary file not shown.

inst/CITATION

Lines changed: 4 additions & 4 deletions
@@ -6,11 +6,11 @@ citEntry(entry = "manual",
 author = "Tyler W. Rinker",
 organization = "University at Buffalo/SUNY",
 address = "Buffalo, New York",
-note = "version 0.1.0",
-year = "2016",
+note = "version 0.1.1",
+year = "2017",
 url = "http://github.com/trinker/lexicon",
-textVersion = paste("Rinker, T. W. (2016).",
+textVersion = paste("Rinker, T. W. (2017).",
 "lexicon: Lexicon Data",
-"version 0.1.0. University at Buffalo. Buffalo, New York.",
+"version 0.1.1. University at Buffalo. Buffalo, New York.",
 "http://github.com/trinker/lexicon")
 )
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+if (!require("pacman")) install.packages("pacman")
+pacman::p_load(tidyverse, textshape, textreadr, data.table, clipr)
+pacman::p_load_current_gh('trinker/acc.roxygen2')
+
+## read in the tar file
+loc <- 'http://www.dcs.shef.ac.uk/research/ilash/Moby/mpos.tar.Z' %>%
+    download()
+
+## untar the file
+untar(loc, exdir = dirname(loc))
+
+
+## read in the .txt files (words and readme pos lookup key)
+mobyr <- readLines(file.path(dirname(loc), 'mpos/mobyposi.i'))
+readme <- readLines(file.path(dirname(loc), 'mpos/readme'))
+
+## part of speech symbol lookup key
+pos_key <- readme %>%
+    {grep("\t|\\s{3,}[A-Z]$", ., value = TRUE)} %>%
+    trimws() %>%
+    stringi::stri_replace_all_regex('\\s{3,}', '\t') %>%
+    stringi::stri_replace_all_regex('(\t)+', '\t') %>%
+    {read.csv(text = ., sep = "\t", header = FALSE, stringsAsFactors = FALSE)} %>%
+    setNames(c('pos', 'tag'))
+
+## create the words to parts of speech lexicon
+hash_grady_pos <- mobyr %>%
+    data_frame(x = .) %>%
+    extract(x, c('word', 'tag'), '(^[^×]+?)×(.+$)') %>%
+    mutate(
+        word = tolower(word),
+        n_pos = nchar(tag),
+        tag = stringi::stri_split_regex(tag, "(?=.)(?<=.)")
+    ) %>%
+    unnest() %>%
+    left_join(pos_key, by = 'tag') %>%
+    filter(!grepl("[^ -~]", word)) %>%
+    mutate(space = grepl("\\s", word)) %>%
+    select(word, pos, n_pos, space) %>%
+    as.data.table()
+
+setkey(hash_grady_pos, 'word')
+
+
+uDT <- unique(hash_grady_pos)
+hash_grady_pos[, "primary" := FALSE]
+hash_grady_pos[uDT, primary := TRUE, mult = "first"][]
+
+
+## test hash
+hash_grady_pos['dog']
+
+hash_grady_pos[pos == 'Pronoun', ]
+table(hash_grady_pos$pos)
+write_clip(capture.output(acc.roxygen2::dat4rox(hash_grady_pos)))
+write_clip(paste(paste0("\\code{", names(table(hash_grady_pos$pos)), "}"), collapse = ", "))
+
+pax::new_data(hash_grady_pos)
+
+
+#' Grady Ward's Moby Parts of Speech
+#'
+#' A dataset containing a hash lookup of Grady Ward's parts of speech from the
+#' Moby project. Words with non-ASCII characters have been removed.
+#'
+#' @details
+#' \itemize{
+#' \item word. The word.
+#' \item pos. The part of speech; one of: \code{Adjective}, \code{Adverb}, \code{Conjunction}, \code{Definite Article}, \code{Interjection}, \code{Noun}, \code{Noun Phrase}, \code{Plural}, \code{Preposition}, \code{Pronoun}, \code{Verb (intransitive)}, \code{Verb (transitive)}, or \code{Verb (usu participle)}. Note that the first part of speech listed for a word is its primary use; all other uses are secondary.
+#' \item n_pos. The number of parts of speech associated with a word. Useful for filtering.
+#' \item space. logical. If \code{TRUE} the word contains a space. Useful for filtering.
+#' \item primary. logical. If \code{TRUE} the row is the word's primary part of speech.
+#' }
+#'
+#' @docType data
+#' @keywords datasets
+#' @name hash_grady_pos
+#' @usage data(hash_grady_pos)
+#' @format A data frame with 250,892 rows and 5 variables
+#' @source \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @references Moby Thesaurus List by Grady Ward: \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @examples
+#' hash_grady_pos['dog']
+#' hash_grady_pos[, .SD[1], by='word']
+NULL
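
The primary flag is the one non-obvious step in the script above: setkey sorts by word with a stable sort, so each word's original first row (its primary part of speech, per Moby's ordering) stays first within its key group; unique() then keeps one row per word, and the keyed join with mult = "first" marks exactly that row. Note the script relies on unique() defaulting to the table's key; newer data.table releases default to all columns, so the toy sketch below passes by = 'word' explicitly (the words and tags are made up, and only data.table is assumed):

    library(data.table)

    toy <- data.table(
        word = c('run', 'run', 'run', 'dog', 'dog'),
        pos  = c('Verb (intransitive)', 'Noun', 'Adjective', 'Noun', 'Verb (transitive)')
    )
    setkey(toy, word)  # stable sort: within-word order is preserved

    toy[, primary := FALSE]
    ## unique() keeps the first row per word; mult = 'first' then
    ## updates exactly that row in the full table
    toy[unique(toy, by = 'word'), primary := TRUE, mult = 'first']

    toy[primary == TRUE]
    ##    word                 pos primary
    ## 1:  dog                Noun    TRUE
    ## 2:  run Verb (intransitive)    TRUE
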