hash_grady_pos added to provide a lookup of Grady's parts of speech for words.
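
As a rough illustration of the lookup described above, the sketch below builds a small stand-in table with the same five columns documented in this commit and queries it by key. The name `toy_pos` and its rows are invented for illustration only; the real `hash_grady_pos` ships with the package, and the sketch assumes the data.table package is installed.

```r
library(data.table)

## Hypothetical stand-in for hash_grady_pos: same columns, made-up rows
toy_pos <- data.table(
    word    = c("dog", "dog", "hot dog", "run", "run"),
    pos     = c("Noun", "Verb (transitive)", "Noun Phrase",
                "Verb (intransitive)", "Noun"),
    n_pos   = c(2L, 2L, 1L, 2L, 2L),
    space   = c(FALSE, FALSE, TRUE, FALSE, FALSE),
    primary = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

## Keying on `word` lets a character value in `i` act as a key lookup
setkey(toy_pos, word)

## All parts of speech recorded for one word
dog_rows <- toy_pos["dog"]

## Primary part of speech only, skipping multi-word entries
single_primary <- toy_pos[primary == TRUE & space == FALSE, ]
```

The same two queries appear in the dataset's own examples further down, run there against the full table.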

trinker · trinker · commit a9a235b24d26 · 2017-01-28T14:22:03.000-05:00
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,13 +1,13 @@
 Package: lexicon
-Title: Lexicons
-Version: 0.1.0
+Title: Lexicons for Text Analysis
+Version: 0.1.1
 Authors@R: c(person("Tyler", "Rinker", email =
         "tyler.rinker@gmail.com", role = c("aut", "cre")))
 Maintainer: Tyler Rinker <tyler.rinker@gmail.com>
 Description: A collection of lexical hash tables, dictionaries, and
         word lists.
 Depends: R (>= 3.2.2)
-Date: 2017-01-15
+Date: 2017-01-28
 License: MIT + file LICENSE
 LazyData: TRUE
 Roxygen: list(wrap = FALSE)
diff --git a/NEWS b/NEWS
@@ -17,24 +17,26 @@ And constructed with the following guidelines:
 * Bug fixes and misc changes bumps the patch
 
 
-lexicon 0.1.0 -
+lexicon 0.1.1 -
 ----------------------------------------------------------------
 
-BUG FIXES
-
 NEW FEATURES
 
-* The `ratings` and `grades` keys from *sentimentr* have been moved to the
-  *lexicon* package and renamed to `key_rating` and `key_grade`.
+* `hash_grady_pos` added to provide a lookup of Grady's parts of speech for words.
+
 
-MINOR FEATURES
+lexicon 0.1.0
+----------------------------------------------------------------
+
+NEW FEATURES
+
+* The `ratings` and `grades` keys from **sentimentr** have been moved to the
+  **lexicon** package and renamed to `key_rating` and `key_grade`.
 
 IMPROVEMENTS
 
 * Added the positive terms 'spot on', 'on time', & 'on point' to `hash_sentiment`.
 
-CHANGES
-
 
 lexicon 0.0.1
 ----------------------------------------------------------------
diff --git a/NEWS.md b/NEWS.md
@@ -17,24 +17,26 @@ And constructed with the following guidelines:
 * Bug fixes and misc changes bumps the patch
 
 
-lexicon 0.1.0 -
+lexicon 0.1.1 -
 ----------------------------------------------------------------
 
-**BUG FIXES**
-
 **NEW FEATURES**
 
-* The `ratings` and `grades` keys from *sentimentr* have been moved to the
-  *lexicon* package and renamed to `key_rating` and `key_grade`.
+* `hash_grady_pos` added to provide a lookup of Grady's parts of speech for words.
+
 
-**MINOR FEATURES**
+lexicon 0.1.0
+----------------------------------------------------------------
+
+**NEW FEATURES**
+
+* The `ratings` and `grades` keys from **sentimentr** have been moved to the
+  **lexicon** package and renamed to `key_rating` and `key_grade`.
 
 **IMPROVEMENTS**
 
 * Added the positive terms 'spot on', 'on time', & 'on point' to `hash_sentiment`.
 
-**CHANGES**
-
 
 lexicon 0.0.1
 ----------------------------------------------------------------
diff --git a/R/hash_grady_pos.R b/R/hash_grady_pos.R
@@ -0,0 +1,30 @@
+#' Grady Ward's Moby Parts of Speech
+#'
+#' A dataset containing a hash lookup of Grady Ward's parts of speech from the
+#' Moby project.  Words with non-ASCII characters have been removed.
+#'
+#' @details
+#' \itemize{
+#'   \item word. The word.
+#'   \item pos. The part of speech; one of: \code{Adjective}, \code{Adverb}, \code{Conjunction}, \code{Definite Article}, \code{Interjection}, \code{Noun}, \code{Noun Phrase}, \code{Plural}, \code{Preposition}, \code{Pronoun}, \code{Verb (intransitive)}, \code{Verb (transitive)}, or \code{Verb (usu participle)}.  Note that the first part of speech for a word is its primary use; all other uses are secondary.
+#'   \item n_pos. The number of parts of speech associated with a word.  Useful for filtering.
+#'   \item space. logical.  If \code{TRUE} the word contains a space.  Useful for filtering.
+#'   \item primary. logical.  If \code{TRUE} the row gives the word's primary part of speech.
+#' }
+#'
+#' @docType data
+#' @keywords datasets
+#' @name hash_grady_pos
+#' @usage data(hash_grady_pos)
+#' @format A data frame with 250,892 rows and 5 variables
+#' @source \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @references Moby Thesaurus List by Grady Ward: \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @examples
+#' \dontrun{
+#' library(data.table)
+#'
+#' hash_grady_pos['dog']
+#' hash_grady_pos[primary == TRUE, ]
+#' hash_grady_pos[primary == TRUE & space == FALSE, ]
+#' }
+NULL
diff --git a/R/lexicon-package.R b/R/lexicon-package.R
@@ -1,4 +1,4 @@
-#' Lexicons
+#' Lexicons for Text Analysis
 #'
 #' A collection of lexical hash tables, dictionaries, and word lists.
 #' @docType package
diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@ developed.](http://www.repostatus.org/badges/0.1.0/active.svg)](http://www.repos
 [![Build
 Status](https://travis-ci.org/trinker/lexicon.svg?branch=master)](https://travis-ci.org/trinker/lexicon)
 [![](http://cranlogs.r-pkg.org/badges/lexicon)](https://cran.r-project.org/package=lexicon)
-<a href="https://img.shields.io/badge/Version-0.1.0-orange.svg"><img src="https://img.shields.io/badge/Version-0.1.0-orange.svg" alt="Version"/></a>
+<a href="https://img.shields.io/badge/Version-0.1.1-orange.svg"><img src="https://img.shields.io/badge/Version-0.1.1-orange.svg" alt="Version"/></a>
 </p>
 <img src="inst/lexicon_logo/r_lexicon.png" width="135" alt="lexicon Logo">
 
@@ -111,98 +111,102 @@ Data
 <td align="left"><p>Emoticons</p></td>
 </tr>
 <tr class="odd">
+<td align="left"><p><code>hash_grady_pos</code></p></td>
+<td align="left"><p>Grady Ward's Moby Parts of Speech</p></td>
+</tr>
+<tr class="even">
 <td align="left"><p><code>hash_power</code></p></td>
 <td align="left"><p>Power Lookup Key</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>hash_sentiment</code></p></td>
 <td align="left"><p>Polarity Lookup Key</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>hash_sentiment_nrc</code></p></td>
 <td align="left"><p>NRC Sentiment Lookup Key</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>hash_sentiword</code></p></td>
 <td align="left"><p>Augmented Sentiword</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>hash_strength</code></p></td>
 <td align="left"><p>Strength Lookup Key</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>hash_syllable</code></p></td>
 <td align="left"><p>Syllable Counts</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>hash_valence_shifters</code></p></td>
 <td align="left"><p>Valence Shifters</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>key_abbreviation</code></p></td>
 <td align="left"><p>Common Abbreviations</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>key_contractions</code></p></td>
 <td align="left"><p>Contraction Conversions</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>key_grade</code></p></td>
 <td align="left"><p>Grades Hash</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>key_rating</code></p></td>
 <td align="left"><p>Ratings Data Set</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>nrc_emotions</code></p></td>
 <td align="left"><p>NRC Emotions</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>pos_action_verb</code></p></td>
 <td align="left"><p>Action Word List</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>pos_adverb</code></p></td>
 <td align="left"><p>Adverb Word List</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>pos_df_pronouns</code></p></td>
 <td align="left"><p>Pronouns</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>pos_interjections</code></p></td>
 <td align="left"><p>Interjections</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>pos_preposition</code></p></td>
 <td align="left"><p>Preposition Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_buckley_salton</code></p></td>
 <td align="left"><p>Buckley &amp; Salton Stopword List</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>sw_dolch</code></p></td>
 <td align="left"><p>Leveled Dolch List of 220 Common Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_fry_100</code></p></td>
 <td align="left"><p>Fry's 100 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>sw_fry_1000</code></p></td>
 <td align="left"><p>Fry's 1000 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_fry_200</code></p></td>
 <td align="left"><p>Fry's 200 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td align="left"><p><code>sw_fry_25</code></p></td>
 <td align="left"><p>Fry's 25 Most Commonly Used English Words</p></td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td align="left"><p><code>sw_onix</code></p></td>
 <td align="left"><p>Onix Text Retrieval Toolkit Stopword List 1</p></td>
 </tr>
diff --git a/data/hash_grady_pos.rda b/data/hash_grady_pos.rda
diff --git a/inst/CITATION b/inst/CITATION
@@ -6,11 +6,11 @@ citEntry(entry = "manual",
     author = "Tyler W. Rinker",
     organization = "University at Buffalo/SUNY",
     address = "Buffalo, New York",
-    note = "version 0.1.0",
-    year = "2016",
+    note = "version 0.1.1",
+    year = "2017",
     url = "http://github.com/trinker/lexicon",
-    textVersion  = paste("Rinker, T. W. (2016).",
+    textVersion  = paste("Rinker, T. W. (2017).",
         "lexicon: Lexicon Data",
-        "version 0.1.0. University at Buffalo. Buffalo, New York.",
+        "version 0.1.1. University at Buffalo. Buffalo, New York.",
         "http://github.com/trinker/lexicon")
 )
diff --git a/inst/scraping_scripts/moby_scrape.R b/inst/scraping_scripts/moby_scrape.R
@@ -0,0 +1,89 @@
+if (!require("pacman")) install.packages("pacman")
+pacman::p_load(tidyverse, textshape, textreadr, data.table, clipr)
+pacman::p_load_current_gh('trinker/acc.roxygen2')
+
+## read in the tar file
+loc <- 'http://www.dcs.shef.ac.uk/research/ilash/Moby/mpos.tar.Z' %>%
+    download()
+
+## untar the file
+untar(loc, exdir = dirname(loc))
+
+
+## read in the .txt files (words and the readme part-of-speech lookup key)
+mobyr <- readLines(file.path(dirname(loc), 'mpos/mobyposi.i'))
+readme <- readLines(file.path(dirname(loc), 'mpos/readme'))
+
+## part of speech symbol lookup key
+pos_key <- readme %>%
+    {grep("\t|\\s{3,}[A-Z]$", ., value = TRUE)}  %>%
+    trimws() %>%
+    stringi::stri_replace_all_regex('\\s{3,}', '\t') %>%
+    stringi::stri_replace_all_regex('(\t)+', '\t')%>%
+    {read.csv(text = ., sep = "\t", header=FALSE, stringsAsFactors = FALSE)} %>%
+    setNames(c('pos', 'tag'))
+
+## create the words to parts of speech lexicon
+hash_grady_pos <- mobyr %>%
+    data_frame(x = .) %>%
+    extract(x, c('word', 'tag'), '(^[^×]+?)×(.+$)') %>%
+    mutate(
+        word = tolower(word),
+        n_pos = nchar(tag),
+        tag = stringi::stri_split_regex(tag, "(?=.)(?<=.)")
+    ) %>%
+    unnest() %>%
+    left_join(pos_key, by = 'tag') %>%
+    filter(!grepl("[^ -~]", word)) %>%
+    mutate(space = grepl("\\s", word)) %>%
+    select(word, pos, n_pos, space) %>%
+    as.data.table()
+
+setkey(hash_grady_pos, 'word')
+
+
+## flag the first row matched for each word as its primary part of speech
+uDT <- unique(hash_grady_pos)
+hash_grady_pos[, "primary":=FALSE]
+hash_grady_pos[uDT, primary:=TRUE, mult="first"][]
+
+
+## test hash
+hash_grady_pos['dog']
+
+hash_grady_pos[pos == 'Pronoun', ]
+table(hash_grady_pos$pos)
+write_clip(capture.output(acc.roxygen2::dat4rox(hash_grady_pos)))
+write_clip(paste(paste0("\\code{", names(table(hash_grady_pos$pos)), "}"), collapse = ", "))
+
+pax::new_data(hash_grady_pos)
+
+
+#' Grady Ward's Moby Parts of Speech
+#'
+#' A dataset containing a hash lookup of Grady Ward's parts of speech from the
+#' Moby project.  Words with non-ASCII characters have been removed.
+#'
+#' @details
+#' \itemize{
+#'   \item word. The word.
+#'   \item pos. The part of speech; one of: \code{Adjective}, \code{Adverb}, \code{Conjunction}, \code{Definite Article}, \code{Interjection}, \code{Noun}, \code{Noun Phrase}, \code{Plural}, \code{Preposition}, \code{Pronoun}, \code{Verb (intransitive)}, \code{Verb (transitive)}, or \code{Verb (usu participle)}.  Note that the first part of speech for a word is its primary use; all other uses are secondary.
+#'   \item n_pos. The number of parts of speech associated with a word.  Useful for filtering.
+#'   \item space. logical.  If \code{TRUE} the word contains a space.  Useful for filtering.
+#'   \item primary. logical.  If \code{TRUE} the row gives the word's primary part of speech.
+#' }
+#'
+#' @docType data
+#' @keywords datasets
+#' @name hash_grady_pos
+#' @usage data(hash_grady_pos)
+#' @format A data frame with 250,892 rows and 5 variables
+#' @source \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @references Moby Thesaurus List by Grady Ward: \url{http://icon.shef.ac.uk/Moby/mpos.html}
+#' @examples
+#' hash_grady_pos['dog']
+#' hash_grady_pos[, .SD[1], by='word']
+NULL
+
+
+
+
diff --git a/man/hash_grady_pos.Rd b/man/hash_grady_pos.Rd
diff --git a/man/lexicon.Rd b/man/lexicon.Rd
