kalidasa is a repository of data from the Digital Corpus of Sanskrit (DCS), designed for programmatic text analysis.
Important
This is a work in progress.
Currently, you should use the devtools package to install kalidasa. Once I
sort out the final organization of the datasets, I plan to submit the package to CRAN.
```r
# install.packages("devtools")
devtools::install_github("mghaight/kalidasa")
```

kalidasa includes three datasets and several helper functions. dcs_meta
contains corpus metadata for each text, including the full title, author, time
period, and subject/genre. dcs_raw contains a list of character vectors for
each text, divided by chapter. dcs_rich contains lemma data and grammatical
analysis in a tidy format.
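
A minimal sketch of how the datasets might be inspected after loading the
package, assuming dcs_meta and dcs_rich are data frames and dcs_raw is a list
indexed by text, as described above:

```r
library(kalidasa)

# Corpus metadata: one row per text (title, author, period, subject/genre)
head(dcs_meta)

# Raw text: one list element per text, each split by chapter
str(dcs_raw[[1]], max.level = 1)

# Tidy lemma and grammatical-analysis data
head(dcs_rich)
```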
There are several helper functions to make querying the data easier. Since all
of the texts and titles are transliterated according to the
IAST
scheme and encoded as UTF-8, kalidasa uses unique text_ids
to interface with the package data. These text_ids are consistent with the
IDs used in the DCS API. The function print_titles lists the available
texts and their text_ids for easy lookup. search_texts does a fuzzy search
of text titles and returns the top matches. get_search returns the dcs_raw
data for the top result of a query. get_text (aptly) gets the
dcs_raw data for a specified text_id and an optional chapter range.
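
A rough usage sketch based on the descriptions above; the example query and
the argument names passed to get_text (text_id, chapters) are illustrative
assumptions, not taken from the package documentation:

```r
library(kalidasa)

# List every available text with its text_id
print_titles()

# Fuzzy-match a title (IAST, UTF-8) and see the top candidates
search_texts("meghaduta")

# Grab the dcs_raw data for the best match of a query
megh <- get_search("meghaduta")

# Or fetch by text_id directly, optionally limiting to a chapter range
# (argument names here are assumed for illustration)
text_subset <- get_text(text_id = 1, chapters = 1:2)
```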
There is also a function remove_stopwords, which can be called on dcs_rich
or on the data for any individual text to remove rows that appear in a custom
stoplist. The stopwords were generated using a hybrid approach of TF-IDF
scores, manual selection, and the method described in this
paper. Lastly, the
function dcs_write writes all of the data to a location on disk and
returns a vector of file paths.
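
A hedged sketch of the filtering and export helpers; passing a destination
directory to dcs_write is an assumption for illustration:

```r
library(kalidasa)

# Remove stoplisted rows from the tidy data (or from one text's subset)
rich_filtered <- remove_stopwords(dcs_rich)

# Write all package data to disk and collect the resulting file paths
# (the destination-directory argument is assumed, not documented here)
paths <- dcs_write(tempdir())
head(paths)
```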
All data was scraped from the DCS, which is prepared by Oliver Hellwig and licensed under CC-BY 4.0.
