EFAuR

Embeddings For Authorship using RoBERTa

Dataset

The dataset used in this project consists of public domain books written in English, with a single, known author, sourced from Project Gutenberg.

Mirror all .txt files from a Project Gutenberg mirror
Download the metadata tarball from gutenberg.org
Extract the metadata tarball and parse all the RDF metadata files within
- Select English books labeled for public domain use with a single, known author
- Write the ID of each book, along with the corresponding author, to a CSV file
Clean and combine the book texts
- Combine all books by the same author into a single file
- Remove Project Gutenberg headers and footers
- Remove readability formatting that may obscure the texts
Create a dataset directory
- Create subdirectories for train, test, and val
- Each subdirectory contains files, each holding a single data point for the model
- Each data point has text1, text2, and a label specifying same or different authorship
  - label is 0 if same authorship, 1 if different
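
The cleaning and pairing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`strip_gutenberg_boilerplate`, `split_chunks`, `make_pairs`), the marker regexes, the fixed-size word chunks, and the alternating same/different sampling are all assumptions made for the example. Only the `text1`, `text2`, and `label` fields and the 0/1 label convention come from the description above.

```python
import random
import re

# Assumed marker format; real Project Gutenberg files vary in wording and spacing.
_START = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)
_END = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END marker lines, if present."""
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def split_chunks(text, words_per_chunk):
    """Split a text into consecutive chunks of a fixed number of words.

    A trailing remainder shorter than words_per_chunk is dropped.
    """
    words = text.split()
    last_start = len(words) - words_per_chunk + 1
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, last_start, words_per_chunk)
    ]


def make_pairs(author_chunks, n_pairs, seed=0):
    """Build rows with text1, text2, and label (0 = same author, 1 = different).

    author_chunks maps an author name to a list of text chunks.
    Even-indexed rows are same-author pairs, odd-indexed rows are
    different-author pairs (an assumed balancing scheme).
    """
    rng = random.Random(seed)
    authors = sorted(author_chunks)
    multi = [a for a in authors if len(author_chunks[a]) >= 2]
    if len(authors) < 2 or not multi:
        raise ValueError("need at least two authors and one author with two chunks")

    rows = []
    for i in range(n_pairs):
        if i % 2 == 0:
            author = rng.choice(multi)
            text1, text2 = rng.sample(author_chunks[author], 2)
            label = 0
        else:
            first, second = rng.sample(authors, 2)
            text1 = rng.choice(author_chunks[first])
            text2 = rng.choice(author_chunks[second])
            label = 1
        rows.append({"text1": text1, "text2": text2, "label": label})
    return rows
```

In this sketch, each author's combined file would be passed through `strip_gutenberg_boilerplate` and `split_chunks`, and the resulting dictionary handed to `make_pairs`. Splitting the rows into train, test, and val, and writing each row to its own file, is left out.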
