Practical Data Science

This repository houses three complete projects concerning Large Language Models (LLMs), data acquisition, qualitative dataset analysis and Natural Language Processing (NLP) tasks. These projects were developed during the MSc in Data Science at the Athens University of Economics and Business.

This supporting repository features smaller projects connected to the ones showcased here. Their scope is more limited, but they contain important exploration, code and experimentation on which these projects were based.

The three projects contained in this repo can be found below:

Greek Proverb LLM Annotation

This project uses an LLM, through prompt engineering, to produce clusterings for a Greek proverb clustering task. It estimates annotator agreement and clustering cohesion, and uses human clusterings to pick the best LLM clusterings. Found here.
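As a rough illustration of comparing an LLM clustering against a human one, the sketch below computes the adjusted Rand index between two label lists. The choice of metric is an assumption made for this example (this README does not state which agreement measure the project uses), and the cluster ids and variable names are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Item pairs placed together by both clusterings, and by each one alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both clusterings
    return (together_both - expected) / (maximum - expected)


# Hypothetical cluster ids for six proverbs (not data from the project).
human = [0, 0, 1, 1, 2, 2]
llm = [1, 1, 0, 0, 2, 2]        # same grouping, different cluster names
llm_noisy = [0, 0, 1, 2, 2, 2]  # one proverb moved to another cluster

print(adjusted_rand_index(human, llm))
print(adjusted_rand_index(human, llm_noisy))
```

The index ignores how clusters are named and only looks at which items are grouped together, which is why a relabelled but identical clustering scores as a perfect match. Ranking several LLM clusterings by their score against a human clustering is one possible way to pick the best one.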

Scraping YouTube for Language Identification and (LLM) Toxicity Detection Tasks

In this project we attempt to achieve the following goals:

Creating a language dataset including Greeklish
Crawling YouTube videos which include both Greek and Greeklish comments
Training a language identification classifier
Training an LLM-based toxicity classifier
Using the LLM classifier to produce data for, and then train, a traditional ML toxicity classifier
Applying our language identification and toxicity classifiers to the crawled YouTube videos to identify interesting facts and trends
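To give a feel for the language identification goal, here is a minimal sketch of a character n-gram classifier that separates Greek, Greeklish and English text. It is an illustrative naive Bayes model written for this README, not the classifier trained in the project; the class name, the language labels and the tiny training set are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Lowercased character n-grams, padded with a space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()

    def fit(self, texts, languages):
        for text, lang in zip(texts, languages):
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.docs[lang] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        best_lang, best_score = None, -math.inf
        for lang, gram_counts in self.counts.items():
            # The extra +1 reserves probability mass for unseen n-grams.
            total = sum(gram_counts.values()) + len(self.vocab) + 1
            score = math.log(self.docs[lang] / total_docs)  # class prior
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / total)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Hypothetical toy training data; a real dataset would be far larger.
train = [
    ("καλημέρα σε όλους", "el"),
    ("πολύ ωραίο βίντεο", "el"),
    ("kalimera se olous", "greeklish"),
    ("poli oraio video file mou", "greeklish"),
    ("good morning everyone", "en"),
    ("this is a very nice video", "en"),
]
model = NgramLanguageIdentifier().fit(*zip(*train))

print(model.predict("ευχαριστώ πολύ"))
print(model.predict("kalimera file mou"))
```

Character n-grams are a common choice for this kind of task because Greeklish shares the Latin alphabet with English but has different letter patterns, so word-level features alone tend to generalise poorly to unseen comments.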

LLM Text Detection

This project is an analysis of LLM-generated text detection, based on this Kaggle Challenge.

We use different LLMs with varying prompts to generate a representative dataset of LLM-generated essays. We analyze the quality of this dataset, create an optimal dataset, and train the best-performing classifier on it. The project comprises the notebook, a README file with extra information about prompting and dataset attribution, and presentation materials.

dimits-ts/practical_data_science