Hebrew NLP Corpora and Data Resources
=====================================

  • Sefaria {Each text is licensed separately} - Structured Jewish texts and metadata with free public licenses, exported from Sefaria's database: a living library of 3,000 years of Jewish texts in Hebrew and English translation.
  • Hebrew Songs Lyrics {CC BY-SA 4.0} - ~15,000 Israeli songs scraped from the Shironet website, covering 167 different singers. Contains only Hebrew characters.
  • 1001 Israeli Pop Songs Dataset {CC BY-NC-ND 4.0} - Manual analyses of 1,001 Israeli pop songs from 1967-2017.
  • Supreme Court of Israel {OpenRAIL} - A 2022 snapshot of the public verdicts and decisions of the Supreme Court of Israel, supported by rich metadata. The 5.31 GB dataset comprises 751,194 documents, containing 2.68 GB of text overall.
  • Heb-Architecture-Corpus {CC BY 4.0} - A Hebrew textual corpus of construction, planning, and architecture. The corpus consists of Hebrew documents from a wide variety of contemporary and historical sources, including legislative decrees, regulatory guidelines, research reports, academic studies, and professional journals. It was built from both born-digital and scanned printed publications, which went through optical character recognition (OCR), cleaning, and parsing. This work was supported by the Israel Innovation Authority.
  • OSCAR {CC BY 4.0} - OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture. A sketch of loading the Hebrew portion appears after this list.
  • CC100 {MIT} - An attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages, including Hebrew. It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots.
  • Old Newspapers {CC0 1.0} - The HC Corpora was a resource containing natural-language text from various newspapers, social media posts, and blog pages in multiple languages. This is a cleaned version of the raw data from the newspaper subset of the HC corpus.
  • TED Talks Transcripts for NLP {CC BY-NC 4.0} - Transcripts and more in 12 languages including Hebrew.
  • Knesset 2004-2005 {Public Domain} - A corpus of transcriptions of Knesset (Israeli parliament) meetings between January 2004 and November 2005. Includes tokenized and morphologically tagged versions of most of the documents in the corpus. <MILA> <Zenodo>
  • The Hebrew Treebank {GPLv3} - The Hebrew Treebank Version 2.0 contains 6500 hand-annotated sentences of news items from the MILA HaAretz Corpus, with full word segmentation and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, like roots, templates and patterns, are not analyzed. This resource can be used freely for research purposes only. (temporarily down)
  • UD Hebrew Treebank {CC BY-NC-SA 4.0} - The Hebrew Universal Dependencies Treebank.
  • IAHLT-HTB {CC BY-NC-SA 4.0} - IAHLT version of the UD Hebrew Treebank. This is a revised fork of the Universal Dependencies version of the Hebrew Treebank, with some important changes and a consistency overhaul involving substantial manual corrections. The dataset was prepared as part of the Hebrew & Arabic Corpus Linguistics Infrastructure project at the Israeli Association of Human Language Technologies (IAHLT).
  • Modern Hebrew Dependency Treebank V.1 {GPLv3} - This is the Modern Hebrew Dependency Treebank which was created and used in Yoav Goldberg's PhD thesis.
  • UD Hebrew IAHLTwiki {CC-BY-SA 4.0} - Publicly available subset of the IAHLT UD Hebrew Treebank's Wikipedia section. The UD Hebrew-IAHLTWiki treebank consists of 5,000 contemporary Hebrew sentences representing a variety of texts originating from Wikipedia entries, compiled by the Israeli Association of Human Language Technology. It includes various text domains, such as: biography, law, finance, health, places, events and miscellaneous.
  • UD Hebrew - IAHLTKnesset {CC BY 4.0} - A Universal Dependencies treebank with named entities for contemporary Hebrew covering Knesset protocols.
  • The MILA corpora collection {GPLv3} - The MILA center has 20 different corpora available for free for non-commercial use. All are available in plain text format, and most have tokenized, morphologically-analyzed, and morphologically-disambiguated versions available too. (temporarily down)
  • NEMO {CC BY 4.0} - Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, and more. The following entity types are tagged: Person, Organization, Geo-Political Entity, Location, Facility, Work-of-Art, Event, Product, Language.
  • MDTEL {MIT} - A dataset of posts from www.camoni.co.il, tagged with medical entities from the UMLS, along with code that recognizes medical entities in Hebrew text.
  • Ben-Mordecai and Elhadad's Corpus {?} - Newspaper articles in different fields: news, economy, fashion, and gossip. The following entity types are tagged: entity names (person, location, organization), temporal expressions (date, time), and numeric expressions (percent, money). Demo
  • The Hebrew Language Corpus - Morphological Annotation (קורפוס השפה העברית - תיוג מורפולוגי) {Open} - An annotated Hebrew database published as part of the Hebrew Language Corpus Project of Israel National Digital Agency and The Academy of the Hebrew Language.
  • HeQ {CC BY 4.0} - A question-answering dataset in Modern Hebrew, consisting of 30,147 questions. The dataset follows the format and crowdsourcing methodology of SQuAD (Stanford Question Answering Dataset) and the original ParaShoot. A team of crowdworkers formulated and answered reading-comprehension questions based on random paragraphs in Hebrew. The answer to each question is a segment of text (span) included in the relevant paragraph. The paragraphs are sourced from two platforms: (1) Hebrew Wikipedia, and (2) Geektime, an online Israeli news channel specializing in technology. A sketch of reading SQuAD-format files appears after this list.
  • ParaShoot {?} - A Hebrew question-answering dataset in the style of SQuAD, created by Omri Keren and Omer Levy. ParaShoot is based on articles scraped from Wikipedia. The dataset contains 3K crowdsourced question-answer pairs, in a setting suitable for few-shot learning.
  • HebWiki QA {?} - SQuAD dataset machine-translated from English to Hebrew with the Google Translation API. The translation process included fixing and removing bad translations.
  • Hebrew-Sentiment-Data Amram et al. {?} - A sentiment analysis benchmark (positive, negative, and neutral sentiment) for Hebrew, based on 12K social media comments, provided in two input formats: token-based and morpheme-based. This is a cleaned version of the Hebrew Sentiment dataset, with a test-train data leakage removed.
  • Emotion User Generated Content (UGC) {MIT} - Collected for the HeBERT model; includes comments posted on news articles from 3 major Israeli news sites between January 2020 and August 2020. The total size of the data is ~150 MB, including over 7 million words and 350K sentences. ~2,000 sentences were annotated by crowd members (3-10 annotators per sentence) for overall sentiment (polarity) and eight emotions: anger, disgust, expectation, fear, happiness, sadness, surprise, and trust.
  • Sentiment HebrewDataset {MIT} - A sentiment analysis dataset containing 75,152 tagged sentences from 3 categories: economy, news (mostly politics), and sport. All sentences were annotated by crowd members (2-5 annotators) with a sentiment label: positive, negative, or neutral. This dataset was created by the SUMIT-AI company, with joint funding from NNLP-IL.
  • HebrewSentiment {?}
  • Knesset Topic Classification {?} - Collected as part of Nitzan Barzilay's project; contains about 2,700 quotes from Knesset meetings, manually classified into eight topics: education, Covid-19, welfare, economics, women and LGBT, health, security, and internal security.
  • Criminal Sentence Classification {OpenRAIL} - This project classifies key aspects of criminal cases within the Israeli legal framework. The project leverages a few-shot learning approach for accurate sentence classification relevant to sentencing decisions.
  • ThinkIL {CC-BY-SA 3.0} - An archive of the writings of Zvi Yanai.
  • HeGeL {?} - A novel dataset for Hebrew geolocation, the first Hebrew NLU benchmark involving both grounding and geospatial reasoning, created from 5,649 crowdsourced, geospatially oriented Hebrew place descriptions of various place types from three cities in Israel.
  • HebNLI {CC BY 4.0} - Based on MultiNLI, a large crowd-sourced corpus of sentences from varied genres and writing styles in the English language. To adapt this resource for Hebrew, the corpus was translated from English using machine translation (Google Gemini). This dataset was created by Webiks for MAFAT, as part of the National Natural Language Processing Plan of Israel.
  • HeSum {?} - A novel benchmark specifically designed for abstractive text summarization in Modern Hebrew. HeSum consists of 10,000 articles paired with their corresponding summaries, sourced from three different Hebrew news websites and written by professional journalists.
  • The HUJI Corpus of Spoken Hebrew {CC BY 4.0} - The corpus project, created by Dr Michal Marmorstein, Nadav Matalon, Amir Efrati, Itamar Folman and Yuval Geva, and hosted by the Hebrew University of Jerusalem (HUJI), aims at documenting naturally occurring speech and interaction in Modern Hebrew. Data come from telephone conversations recorded during the years 2020–2021. Data annotation followed standard methods of Interactional Linguistics (Couper-Kuhlen and Selting 2018). Audio files and transcripts were made freely accessible online.
  • CoSIH - The Corpus of Spoken Hebrew {?} - The Corpus of Spoken Israeli Hebrew (CoSIH) is a database of recordings of spoken Israeli Hebrew.
  • MaTaCOp {?} - A corpus of Hebrew dialogues within the Map Task framework (allowed for non-commercial research and teaching purposes only).
  • HaArchion {?} - Recordings of various Hebrew prose and poetry being read. (temporarily down)
  • Robo-Shaul (רובו-שאול) {?} - Transcribed audio recordings (30 hours) of an Israeli economics podcast (חיות כיס).
  • ivrit.ai {CC BY 4.0} - A comprehensive Hebrew speech dataset designed for AI research and development. It contains approximately 3,300 hours of Hebrew speech, collected from a diverse range of online platforms including podcasts and other audio content.
  • Hebrew Medical Audio Dataset - Verbit {CC BY-NC 4.0} - This dataset is published by Verbit.ai and contains over one thousand audio recordings of invented clinical summaries by 41 different speakers. Each recording is in Hebrew and represents a summary of a patient's visit, providing valuable insights into clinical interactions, diagnosis, treatment plans, and follow-up procedures. The recordings do not contain any personal or private information.
  • HebDB {CC BY 4.0} - Weakly supervised dataset for spoken language processing in the Hebrew language. HebDB offers roughly 2500 hours of natural and spontaneous speech recordings in the Hebrew language. Raw recordings are provided together with a pre-processed, weakly supervised, and filtered version.
  • The BGU morphological lexicon (not yet released)
  • The morphological lexicon of the Israeli National Institute for Testing and Evaluation (not yet released)
  • The MILA lexicon of Hebrew words {GPLv3} - The lexicon was designed mainly for use by morphological analyzers, but is being constantly extended to facilitate other applications as well. It contains about 25,000 entries. Free for non-commercial use. (temporarily down)
  • MILA's Verb Complements Lexicon {GPLv3}
  • Hebrew Psychological Lexicons {CC-BY-SA 4.0} - Natalie Shapira's large collection of Hebrew psychological lexicons and word lists, useful for various psychology applications such as detecting emotional state, well-being, and relationship quality in conversation, identifying topics (e.g., family, work), and many more.
  • Hebrew WordNet {GPLv3} - Hebrew WordNet uses the MultiWordNet methodology and is aligned with the one developed at IRST (and therefore is aligned with English, Italian and Spanish). Free for non-commercial use. (temporarily down)
  • Sentiment lexicon {GPLv3} - Sentiment analysis, the task of automatically detecting whether a piece of text is positive or negative, generally relies on a hand-curated list of words with positive sentiment (good, great, awesome) and negative sentiment (bad, gross, awful). This dataset contains both positive and negative sentiment lexicons for 81 languages.
  • word2word {Apache License 2.0} - Easy-to-use word-to-word translations for 3,564 language pairs. Hebrew is one of the 62 supported languages, and thus word-to-word translation to/from Hebrew is supported for 61 languages. A usage sketch appears after this list.
  • Eran Tomer's Digital Vocalized Text Corpus {Apache License 2.0} - A corpus of digital vocalized Hebrew texts created by Eran Tomer as part of his Master thesis. The corpus is found in the resources folder.
  • MILA's Hebrew Stopwords List {GPLv3} - An Excel XLSX file containing 23,327 Hebrew tokens in descending order of frequency.
  • Tapuz Hebrew Stop Words - A list of the 500 most common words (stop words), computed from discussions on a variety of subjects from the Tapuz People website. (Data files © Original Authors)
  • Stop words {GPLv2} - Stop words in 28 languages.
  • Hebrew verb lists {CC-BY 4.0} - Created by Eran Tomer ([email protected]).
  • Hebrew name lists {CC-BY 4.0} - Lists of street, company, given and last names. Created by Guy Laybovitz.
  • Most Common Hebrew Verbs on Twitter - 1000 most frequent words in Hebrew tweets during (roughly) 2018.
  • KIMA - the Historical Hebrew Gazetteer - Place Names in the Hebrew Script. An open, attestation-based, historical database. Kima currently holds 27,239 places, with 94,650 alternate variants of their names and 236,744 attestations of these variants.
  • Wikidata Lexemes {CC0 1.0} - Over 500K conjugations with morphological analysis, mainly based on Hspell. Can be queried using http://query.wikidata.org/ (a SPARQL sketch appears after this list). Uploaded by Uziel302.
  • Most Common Hebrew Words on Twitter - The most common Hebrew words on Twitter, based on tweets from March 2018 to March 2019.
  • Hebrew WordLists {AGPL-3.0} - Useful word lists extracted from Hspell 1.4 by Eyal Gruss.
  • Hebrew stop words based on the UD {CC-BY-SA 4.0} - A list of stop words in Hebrew, produced using the Universal Dependencies corpora of the Israeli Association of Human Language Technologies (IAHLT). A filtering sketch appears after this list.
  • The Word-Frequency Database for Printed Hebrew - supplies the frequency of occurrence of any Hebrew letter cluster (mean occurrence per million). The corpus was assembled throughout the year 2001, and consists of text downloaded from 914 editions of the three major daily online Hebrew newspapers (Haaretz, Maariv, and Yediot Acharonot). After removing abbreviations, single characters, forms with counts that are less than 3 (mostly typos), and splitting hyphenated forms (vast majority were two words), the corpus totals 554,270 types and 619,835,788 tokens. (©The Hebrew University of Jerusalem)
  • hebrew-w2v {Apache License 2.0} - Iddo Yadlin and Itamar Shefi's word2vec model for Hebrew, trained on the Hebrew Wikipedia dump, tokenized with hebpipe. A loading sketch appears after this list.
  • BEREL {?} - BERT Embeddings for Rabbinic-Encoded Language: DICTA's pre-trained language model (PLM) for Rabbinic Hebrew. A loading sketch appears after this list.
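
As referenced in the OSCAR entry above, here is a minimal sketch of streaming the Hebrew portion through the Hugging Face datasets library. The configuration name below matches older OSCAR releases and is an assumption; newer releases use different dataset ids, so check the dataset card for the release you use.

.. code-block:: python

    # A sketch, not an official recipe: the id/config ("oscar",
    # "unshuffled_deduplicated_he") is an assumption matching older
    # OSCAR releases on the Hugging Face Hub.
    from datasets import load_dataset

    oscar_he = load_dataset(
        "oscar",
        "unshuffled_deduplicated_he",
        split="train",
        streaming=True,  # avoid downloading the full corpus up front
    )
    for i, record in enumerate(oscar_he):
        print(record["text"][:80])
        if i == 2:
            break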
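As referenced in the HeQ entry, SQuAD-style datasets (HeQ, ParaShoot, HebWiki QA) store each answer as a character span of its paragraph. A minimal sketch of walking such a file; the file name is hypothetical, and the exact field layout should be verified against the release you download.

.. code-block:: python

    import json

    # Hypothetical file name; HeQ follows the SQuAD JSON layout
    # (data -> paragraphs -> qas -> answers with character offsets).
    with open("heq_train.json", encoding="utf-8") as f:
        dataset = json.load(f)

    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]
                start = answer["answer_start"]
                span = context[start:start + len(answer["text"])]
                # In SQuAD-style data, each answer is a span of the context.
                assert span == answer["text"]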
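As referenced in the word2word entry, the resource ships as a pip-installable package. A minimal usage sketch; the first call downloads the relevant bilingual lexicon.

.. code-block:: python

    from word2word import Word2word

    # Build English->Hebrew and Hebrew->English lexicons
    # (downloaded automatically on first use).
    en2he = Word2word("en", "he")
    he2en = Word2word("he", "en")

    print(en2he("apple"))  # top Hebrew candidate translations
    print(he2en("שלום"))   # top English candidate translations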
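As referenced in the Wikidata Lexemes entry, the lexemes can be queried programmatically through the public SPARQL endpoint. A minimal sketch fetching a few Hebrew lexemes; wd:Q9288 is the Wikidata item for the Hebrew language.

.. code-block:: python

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    # Fetch ten Hebrew lexemes with their lemmas and lexical categories.
    QUERY = """
    SELECT ?lexeme ?lemma ?categoryLabel WHERE {
      ?lexeme dct:language wd:Q9288 ;
              wikibase:lemma ?lemma ;
              wikibase:lexicalCategory ?category .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """

    response = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "hebrew-nlp-resources-example/0.1"},
    )
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        print(row["lemma"]["value"], "-", row["categoryLabel"]["value"])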
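The stop-word resources above (MILA, Tapuz, UD-based) are plain token lists, typically applied by filtering a tokenized text. A minimal sketch under the assumption of a hypothetical one-token-per-line UTF-8 file:

.. code-block:: python

    # Hypothetical file: one Hebrew stop word per line, UTF-8 encoded.
    with open("hebrew_stopwords.txt", encoding="utf-8") as f:
        stopwords = {line.strip() for line in f if line.strip()}

    # Whitespace splitting is a simplification: Hebrew prefix clitics
    # usually require morphological segmentation before stop-word removal.
    tokens = "זה משפט לדוגמה עם כמה מילים".split()
    content_tokens = [tok for tok in tokens if tok not in stopwords]
    print(content_tokens)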
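As referenced in the hebrew-w2v entry, pretrained word2vec vectors can be loaded with gensim. A minimal sketch; the vectors file name is hypothetical, so substitute whichever file the repository ships.

.. code-block:: python

    from gensim.models import KeyedVectors

    # Hypothetical file name; use binary=True for .bin files,
    # binary=False for text-format vectors.
    wv = KeyedVectors.load_word2vec_format("hebrew-w2v.txt", binary=False)

    print(wv.most_similar("ירושלים", topn=5))  # nearest neighbours
    print(wv.similarity("מלך", "מלכה"))        # cosine similarity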
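As referenced in the BEREL entry, BERT-style PLMs are typically used through the Hugging Face transformers library. A minimal fill-mask sketch; the model id "dicta-il/BEREL" is an assumption to verify against DICTA's model page.

.. code-block:: python

    from transformers import pipeline

    # Model id is an assumption; confirm the exact name on the Hub.
    fill_mask = pipeline("fill-mask", model="dicta-il/BEREL")

    # BERT-style models predict the token behind [MASK];
    # here, a common rabbinic citation formula.
    for prediction in fill_mask("אמר רבי [MASK] משום רבי שמעון"):
        print(prediction["token_str"], round(prediction["score"], 3))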