deploy: 24c2b22

hezarai · Aug 23, 2023
commit aaed20c

Showing 94 changed files with 11,905 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: cdec0220fc6fa5a0b01e3d9fbaabfd7d
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.doctrees/contribute/add_datasets.doctree b/.doctrees/contribute/add_datasets.doctree
diff --git a/.doctrees/contribute/add_docs.doctree b/.doctrees/contribute/add_docs.doctree
diff --git a/.doctrees/contribute/add_models.doctree b/.doctrees/contribute/add_models.doctree
diff --git a/.doctrees/contribute/add_tests.doctree b/.doctrees/contribute/add_tests.doctree
diff --git a/.doctrees/contribute/contribute_to_hezar.doctree b/.doctrees/contribute/contribute_to_hezar.doctree
diff --git a/.doctrees/contribute/index.doctree b/.doctrees/contribute/index.doctree
diff --git a/.doctrees/contribute/pull_requests.doctree b/.doctrees/contribute/pull_requests.doctree
diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle
diff --git a/.doctrees/get_started/index.doctree b/.doctrees/get_started/index.doctree
diff --git a/.doctrees/get_started/installation.doctree b/.doctrees/get_started/installation.doctree
diff --git a/.doctrees/get_started/overview.doctree b/.doctrees/get_started/overview.doctree
diff --git a/.doctrees/get_started/quick_tour.doctree b/.doctrees/get_started/quick_tour.doctree
diff --git a/.doctrees/guide/hezar_architecture.doctree b/.doctrees/guide/hezar_architecture.doctree
diff --git a/.doctrees/guide/index.doctree b/.doctrees/guide/index.doctree
diff --git a/.doctrees/guide/models_in_depth_overview.doctree b/.doctrees/guide/models_in_depth_overview.doctree
diff --git a/.doctrees/guide/train_custom_models.doctree b/.doctrees/guide/train_custom_models.doctree
diff --git a/.doctrees/index.doctree b/.doctrees/index.doctree
diff --git a/.doctrees/source/index.doctree b/.doctrees/source/index.doctree
diff --git a/.doctrees/tutorial/datasets.doctree b/.doctrees/tutorial/datasets.doctree
diff --git a/.doctrees/tutorial/index.doctree b/.doctrees/tutorial/index.doctree
diff --git a/.doctrees/tutorial/models.doctree b/.doctrees/tutorial/models.doctree
diff --git a/.doctrees/tutorial/preprocessors.doctree b/.doctrees/tutorial/preprocessors.doctree
diff --git a/.doctrees/tutorial/training.doctree b/.doctrees/tutorial/training.doctree
diff --git a/.nojekyll b/.nojekyll
diff --git a/_sources/contribute/add_datasets.md.txt b/_sources/contribute/add_datasets.md.txt
@@ -0,0 +1 @@
+# Add a Dataset
diff --git a/_sources/contribute/add_docs.md.txt b/_sources/contribute/add_docs.md.txt
@@ -0,0 +1 @@
+# Contribute to Docs
diff --git a/_sources/contribute/add_models.md.txt b/_sources/contribute/add_models.md.txt
@@ -0,0 +1 @@
+# Add a Model
diff --git a/_sources/contribute/add_tests.md.txt b/_sources/contribute/add_tests.md.txt
@@ -0,0 +1 @@
+# Add Tests
diff --git a/_sources/contribute/contribute_to_hezar.md.txt b/_sources/contribute/contribute_to_hezar.md.txt
@@ -0,0 +1 @@
+# Contribute to Hezar
diff --git a/_sources/contribute/index.md.txt b/_sources/contribute/index.md.txt
@@ -0,0 +1,10 @@
+# Contribute
+
+```{toctree}
+contribute_to_hezar.md
+add_models.md
+add_datasets.md
+add_docs.md
+add_tests.md
+pull_requests.md
+```
diff --git a/_sources/contribute/pull_requests.md.txt b/_sources/contribute/pull_requests.md.txt
@@ -0,0 +1 @@
+# Sending a Pull Request
diff --git a/_sources/get_started/index.md.txt b/_sources/get_started/index.md.txt
@@ -0,0 +1,8 @@
+# Get Started
+```{toctree}
+:maxdepth: 1
+
+overview.md
+installation.md
+quick_tour.md
+```
diff --git a/_sources/get_started/installation.md.txt b/_sources/get_started/installation.md.txt
@@ -0,0 +1,25 @@
+# Installation
+
+#### Install from PyPI
+Installing Hezar is as easy as installing any other Python library. Most of its requirements are cross-platform,
+so they should install without trouble on most machines.
+
+```
+pip install hezar
+```
+#### Install from source
+You can also install the development version of the library directly from source:
+```
+pip install git+https://github.com/hezarai/hezar.git
+```
+
+#### Test installation
+From a Python console or the command line, import `hezar` and check its version:
+```python
+import hezar
+
+print(hezar.__version__)
+```
+```
+0.23.1
+```
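Reviewer note on the installation check above: importing `hezar` to read `__version__` imports the package itself. A lighter check, sketched here on the assumption that the distribution is registered under the name `hezar` (as the `pip install hezar` line suggests), reads the installed version from package metadata using only the standard library. The helper name `installed_version` is hypothetical and not part of Hezar.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    # "hezar" is assumed to be the distribution name, matching the pip command.
    found = installed_version("hezar")
    print(found if found is not None else "hezar is not installed")
```

Returning `None` instead of raising lets a setup script decide for itself whether a missing install is an error.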
diff --git a/_sources/get_started/overview.md.txt b/_sources/get_started/overview.md.txt
@@ -0,0 +1,20 @@
+# Overview
+
+Welcome to Hezar, a library that makes state-of-the-art machine learning for the Persian language as easy as
+possible, built by the Persian community!
+
+Hezar's primary goal is to provide plug-and-play AI/ML utilities, so you don't need to know much about what's
+going on under the hood. Hezar is not just a model library: it also covers the other parts of an ML pipeline,
+such as datasets, trainers, preprocessors, and feature extractors.
+
+Hezar is a library that:
+- brings together all the best works in AI for Persian
+- makes using AI models as easy as a couple of lines of code
+- seamlessly integrates with Hugging Face Hub for all of its models
+- has a highly developer-friendly interface
+- has a task-based model interface that is more convenient for general users
+- is packed with additional tools like word embeddings, tokenizers, feature extractors, etc.
+- comes with many supplementary ML tools for deployment, benchmarking, optimization, etc.
+- and more!
+
+To find out more, just take the [quick tour](quick_tour.md)!
diff --git a/_sources/get_started/quick_tour.md.txt b/_sources/get_started/quick_tour.md.txt
@@ -0,0 +1,151 @@
+# Quick Tour
+Let's take a quick tour of some of Hezar's most important features!
+
+### Models
+A number of ready-to-use trained models for different tasks are available on the Hub. To browse them all, see [hezarai on the Hub](https://huggingface.co/hezarai).
+
+- **Text classification (sentiment analysis, categorization, etc.)**
+```python
+from hezar import Model
+
+example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
+model = Model.load("hezarai/bert-fa-sentiment-dksf")
+outputs = model.predict(example)
+print(outputs)
+```
+```
+{'labels': ['positive'], 'probs': [0.812910258769989]}
+```
+- **Sequence labeling (POS, NER, etc.)**
+```python
+from hezar import Model
+
+pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k")  # Part-of-speech
+ner_model = Model.load("hezarai/bert-fa-ner-arman")  # Named entity recognition
+inputs = ["شرکت هوش مصنوعی هزار"]
+pos_outputs = pos_model.predict(inputs)
+ner_outputs = ner_model.predict(inputs)
+print(f"POS: {pos_outputs}")
+print(f"NER: {ner_outputs}")
+```
+```
+POS: [[{'token': 'شرکت', 'tag': 'Ne'}, {'token': 'هوش', 'tag': 'Ne'}, {'token': 'مصنوعی', 'tag': 'AJe'}, {'token': 'هزار', 'tag': 'NUM'}]]
+NER: [[{'token': 'شرکت', 'tag': 'B-org'}, {'token': 'هوش', 'tag': 'I-org'}, {'token': 'مصنوعی', 'tag': 'I-org'}, {'token': 'هزار', 'tag': 'I-org'}]]
+```
+- **Speech recognition**
+```python
+from hezar import Model
+from datasets import load_dataset
+
+ds = load_dataset("mozilla-foundation/common_voice_11_0", "fa", split="test")
+sample = ds[1001]
+whisper = Model.load("hezarai/whisper-small-fa")
+transcript = whisper.predict(sample["path"])  # or pass `sample["audio"]["array"]` (with the right sample rate)
+print(transcript)
+```
+```
+{'transcription': ['و این تنها محدود به محیط کار نیست']}
+```
+
+### Word Embeddings
+- **FastText**
+```python
+from hezar import Embedding
+
+fasttext = Embedding.load("hezarai/fasttext-fa-300")
+most_similar = fasttext.most_similar("هزار")
+print(most_similar)
+```
+```
+[{'score': 0.7579, 'word': 'میلیون'},
+ {'score': 0.6943, 'word': '21هزار'},
+ {'score': 0.6861, 'word': 'میلیارد'},
+ {'score': 0.6825, 'word': '26هزار'},
+ {'score': 0.6803, 'word': '٣هزار'}]
+```
+- **Word2Vec (Skip-gram)**
+```python
+from hezar import Embedding
+
+word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia")
+most_similar = word2vec.most_similar("هزار")
+print(most_similar)
+```
+```
+[{'score': 0.7885, 'word': 'چهارهزار'},
+ {'score': 0.7788, 'word': '۱۰هزار'},
+ {'score': 0.7727, 'word': 'دویست'},
+ {'score': 0.7679, 'word': 'میلیون'},
+ {'score': 0.7602, 'word': 'پانصد'}]
+```
+- **Word2Vec (CBOW)**
+```python
+from hezar import Embedding
+
+word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia")
+most_similar = word2vec.most_similar("هزار")
+print(most_similar)
+```
+```
+[{'score': 0.7407, 'word': 'دویست'},
+ {'score': 0.7400, 'word': 'میلیون'},
+ {'score': 0.7326, 'word': 'صد'},
+ {'score': 0.7276, 'word': 'پانصد'},
+ {'score': 0.7011, 'word': 'سیصد'}]
+```
+
+### Datasets
+You can load any of the datasets on the [Hub](https://huggingface.co/hezarai) as follows:
+```python
+from hezar import Dataset 
+
+sentiment_dataset = Dataset.load("hezarai/sentiment-dksf")  # A TextClassificationDataset instance
+lscp_dataset = Dataset.load("hezarai/lscp-pos-500k")  # A SequenceLabelingDataset instance
+xlsum_dataset = Dataset.load("hezarai/xlsum-fa")  # A TextSummarizationDataset instance
+...
+```
+
+### Training
+Hezar makes it easy to train models using the out-of-the-box models and datasets provided in the library.
+```python
+from hezar import (
+    BertSequenceLabeling,
+    BertSequenceLabelingConfig,
+    TrainerConfig,
+    SequenceLabelingTrainer,
+    Dataset,
+    Preprocessor,
+)
+
+base_model_path = "hezarai/bert-base-fa"
+dataset_path = "hezarai/lscp-pos-500k"
+
+train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path)
+eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)
+
+model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label))
+preprocessor = Preprocessor.load(base_model_path)
+
+train_config = TrainerConfig(
+    device="cuda",
+    init_weights_from=base_model_path,
+    batch_size=8,
+    num_epochs=5,
+    checkpoints_dir="checkpoints/",
+    metrics=["seqeval"],
+)
+
+trainer = SequenceLabelingTrainer(
+    config=train_config,
+    model=model,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    data_collator=train_dataset.data_collator,
+    preprocessor=preprocessor,
+)
+trainer.train()
+
+trainer.push_to_hub("bert-fa-pos-lscp-500k")  # push model, config, preprocessor, trainer files and configs
+```
+
+Want to go deeper? Check out the [guides](../guide/index.md).
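Reviewer note on the sequence labeling output in the quick tour: the NER example returns one tag per token (`B-org`, `I-org`, ...), which callers usually need to merge into whole entities. The sketch below shows one way to do that in plain Python. It assumes the tags follow the common BIO scheme, that any tag without a `B-` or `I-` prefix (for example `O`) marks a token outside an entity, and that tokens can be joined with a single space. The helper `group_entities` is hypothetical and not part of Hezar.

```python
from typing import Dict, List, Optional


def group_entities(tagged: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge BIO-tagged tokens into entity spans.

    An "I-" tag that does not continue an open span of the same label is dropped.
    """
    entities: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for item in tagged:
        prefix, _, label = item["tag"].partition("-")
        if prefix == "B" and label:
            # Start a new entity span.
            current = {"label": label, "text": item["token"]}
            entities.append(current)
        elif prefix == "I" and current is not None and current["label"] == label:
            # Continue the open span with the next token.
            current["text"] += " " + item["token"]
        else:
            # Outside tag or stray continuation: close any open span.
            current = None
    return entities


if __name__ == "__main__":
    # Token/tag pairs copied from the NER output shown in the quick tour.
    ner_outputs = [[
        {"token": "شرکت", "tag": "B-org"},
        {"token": "هوش", "tag": "I-org"},
        {"token": "مصنوعی", "tag": "I-org"},
        {"token": "هزار", "tag": "I-org"},
    ]]
    for sentence in ner_outputs:
        print(group_entities(sentence))
```

For the quick tour input, the four tokens collapse into a single `org` entity. Dropping stray `I-` tags is a design choice made for this sketch; a more lenient version could treat them as the start of a new span instead.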