Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 7254a81eeaa30845194afa3d93625a5e | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Add a Dataset |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Contribute to Docs |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Add a Model |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Add Tests |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Contribute to Hezar |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# Contribute | ||
|
||
```{toctree} | ||
contribute_to_hezar.md | ||
add_models.md | ||
add_datasets.md | ||
add_docs.md | ||
add_tests.md | ||
pull_requests.md | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Sending a Pull Request |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# Get Started | ||
```{toctree} | ||
:maxdepth: 1 | ||
|
||
overview.md | ||
installation.md | ||
quick_tour.md | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# Installation | ||
|
||
#### Install from PyPI | ||
Installing Hezar is as easy as installing any other Python library! Most of its requirements are cross-platform, so | ||
installing them on any machine is a piece of cake! | ||
|
||
``` | ||
pip install hezar | ||
``` | ||
#### Install from source | ||
You can also install the development version of the library directly from source: | ||
``` | ||
pip install git+https://github.com/hezarai/hezar.git | ||
``` | ||
|
||
#### Test installation | ||
From a Python console or the command line, import `hezar` and check the version: | ||
```python | ||
import hezar | ||
|
||
print(hezar.__version__) | ||
``` | ||
``` | ||
0.23.1 | ||
``` |
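If importing the package is not convenient (for example in a setup script), the installed version can also be read from the package metadata. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Hezar.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("hezar")
    # Print the version when found, otherwise a hint on how to install.
    print(version if version else "hezar is not installed; run `pip install hezar`")
```

Unlike `import hezar`, this check does not load the library itself, so it stays fast even when heavy dependencies are involved.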
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Overview | ||
|
||
Welcome to Hezar! Hezar is a library that makes state-of-the-art machine learning as easy as possible for the Persian | ||
language, built by the Persian community! | ||
|
||
The primary goal of Hezar is to provide plug-and-play AI/ML utilities so that you don't need to know much about what's | ||
going on under the hood. Hezar is more than a model library: it covers every part of an ML pipeline, including | ||
datasets, trainers, preprocessors, feature extractors, etc. | ||
|
||
Hezar is a library that: | ||
- brings together all the best works in AI for Persian | ||
- makes using AI models as easy as a couple of lines of code | ||
- seamlessly integrates with Hugging Face Hub for all of its models | ||
- has a highly developer-friendly interface | ||
- has a task-based model interface which is more convenient for general users | ||
- is packed with additional tools like word embeddings, tokenizers, feature extractors, etc. | ||
- comes with a lot of supplementary ML tools for deployment, benchmarking, optimization, etc. | ||
- and more! | ||
|
||
To find out more, just take the [quick tour](quick_tour.md)! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
# Quick Tour | ||
Let's take a quick tour of some of the most important features of Hezar! | ||
|
||
### Models | ||
There are many ready-to-use trained models for different tasks on the Hub. To browse all of them, see [here](https://huggingface.co/hezarai)! | ||
|
||
- **Text classification (sentiment analysis, categorization, etc.)** | ||
```python | ||
from hezar import Model | ||
|
||
example = ["هزار، کتابخانهای کامل برای به کارگیری آسان هوش مصنوعی"] | ||
model = Model.load("hezarai/bert-fa-sentiment-dksf") | ||
outputs = model.predict(example) | ||
print(outputs) | ||
``` | ||
``` | ||
{'labels': ['positive'], 'probs': [0.812910258769989]} | ||
``` | ||
- **Sequence labeling (POS, NER, etc.)** | ||
```python | ||
from hezar import Model | ||
|
||
pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k") # Part-of-speech | ||
ner_model = Model.load("hezarai/bert-fa-ner-arman") # Named entity recognition | ||
inputs = ["شرکت هوش مصنوعی هزار"] | ||
pos_outputs = pos_model.predict(inputs) | ||
ner_outputs = ner_model.predict(inputs) | ||
print(f"POS: {pos_outputs}") | ||
print(f"NER: {ner_outputs}") | ||
``` | ||
``` | ||
POS: [[{'token': 'شرکت', 'tag': 'Ne'}, {'token': 'هوش', 'tag': 'Ne'}, {'token': 'مصنوعی', 'tag': 'AJe'}, {'token': 'هزار', 'tag': 'NUM'}]] | ||
NER: [[{'token': 'شرکت', 'tag': 'B-org'}, {'token': 'هوش', 'tag': 'I-org'}, {'token': 'مصنوعی', 'tag': 'I-org'}, {'token': 'هزار', 'tag': 'I-org'}]] | ||
``` | ||
- **Speech recognition** | ||
```python | ||
from hezar import Model | ||
from datasets import load_dataset | ||
|
||
ds = load_dataset("mozilla-foundation/common_voice_11_0", "fa", split="test") | ||
sample = ds[1001] | ||
whisper = Model.load("hezarai/whisper-small-fa") | ||
transcript = whisper.predict(sample["path"]) # or pass `sample["audio"]["array"]` (with the right sample rate) | ||
print(transcript) | ||
``` | ||
``` | ||
{'transcription': ['و این تنها محدود به محیط کار نیست']} | ||
``` | ||
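The sequence labeling output above is a list of per-token tags, and downstream code often wants whole entities instead. The sketch below merges consecutive `B-`/`I-` tags into spans. It assumes the tags follow the usual BIO scheme (with `O` for tokens outside any entity, which the output above does not show), and `group_entities` is a hypothetical helper, not a Hezar function.

```python
from typing import Dict, List, Optional


def group_entities(tagged: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge consecutive B-/I- tagged tokens into entity spans."""
    entities: List[Dict[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None

    for item in tagged:
        prefix, _, tag_label = item["tag"].partition("-")
        continues = prefix == "I" and tag_label and tag_label == label
        if continues:
            # Same entity as the previous token: extend the current span.
            tokens.append(item["token"])
            continue
        # Anything else closes the span that was open, if any.
        if tokens:
            entities.append({"text": " ".join(tokens), "label": label})
        if prefix in ("B", "I") and tag_label:
            # Start a new span (a stray I- tag is treated like B-).
            tokens, label = [item["token"]], tag_label
        else:
            # "O" or an unrecognized tag: no open span.
            tokens, label = [], None

    if tokens:
        entities.append({"text": " ".join(tokens), "label": label})
    return entities


# One sentence from the NER output shown above.
ner_sentence = [
    {"token": "شرکت", "tag": "B-org"},
    {"token": "هوش", "tag": "I-org"},
    {"token": "مصنوعی", "tag": "I-org"},
    {"token": "هزار", "tag": "I-org"},
]
print(group_entities(ner_sentence))
```

For the sample sentence this yields a single `org` entity covering all four tokens. Since `predict` returns one list per input, the helper would be applied to each element, e.g. `[group_entities(s) for s in ner_outputs]`.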
|
||
### Word Embeddings | ||
- **FastText** | ||
```python | ||
from hezar import Embedding | ||
|
||
fasttext = Embedding.load("hezarai/fasttext-fa-300") | ||
most_similar = fasttext.most_similar("هزار") | ||
print(most_similar) | ||
``` | ||
``` | ||
[{'score': 0.7579, 'word': 'میلیون'}, | ||
{'score': 0.6943, 'word': '21هزار'}, | ||
{'score': 0.6861, 'word': 'میلیارد'}, | ||
{'score': 0.6825, 'word': '26هزار'}, | ||
{'score': 0.6803, 'word': '٣هزار'}] | ||
``` | ||
- **Word2Vec (Skip-gram)** | ||
```python | ||
from hezar import Embedding | ||
|
||
word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia") | ||
most_similar = word2vec.most_similar("هزار") | ||
print(most_similar) | ||
``` | ||
``` | ||
[{'score': 0.7885, 'word': 'چهارهزار'}, | ||
{'score': 0.7788, 'word': '۱۰هزار'}, | ||
{'score': 0.7727, 'word': 'دویست'}, | ||
{'score': 0.7679, 'word': 'میلیون'}, | ||
{'score': 0.7602, 'word': 'پانصد'}] | ||
``` | ||
- **Word2Vec (CBOW)** | ||
```python | ||
from hezar import Embedding | ||
|
||
word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia") | ||
most_similar = word2vec.most_similar("هزار") | ||
print(most_similar) | ||
``` | ||
``` | ||
[{'score': 0.7407, 'word': 'دویست'}, | ||
{'score': 0.7400, 'word': 'میلیون'}, | ||
{'score': 0.7326, 'word': 'صد'}, | ||
{'score': 0.7276, 'word': 'پانصد'}, | ||
{'score': 0.7011, 'word': 'سیصد'}] | ||
``` | ||
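The docs above do not state how the `score` values are computed. Similarity scores for word embeddings are commonly cosine similarities, so the sketch below shows that measure in plain Python as an illustration only; it is an assumption about the metric, and `cosine_similarity` is a hypothetical helper rather than part of the `Embedding` interface.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional vectors, not real embedding values.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Under this measure, vectors pointing in the same direction score close to 1 regardless of their length, which is why a word and its near-synonyms tend to rank highest.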
|
||
### Datasets | ||
You can load any of the datasets on the [Hub](https://huggingface.co/hezarai) as shown below: | ||
```python | ||
from hezar import Dataset | ||
|
||
sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance | ||
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance | ||
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance | ||
... | ||
``` | ||
|
||
### Training | ||
Hezar makes it super easy to train models using out-of-the-box models and datasets provided in the library. | ||
```python | ||
from hezar import ( | ||
BertSequenceLabeling, | ||
BertSequenceLabelingConfig, | ||
TrainerConfig, | ||
SequenceLabelingTrainer, | ||
Dataset, | ||
Preprocessor, | ||
) | ||
|
||
base_model_path = "hezarai/bert-base-fa" | ||
dataset_path = "hezarai/lscp-pos-500k" | ||
|
||
train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path) | ||
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path) | ||
|
||
model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label)) | ||
preprocessor = Preprocessor.load(base_model_path) | ||
|
||
train_config = TrainerConfig( | ||
device="cuda", | ||
init_weights_from=base_model_path, | ||
batch_size=8, | ||
num_epochs=5, | ||
checkpoints_dir="checkpoints/", | ||
metrics=["seqeval"], | ||
) | ||
|
||
trainer = SequenceLabelingTrainer( | ||
config=train_config, | ||
model=model, | ||
train_dataset=train_dataset, | ||
eval_dataset=eval_dataset, | ||
data_collator=train_dataset.data_collator, | ||
preprocessor=preprocessor, | ||
) | ||
trainer.train() | ||
|
||
trainer.push_to_hub("bert-fa-pos-lscp-500k") # push model, config, preprocessor, trainer files and configs | ||
``` | ||
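The training example hard-codes `device="cuda"`, which fails on machines without a GPU. A small helper can choose the device at runtime. This is a sketch that assumes the trainer runs on PyTorch and that `TrainerConfig` also accepts `device="cpu"`; `pick_device` is a hypothetical name, not a Hezar function.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so a GPU cannot be used through it.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```

Under those assumptions, the config would be built with `TrainerConfig(device=pick_device(), ...)` and the rest of the example stays unchanged.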
|
||
Want to go deeper? Check out the [guides](../guide/index.md). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# Advanced Training | ||
Docs coming soon, stay tuned! |