Commit
0 parents
commit 72146c2
Showing
106 changed files
with
13,343 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 7254a81eeaa30845194afa3d93625a5e | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Binary file not shown.
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Add a Dataset |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Contribute to Docs |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Add a Model |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Add Tests |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Contribute to Hezar |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# Contribute | ||
|
||
```{toctree} | ||
contribute_to_hezar.md | ||
add_models.md | ||
add_datasets.md | ||
add_docs.md | ||
add_tests.md | ||
pull_requests.md | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Sending a Pull Request |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# Get Started | ||
```{toctree} | ||
:maxdepth: 1 | ||
|
||
overview.md | ||
installation.md | ||
quick_tour.md | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# Installation | ||
|
||
#### Install from PyPI | ||
Installing Hezar is as easy as installing any other Python library! Most of the requirements are cross-platform, so | ||
installing them on any machine is a piece of cake! | ||
|
||
``` | ||
pip install hezar | ||
``` | ||
#### Install from source | ||
You can also install the development version of the library directly from source: | ||
``` | ||
pip install git+https://github.com/hezarai/hezar.git | ||
``` | ||
|
||
#### Test installation | ||
From a Python console or the command line, just import `hezar` and check the version: | ||
```python | ||
import hezar | ||
|
||
print(hezar.__version__) | ||
``` | ||
``` | ||
0.23.1 | ||
``` |
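If importing the package fails, it can help to ask the packaging metadata whether `hezar` is installed at all. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of Hezar.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not present in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version("hezar")
    print(found if found is not None else "hezar is not installed in this environment")
```

A `None` result usually means `pip install hezar` ran in a different environment than the one you are importing from.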
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Overview | ||
|
||
Welcome to Hezar! A library that makes state-of-the-art machine learning for the Persian language as easy as possible, | ||
built by the Persian community! | ||
|
||
In Hezar, the primary goal is to provide plug-and-play AI/ML utilities so that you don't need to know much about what's | ||
going on under the hood. Hezar is not just a model library; it also covers everything you need for an ML pipeline, | ||
such as datasets, trainers, preprocessors, feature extractors, etc. | ||
|
||
Hezar is a library that: | ||
- brings together the best work in AI for Persian | ||
- makes using AI models as easy as a couple of lines of code | ||
- seamlessly integrates with the Hugging Face Hub for all of its models | ||
- has a highly developer-friendly interface | ||
- has a task-based model interface, which is more convenient for general users | ||
- is packed with additional tools like word embeddings, tokenizers, feature extractors, etc. | ||
- comes with a lot of supplementary ML tools for deployment, benchmarking, optimization, etc. | ||
- and more! | ||
|
||
To find out more, just take the [quick tour](quick_tour.md)! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
# Quick Tour | ||
Let's take a quick tour of some of the most important features of Hezar! | ||
|
||
### Models | ||
There are many ready-to-use trained models for different tasks on the Hub. To browse all of them, see [here](https://huggingface.co/hezarai)! | ||
|
||
- **Text classification (sentiment analysis, categorization, etc.)** | ||
```python | ||
from hezar import Model | ||
|
||
example = ["هزار، کتابخانهای کامل برای به کارگیری آسان هوش مصنوعی"] | ||
model = Model.load("hezarai/bert-fa-sentiment-dksf") | ||
outputs = model.predict(example) | ||
print(outputs) | ||
``` | ||
``` | ||
{'labels': ['positive'], 'probs': [0.812910258769989]} | ||
``` | ||
- **Sequence labeling (POS, NER, etc.)** | ||
```python | ||
from hezar import Model | ||
|
||
pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k") # Part-of-speech | ||
ner_model = Model.load("hezarai/bert-fa-ner-arman") # Named entity recognition | ||
inputs = ["شرکت هوش مصنوعی هزار"] | ||
pos_outputs = pos_model.predict(inputs) | ||
ner_outputs = ner_model.predict(inputs) | ||
print(f"POS: {pos_outputs}") | ||
print(f"NER: {ner_outputs}") | ||
``` | ||
``` | ||
POS: [[{'token': 'شرکت', 'tag': 'Ne'}, {'token': 'هوش', 'tag': 'Ne'}, {'token': 'مصنوعی', 'tag': 'AJe'}, {'token': 'هزار', 'tag': 'NUM'}]] | ||
NER: [[{'token': 'شرکت', 'tag': 'B-org'}, {'token': 'هوش', 'tag': 'I-org'}, {'token': 'مصنوعی', 'tag': 'I-org'}, {'token': 'هزار', 'tag': 'I-org'}]] | ||
``` | ||
- **Speech recognition** | ||
```python | ||
from hezar import Model | ||
from datasets import load_dataset | ||
|
||
ds = load_dataset("mozilla-foundation/common_voice_11_0", "fa", split="test") | ||
sample = ds[1001] | ||
whisper = Model.load("hezarai/whisper-small-fa") | ||
transcript = whisper.predict(sample["path"]) # or pass `sample["audio"]["array"]` (with the right sample rate) | ||
print(transcript) | ||
``` | ||
``` | ||
{'transcription': ['و این تنها محدود به محیط کار نیست']} | ||
``` | ||
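The NER output shown above is a list of token/tag dicts per input sentence. A common next step is to merge consecutive tags into entity spans. The following is a minimal sketch, assuming the tags follow the usual BIO scheme (`B-x` starts an entity, `I-x` continues it, anything else is outside); `group_entities` is a hypothetical helper, not a Hezar function.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens into (text, label) spans."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Close the span being built, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for item in tagged_tokens:
        prefix, _, label = item["tag"].partition("-")
        if prefix == "B" and label:
            # A B- tag always starts a new span.
            flush()
            current_tokens, current_label = [item["token"]], label
        elif prefix == "I" and label and label == current_label:
            # An I- tag with the same label extends the open span.
            current_tokens.append(item["token"])
        else:
            # "O" tags (or an I- tag with no matching open span) end the span.
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans


# One sentence taken from the NER output format shown above.
sentence = [
    {"token": "شرکت", "tag": "B-org"},
    {"token": "هوش", "tag": "I-org"},
    {"token": "مصنوعی", "tag": "I-org"},
    {"token": "هزار", "tag": "I-org"},
]
print(group_entities(sentence))  # [('شرکت هوش مصنوعی هزار', 'org')]
```

Since `predict` returns one list per input sentence, you would apply the helper to each inner list, for example `[group_entities(s) for s in ner_outputs]`.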
|
||
### Word Embeddings | ||
- **FastText** | ||
```python | ||
from hezar import Embedding | ||
|
||
fasttext = Embedding.load("hezarai/fasttext-fa-300") | ||
most_similar = fasttext.most_similar("هزار") | ||
print(most_similar) | ||
``` | ||
``` | ||
[{'score': 0.7579, 'word': 'میلیون'}, | ||
{'score': 0.6943, 'word': '21هزار'}, | ||
{'score': 0.6861, 'word': 'میلیارد'}, | ||
{'score': 0.6825, 'word': '26هزار'}, | ||
{'score': 0.6803, 'word': '٣هزار'}] | ||
``` | ||
- **Word2Vec (Skip-gram)** | ||
```python | ||
from hezar import Embedding | ||
|
||
word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia") | ||
most_similar = word2vec.most_similar("هزار") | ||
print(most_similar) | ||
``` | ||
``` | ||
[{'score': 0.7885, 'word': 'چهارهزار'}, | ||
{'score': 0.7788, 'word': '۱۰هزار'}, | ||
{'score': 0.7727, 'word': 'دویست'}, | ||
{'score': 0.7679, 'word': 'میلیون'}, | ||
{'score': 0.7602, 'word': 'پانصد'}] | ||
``` | ||
- **Word2Vec (CBOW)** | ||
```python | ||
from hezar import Embedding | ||
|
||
word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia") | ||
most_similar = word2vec.most_similar("هزار") | ||
print(most_similar) | ||
``` | ||
``` | ||
[{'score': 0.7407, 'word': 'دویست'}, | ||
{'score': 0.7400, 'word': 'میلیون'}, | ||
{'score': 0.7326, 'word': 'صد'}, | ||
{'score': 0.7276, 'word': 'پانصد'}, | ||
{'score': 0.7011, 'word': 'سیصد'}] | ||
``` | ||
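The scores in the outputs above look like similarity scores between word vectors. Word embedding libraries typically rank neighbours by cosine similarity; whether Hezar's `most_similar` does exactly this is an assumption here. The sketch below shows the idea on toy two-dimensional vectors, with the hypothetical helper names `cosine_similarity` and `rank_by_cosine`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for zero vectors; treat it as 0.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, vectors, top_n=5):
    """Return the `top_n` words closest to `query`, highest score first."""
    scores = [
        {"score": round(cosine_similarity(vectors[query], vec), 4), "word": word}
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda s: s["score"], reverse=True)[:top_n]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
print(rank_by_cosine("a", toy_vectors, top_n=2))
```

The result uses the same `{'score': ..., 'word': ...}` shape as the outputs above so the two are easy to compare.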
|
||
### Datasets | ||
You can load any of the datasets on the [Hub](https://huggingface.co/hezarai) as shown below: | ||
```python | ||
from hezar import Dataset | ||
|
||
sentiment_dataset = Dataset.load("hezarai/sentiment-dksf") # A TextClassificationDataset instance | ||
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k") # A SequenceLabelingDataset instance | ||
xlsum_dataset = Dataset.load("hezarai/xlsum-fa") # A TextSummarizationDataset instance | ||
... | ||
``` | ||
|
||
### Training | ||
Hezar makes it easy to train models using the out-of-the-box models and datasets provided in the library. | ||
```python | ||
from hezar import ( | ||
    BertSequenceLabeling, | ||
    BertSequenceLabelingConfig, | ||
    TrainerConfig, | ||
    SequenceLabelingTrainer, | ||
    Dataset, | ||
    Preprocessor, | ||
) | ||
|
||
base_model_path = "hezarai/bert-base-fa" | ||
dataset_path = "hezarai/lscp-pos-500k" | ||
|
||
train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path) | ||
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path) | ||
|
||
model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label)) | ||
preprocessor = Preprocessor.load(base_model_path) | ||
|
||
train_config = TrainerConfig( | ||
    device="cuda", | ||
    init_weights_from=base_model_path, | ||
    batch_size=8, | ||
    num_epochs=5, | ||
    checkpoints_dir="checkpoints/", | ||
    metrics=["seqeval"], | ||
) | ||
|
||
trainer = SequenceLabelingTrainer( | ||
    config=train_config, | ||
    model=model, | ||
    train_dataset=train_dataset, | ||
    eval_dataset=eval_dataset, | ||
    data_collator=train_dataset.data_collator, | ||
    preprocessor=preprocessor, | ||
) | ||
trainer.train() | ||
|
||
trainer.push_to_hub("bert-fa-pos-lscp-500k") # push model, config, preprocessor, trainer files and configs | ||
``` | ||
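To get a rough feel for how long such a run takes, you can estimate the number of optimizer steps implied by `batch_size` and `num_epochs`. This is only a back-of-the-envelope sketch: it assumes one optimizer step per batch (no gradient accumulation), the dataset size used is hypothetical, and `total_optimizer_steps` is not a Hezar function.

```python
import math


def total_optimizer_steps(num_samples, batch_size, num_epochs, drop_last=False):
    """Estimate optimizer steps, assuming one step per batch."""
    if drop_last:
        # An incomplete final batch is discarded each epoch.
        steps_per_epoch = num_samples // batch_size
    else:
        # An incomplete final batch still counts as one step.
        steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical dataset size, with the batch size and epoch count from the config above.
print(total_optimizer_steps(num_samples=1000, batch_size=8, num_epochs=5))  # 625
```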
|
||
Want to go deeper? Check out the [guides](../guide/index.md). |