Memory Error: While Preprocessing Tokenizer for Urdu Language

Hello, I want to create a tokenizer for the Urdu language, and I ran this command:

(tpu_data) D:\>python IndicBERT/tokenization/build_tokenizer.py --input "D:\IndicBERT\ur.txt" --output "D:\IndicBERT\output" --vocab_size 250000

![image](https://github.com/AI4Bharat/IndicBERT/assets/145661315/0dba2bc7-aacd-4cc0-af08-d70d36af4426)


After this, as per the instructions, I ran this command:

(tpu_data) D:\>python IndicBERT/process_data/create_mlm_data.py --input_file="D:\IndicBERT\ur.txt" --output_file="D:\IndicBERT\output" --input_file_type=monolingual --tokenizer="D:\IndicBERT\output\config.json"

![image](https://github.com/AI4Bharat/IndicBERT/assets/145661315/f53458bc-2ee9-4c48-92a1-3527b3901db4)


![image](https://github.com/AI4Bharat/IndicBERT/assets/145661315/f4cf98de-27d5-42e4-bf9f-9c76e9f4ea04)

This happened multiple times.


![image](https://github.com/AI4Bharat/IndicBERT/assets/145661315/5b953898-30bd-46d9-8442-70b3da830b9b)



Also, this whole process is not using the GPU.
Here are my specs:

Processor: i7-9700K, 3.6 GHz
RAM: 32 GB
GPU: Nvidia GTX 1660 Ti (6 GB)


I have two questions:

First, how can I resolve this memory error? Is there a way to use the GPU, since this preprocessing is not utilizing it, or should I use Google Colab?

Second, as I only need a tokenizer for the Urdu language, will I have the tokenizer JSON file after the data preprocessing step?
