- Installation
- Quick Start
- Features
- Split Textfile
- Build Parallel Corpus
- Separate Parallel Corpus
- Decontract Words of Sentence
- Remove Punctuation
- Space Punctuation
- Text File to List
- Text File to Dataframe
- List to Text File
- Remove File
- Count Characters of a Sentence
- Count Words of Sentence
- Count No of Lines in a Text File
- Convert Excel to Multiple Text Files
- Merge Multiple Text Files
- Apply Any Function in a Full Text File
Install the latest stable release
For Windows
pip install -U data-preprocessors
For Linux/WSL2
pip3 install -U data-preprocessors
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
>> bla bla bla bla
This function splits a text file into three separate text files: train, validation, and test. By changing the shuffle and seed values, you can randomly shuffle the lines of your text file.
from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
main_file_path="example.txt",
train_file_path="splitted/train.txt",
val_file_path="splitted/val.txt",
test_file_path="splitted/test.txt",
train_size=0.6,
val_size=0.2,
test_size=0.2,
shuffle=True,
seed=42
)
# Total lines: 500
# Train set size: 300
# Validation set size: 100
# Test set size: 100
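To make the effect of the size, shuffle, and seed arguments concrete, here is a minimal plain-Python sketch of a three-way split. The helper name split_lines is hypothetical and this is not the library's own implementation; it only illustrates how a seeded shuffle followed by slicing could give the sizes shown above.

```python
import random


def split_lines(lines, train_size=0.6, val_size=0.2, shuffle=True, seed=42):
    """Hypothetical helper: split a list of lines into train/val/test parts."""
    lines = list(lines)  # work on a copy so the caller's list keeps its order
    if shuffle:
        # A dedicated Random instance makes the shuffle repeatable for a given seed
        random.Random(seed).shuffle(lines)
    n_train = round(len(lines) * train_size)
    n_val = round(len(lines) * val_size)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


lines = [f"line {i}" for i in range(500)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))  # 300 100 100
```

Running the sketch twice with the same seed gives the same three lists, which is the usual reason to expose a seed parameter.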
This function separates src_tgt_file into a separate src_file and tgt_file.
from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
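As an illustration of what separating a parallel corpus involves, the sketch below splits each line on the "|||" separator into a source side and a target side. The helper separate_pairs is hypothetical, works on in-memory lists rather than files, and is not taken from the library.

```python
def separate_pairs(lines, separator="|||"):
    """Hypothetical helper: split 'source ||| target' lines into two lists."""
    src, tgt = [], []
    for line in lines:
        left, sep, right = line.rstrip("\n").partition(separator)
        if not sep:
            continue  # assumption: lines without the separator are skipped
        src.append(left.strip())
        tgt.append(right.strip())
    return src, tgt


pairs = ["hello ||| bonjour\n", "thank you ||| merci\n"]
src, tgt = separate_pairs(pairs)
print(src)  # ['hello', 'thank you']
print(tgt)  # ['bonjour', 'merci']
```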
tp.decontracting_words(sentence)
This function removes the punctuation from a single line of a text file.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
# bla bla bla bla
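For readers curious how punctuation removal can be done with the standard library alone, here is a minimal sketch using str.translate. The name strip_punctuation is hypothetical, and the sketch assumes "punctuation" means the ASCII characters in string.punctuation; the library may define it differently.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNC_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence):
    """Hypothetical helper: drop ASCII punctuation from one sentence."""
    return sentence.translate(_PUNC_TABLE)


print(strip_punctuation("bla! bla- ?bla ?bla."))  # bla bla bla bla
```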
This function adds one space on both sides of each punctuation mark so that the sentence is easier to tokenize. It works on a single line of a text file, but it can also be applied to a full text file.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)
# Each punctuation mark is now surrounded by spaces, e.g. bla ! bla - ? bla ? bla . (exact spacing may differ)
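The idea of spacing out punctuation for tokenization can be sketched with a regular expression. The helper pad_punctuation below is hypothetical and not the library's code; it assumes ASCII punctuation only and also collapses repeated whitespace, which the library may or may not do.

```python
import re
import string

# Character class matching any single ASCII punctuation character
_PUNC_RE = re.compile(f"([{re.escape(string.punctuation)}])")


def pad_punctuation(sentence):
    """Hypothetical helper: put a space on both sides of each punctuation mark."""
    padded = _PUNC_RE.sub(r" \1 ", sentence)
    # Collapse the runs of spaces created by padding and trim the ends
    return " ".join(padded.split())


print(pad_punctuation("bla! bla- ?bla ?bla."))  # bla ! bla - ? bla ? bla .
```

After padding, a plain sentence.split() yields words and punctuation marks as separate tokens.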
Convert any text file into list.
mylist = tp.text2list(myfile_path="myfile.txt")
Convert any list into a text file (filename.txt)
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
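The two conversions are inverses of each other, which a small round trip makes visible. The sketch below uses hypothetical helpers read_lines and write_lines written with plain file I/O; it assumes one list item per line and UTF-8 encoding, and it does not call the library.

```python
import os
import tempfile


def read_lines(path):
    """Hypothetical helper: read a text file into a list, one item per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(lines, path):
    """Hypothetical helper: write a list to a text file, one item per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)


# Round trip through a temporary file so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    write_lines(["first line", "second line"], path)
    round_trip = read_lines(path)

print(round_trip)  # ['first line', 'second line']
```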
This function counts the total number of characters in a sentence.
tp.count_chars(myfile="file.txt")
This function converts the columns of an Excel file into multiple text files.
tp.excel2multitext(excel_file_path="",
column_names=None,
src_file="",
tgt_file="",
aligns_file="",
separator="|||",
src_tgt_file="",
)
In place of function_name, you can pass any function, and that function will be applied to the whole text file.
from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
function_name,
myfile_path="myfile.txt",
modified_file_path="modified_file.txt"
)
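To show the pattern behind applying a single-line function to a whole file, here is a minimal sketch. The helper apply_to_file is hypothetical and not the library's implementation; it assumes the function takes one line (without its newline) and returns a string.

```python
import os
import tempfile


def apply_to_file(func, in_path, out_path):
    """Hypothetical helper: apply func to every line of in_path, write to out_path."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Strip the newline before calling func, then add it back
            dst.write(func(line.rstrip("\n")) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "myfile.txt")
    out_path = os.path.join(tmp, "modified_file.txt")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("hello world\nsecond line\n")

    apply_to_file(str.upper, in_path, out_path)

    with open(out_path, encoding="utf-8") as f:
        result = f.read()

print(result)
```

Processing the file line by line keeps memory use flat even for large corpora, which is one reason to prefer this shape over reading the whole file at once.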