some character sets don't work #22
Comments
Hi, it seems that you only process 100 files? Seems to me that there is not enough data for building up the batches in training.
There are 10 language samples of 100 pairs each. The other ones all work fine; it's just the three above that don't.
I see. It's quite few, to be honest; the model would need more like 10,000 to 100,000 pairs to learn from. Anyway, there does not seem to be enough data to produce a validation batch. Maybe the batch size is too large?
Hi
Doesn't seem so; after preprocessing there should be no further filtering. Could you share your config file? To me it seems that there are not enough validation samples to build up a batch (maybe the batch size is 64?).
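The batch-size hypothesis above can be checked with simple arithmetic. The following is a minimal sketch, not the library's actual batching code: `count_batches` is a hypothetical helper, and whether incomplete batches are dropped is an assumption to verify against the data loader in use.

```python
def count_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader would yield for n_samples items.

    drop_last=True models a loader that discards a final incomplete batch
    (an assumption; check how the real loader is configured).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, remainder = divmod(n_samples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


if __name__ == "__main__":
    # Compare a few validation-set sizes against a batch size of 32.
    for n in (10, 32, 100):
        print(n, count_batches(n, 32), count_batches(n, 32, drop_last=True))
```

If the count comes out as zero for the validation set, the validation loop has nothing to iterate over, which would match the symptom described here.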
As I said above, that can't be the cause. I run this on 10 different languages. Each language has 100 pairs. Only the three languages I cited above fail. I'm appending the master python file and the template yaml file here.

import yaml,re
from dp.preprocess import preprocess
from dp.train import train
from dp.phonemizer import Phonemizer

pfx = '/data/2022G2PST-main/'
infix = 'data/target_languages/'
langs = ['per'] #'ben ger ita per swe tgl tha ukr'.split()
#not working: ben,per
#later not working: tha

#get master yaml data
with open('master.yaml') as f:
    yamldata = yaml.load(f,Loader=yaml.FullLoader)

#go through the languages one by one
for lang in langs:
    print(lang)
    #get character sets
    alpha = set()
    ipa = set()
    datasets = []
    #go through each data file
    for sfx in ['_dev.tsv','_100_train.tsv','_test.tsv']:
        f = open(pfx+infix+lang+sfx,'r')
        t = f.read()
        f.close()
        #remove empty line
        t = t.split('\n')
        t = t[:-1]
        #save the data
        datasets.append(t)
        #test data has no transcription
        if sfx == '_test.tsv':
            for line in t:
                for letter in line:
                    alpha.add(letter)
        #get spelling and transcription for other files
        else:
            for line in t:
                letters,trans = line.split('\t')
                for letter in letters:
                    alpha.add(letter)
                for letter in trans:
                    ipa.add(letter)
    #get rid of the space in the transcription set
    ipa.remove(' ')
    #update yamldata
    yamldata['preprocessing']['languages'] = [lang]
    yamldata['preprocessing']['text_symbols'] = ''.join(alpha)
    yamldata['preprocessing']['phoneme_symbols'] = ''.join(ipa)
    #yamldata['preprocessing']['text_symbols'] = list(alpha)
    #yamldata['preprocessing']['phoneme_symbols'] = list(ipa)
    #make new yaml file
    with open(lang+'.yaml','w') as f:
        yaml.dump(yamldata,stream=f,allow_unicode=True)
    devdata = []
    for line in datasets[0]:
        word,trans = line.split('\t')
        trans = re.sub(' ','',trans)
        devdata.append((lang,word,trans))
    traindata = []
    for line in datasets[1]:
        word,trans = line.split('\t')
        trans = re.sub(' ','',trans)
        traindata.append((lang,word,trans))
    testdata = []
    for line in datasets[2]:
        testdata.append((lang,line))
    preprocess(
        config_file=lang+'.yaml',
        train_data=traindata,
        val_data=devdata,
        deduplicate_train_data=False
    )
    train(config_file=lang+'.yaml')
    phonemizer = Phonemizer.from_checkpoint(
        '/home/mhammond/Desktop/checkpoints/latest_model.pt'
    )
    errors = 0
    #for _,word,trans in devdata:
    for _,word in testdata:
        #trans = re.sub(' ','',trans)
        phonemes = phonemizer(
            word,
            lang=lang
        )
        print(word,phonemes)
        #if phonemes != trans: errors += 1
    #print(f'{lang} errors: {errors}')

paths:
  checkpoint_dir: /home/mhammond/Desktop/checkpoints
  data_dir: /home/mhammond/Desktop/datasets
preprocessing:
  languages: ['de', 'en_us']
  text_symbols: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJ'
  phoneme_symbols: ['a', 'b', 'd', 'e', 'f', 'g', 'h']
  char_repeats: 3
  lowercase: true
  n_val: 5000
model:
  type: 'transformer'
  d_model: 512
  d_fft: 1024
  layers: 6
  dropout: 0.1
  heads: 4
training:
  learning_rate: 0.0001
  warmup_steps: 10000
  scheduler_plateau_factor: 0.5
  scheduler_plateau_patience: 10
  batch_size: 32
  batch_size_val: 32
  epochs: 10
  generate_steps: 10000
  validate_steps: 10000
  checkpoint_steps: 100000
  n_generate_samples: 10
  store_phoneme_dict_in_model: true
Yeah, that's odd. You could look into the data dir: there is a combined_dataset.txt that stores all the processed tuples as text (after removing out-of-dict phonemes and chars). If that looks good, you could unpickle val_dataset.pkl to see whether that looks good too. It might be that too much is being filtered.
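The unpickling step suggested above could look roughly like this. It is a sketch under assumptions: `summarize_pickle` is a hypothetical helper, the internal structure of val_dataset.pkl is not known here (so the code only reports a length and a short preview), and the directory is the data_dir from the config posted earlier.

```python
import pickle
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle and return (size, preview) without assuming its layout."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    # Report a length only if the object has one.
    size = len(data) if hasattr(data, "__len__") else None
    # Preview the first few items for list-like data, else return the object itself.
    preview = list(data)[:3] if isinstance(data, (list, tuple)) else data
    return size, preview


if __name__ == "__main__":
    data_dir = Path("/home/mhammond/Desktop/datasets")  # data_dir from the config
    val_path = data_dir / "val_dataset.pkl"
    if val_path.exists():
        size, preview = summarize_pickle(val_path)
        print("validation entries:", size)
        print("first entries:", preview)
```

A size of zero, or one much smaller than the 100 input pairs, would point to filtering during preprocessing rather than to the batch size.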
I am facing the same issue with the Bengali dataset. The model is just not training. Initially, the loss was always nan, no matter what value I used for the char_repeats hyperparameter. I then modified the ctc_loss argument within the library so that it returns zero when the loss is nan. Even then, the training loss did not decrease.
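One general cause of infinite or nan CTC losses is an input sequence that is too short to be aligned to its target. The sketch below checks for that condition. It assumes that each input character is repeated char_repeats times before the CTC loss is applied, and `ctc_length_violations` is a hypothetical helper, not part of the library. Whether this condition is what affects the Bengali data here is not established.

```python
def ctc_length_violations(pairs, char_repeats):
    """Return (word, phonemes, available, required) for pairs CTC cannot align.

    Assumes the model sees len(word) * char_repeats input frames.
    """
    bad = []
    for word, phonemes in pairs:
        # CTC needs an extra blank frame between identical adjacent target symbols.
        adjacent_repeats = sum(1 for a, b in zip(phonemes, phonemes[1:]) if a == b)
        required = len(phonemes) + adjacent_repeats
        available = len(word) * char_repeats
        if available < required:
            bad.append((word, phonemes, available, required))
    return bad


if __name__ == "__main__":
    # Toy data for illustration only.
    sample = [("ab", "xyz"), ("a", "xxxx"), ("abc", "x")]
    for row in ctc_length_violations(sample, char_repeats=1):
        print(row)
```

If any training pair is reported, raising char_repeats until the list is empty would rule this cause out.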
Hi.
I'm working on this shared task:
https://github.com/sigmorphon/2022G2PST
Some of the character sets work fine, but others do not, specifically: Persian, Bengali, and Thai.
Persian and Bengali fail when training begins. Thai fails at inference.
Any ideas why this might be so?
I'm appending the error below. The problem seems to be in training/trainer.py.
Thank you,
mike h.
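Persian, Bengali, and Thai text commonly contains combining marks or zero-width format characters, which are easy to miss when a symbol set is built character by character. As a diagnostic only, and without claiming that this is the cause of the failures, the following sketch lists each symbol with its code point and Unicode category. `describe_symbols` and `category_counts` are hypothetical helpers that use only the standard library.

```python
import unicodedata
from collections import Counter


def describe_symbols(symbols):
    """Return (code point, Unicode category, name) for each distinct symbol."""
    rows = []
    for ch in sorted(set(symbols)):
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),           # e.g. Ll, Lo, Mn, Mc, Cf
            unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
        ))
    return rows


def category_counts(symbols):
    """Count distinct symbols per Unicode category."""
    return Counter(unicodedata.category(ch) for ch in set(symbols))


if __name__ == "__main__":
    # Illustrative input: a Latin letter, a combining accent, a zero-width non-joiner.
    for row in describe_symbols("a\u0301\u200c"):
        print(row)
```

Symbols in categories such as Mn, Mc, or Cf are worth comparing against what ends up in the generated yaml file and in the processed dataset.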