modify the FormatMetaparameters function in train_lm.py #52

Open · wants to merge 8 commits into base: master · changes from 4 commits
69 changes: 62 additions & 7 deletions scripts/train_lm.py
@@ -144,15 +144,70 @@ def WriteMetaparameters(metaparameters, ngram_order, num_train_sets, out_file):
         i += 1
     f.close()

-def FormatMetaparameters(metaparameters):
+def FormatMetaparameters(metaparameters, num_train_sets):

Contributor:

We only load the num_train_sets variable when args.bypass_metaparameter_optimization != None. If you need to use this value, please move the loading code outside the if condition.
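
A minimal sketch of the suggested restructuring; the helper name, the metadata file name, and work_dir are illustrative assumptions, not actual train_lm.py code:

import os

# Hedged sketch: load num_train_sets unconditionally, so that
# FormatMetaparameters() can use it even when
# --bypass-metaparameter-optimization is not given.
def load_num_train_sets(work_dir):
    # hypothetical metadata file holding the number of training sets
    with open(os.path.join(work_dir, 'num_train_sets')) as f:
        return int(f.read())

The point of the suggestion is simply that this load happens before, not inside, the "if args.bypass_metaparameter_optimization != None:" branch.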

Author:

Thanks for this reminder. I'll modify it.

+    # store metaparameters as floats during checking and rerounding
     out = []
+    # final output metaparameters as strings
+    out_param = []
+    # round all float metaparameters to the 3rd decimal place
+    round_index = 3
     for param in metaparameters:
-        x = '{:.3f}'.format(param)
-        if x == '0.00':
-            x = '{:.3g}'.format(param)
+        x = round(param, round_index)
         out.append(x)

-    return ','.join(out)

+    size = len(metaparameters)
+
+    # check1: whether 0.000 or 1.000 occurs because of rounding.
+    # after this step, no 0.000 or 1.000 occurs unless the original number
+    # equals 0 or 1 (if that happens, the parameters are invalid and we exit
+    # with a warning; this case should be very rare)
+    for i in range(0, size):
+        if out[i] in [0.0, 1.0]:
+            out[i] = RoundToProperDigit(metaparameters[i], round_index)
+            if out[i] in [0.0, 1.0]:
+                sys.exit("train_lm.py: there is a {0}. metaparameters should be in range (0, 1).".format(out[i]))
+
+    # check2: check for repeating values; if any occur, reround them until they differ.
+    # if the values are identical in the original input, exit with a warning
+    # (this should be very rare). after this step there are no repeating values in out
+    for i in range(0, size):
+        # Note: the first num_train_sets weights need not be checked (they may be the same)
+        if i < num_train_sets:
+            out_param.append(format(out[i]))
+        else:
+            x = out[i]
+            for j in range(i + 1, size):
+                y = out[j]
+                # we round to the next digit when we detect repeated values
+                round_digit = round_index + 1
+                round_max_digit = max(len(str(metaparameters[i])) - 2, len(str(metaparameters[j])) - 2)

Contributor:

The '... - 2' here: I guess you are assuming the input must start with '0.', but do you check the range of the input parameters? Since the input to FormatMetaparameters() always comes from the metaparameter file, which is produced either by optimize_metaparameters.py or via --bypass-metaparameter-optimization, I think the check should be in the ParseMetaparameters() function.
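
A quick illustration of the assumption being questioned; the out-of-range examples (1.345, -0.0123) are the ones mentioned in the reply below:

# len(str(m)) - 2 counts decimal digits only when str(m) has the form '0.xxx'
print(len(str(0.123)) - 2)    # 3: correct for '0.123'
print(len(str(1.345)) - 2)    # 3: happens to work, though 1.345 is out of range anyway
print(len(str(-0.0123)) - 2)  # 5: the sign inflates the count ('-0.0123' has 4 decimal digits)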

Author:

Yes, I assumed it starts with '0.'. Do you think I should double-check the input parameters' range to exclude weird cases like 1.345 or -0.0123 from the metaparameter file?
As train_lm.py is currently written, FormatMetaparameters() is only called in the optimization step, when we don't provide bypass-metaparameters, and the output of this function is then used as bypass-metaparameters, right? If so, I think it makes sense to check those parameters in FormatMetaparameters(), and show warnings if necessary, before they are used as bypass-metaparameters and parsed.
I think checking them in ParseMetaparameters() is OK, but it may be too late if the generated bypass-metaparameters are not formatted properly.

Contributor:

Oh yes, it may not be necessary to check the input to FormatMetaparameters(). But it would be better if we checked the args in ParseMetaparameters(), because people can set --bypass-metaparameter-optimization to any value at all, which is not under our control.

Author:

OK, I should have thought about that. But I found that the script validate_metaparameters.py already does this work: (1) the range is in (0, 1); (2) the relationships among the data-type weights, etc. If users provide weird numbers, those will be detected by that script. So I guess validating them in ParseMetaparameters() is redundant. What do you think?
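
For reference, a hedged sketch of the kind of checks being attributed to validate_metaparameters.py; the parameter layout (num_train_sets weights followed by groups of four discounts D1..D4, as the in-diff comments describe) is an assumption of this sketch, not the actual script:

import sys

# Illustrative only; not the real validate_metaparameters.py.
def validate_metaparameters_sketch(metaparameters, num_train_sets):
    # 1) every metaparameter must lie strictly inside (0, 1)
    for m in metaparameters:
        if not 0.0 < m < 1.0:
            sys.exit("metaparameter {0} is outside the range (0, 1)".format(m))
    # 2) after the per-set weights, each order's discounts must be strictly
    #    decreasing: D1 > D2 > D3 > D4
    for i in range(num_train_sets, len(metaparameters) - 3, 4):
        if not (metaparameters[i] > metaparameters[i + 1] >
                metaparameters[i + 2] > metaparameters[i + 3]):
            sys.exit("discounts {0} do not satisfy D1 > D2 > D3 > D4".format(
                metaparameters[i:i + 4]))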

Contributor:

Ok, we can leave the work to validate_metaparameters.py.

chris920820:

The logic, I believe, is still not completely right. For example:
for input metaparameters = [0.1231231,0.1231232,0.1231233,0.12312322]
the output is '0.1231231,0.1231232,0.1231233,0.1231232' (the second and fourth values collide).

Second, I think the reason that input metaparameters = [0.000001,0.000002,0.000003] gives [0,0,0] is that str(0.00000001) is '1e-08', so len(str(0.00000001)) = 5.
So the max precision will be 5 instead of 9. This is a very tricky thing.
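
A quick reproduction of the scientific-notation pitfall described above:

# str() switches to scientific notation for small floats, so len(str(x)) - 2
# stops counting decimal digits
print(str(0.00000001))         # '1e-08'
print(len(str(0.00000001)))    # 5, even though the value has 8 decimal digits
print(len(str(0.000001)) - 2)  # 3: far fewer rounding digits than intended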

Author:

Good catch! Thanks. I'll modify it soon.
Ke

+                while x == y and round_digit <= round_max_digit:
+                    x = round(metaparameters[i], round_digit)
+                    y = round(metaparameters[j], round_digit)
+                    round_digit += 1
+                # validate_metaparameters.py will actually validate these parameters later:
+                # besides the condition that the parameters must lie in range (0, 1),
+                # the parameters D1, D2, D3, D4 for a given n-gram order must satisfy
+                # D1 > D2 > D3 > D4.
+                # here we only check for repeating values, not the ordering relationship
+                if x == y:
+                    sys.exit("train_lm.py: {0} and {1} are the same. metaparameters can not be exactly the same.".format(x, y))
+                out[j] = y

+            # out[i] is now validated and can be appended to the output list out_param
+            out[i] = x
+            out_param.append(format(out[i]))

+    return ','.join(out_param)

+def RoundToProperDigit(number, round_index):
+    round_index += 1
+    x = round(number, round_index)
+    round_max_digit = len(str(number)) - 2
+    while x in [0.0, 1.0] and round_index <= round_max_digit:
+        x = round(number, round_index)
+        round_index += 1
+    return x

 def ParseMetaparameters(encoded_str, ngram_order, num_train_sets):
     metaparameters = encoded_str.split(',')
@@ -342,7 +397,7 @@ def ParseMetaparameters(encoded_str, ngram_order, num_train_sets):
     metaparameters = ReadMetaparameters(metaparam_file)
     LogMessage("You can set --bypass-metaparameter-optimization='{0}' "
                "to get equivalent results".format(
-                   FormatMetaparameters(metaparameters)))
+                   FormatMetaparameters(metaparameters, num_train_sets)))

 # make lm dir
 lm_dir = os.path.join(args.lm_dir, lm_name + '.pocolm')
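
To make the intended behaviour of the new rounding logic concrete, a small standalone demo using only Python's built-in round() (not the actual functions above):

# 0.0004 rounds to 0.0 at 3 decimal places; check1 would then re-round it
# with more digits until it is distinguishable from 0
print(round(0.0004, 3))  # 0.0
print(round(0.0004, 4))  # 0.0004
# two parameters that collide at 3 decimal places become distinct at 4,
# which is what check2's rerounding loop relies on
print(round(0.12344, 3), round(0.12349, 3))  # 0.123 0.123
print(round(0.12344, 4), round(0.12349, 4))  # 0.1234 0.1235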