modify the FormatMetaparameters function in train_lm.py #52

Open: wants to merge 8 commits into master
Changes from 1 commit
110 changes: 73 additions & 37 deletions scripts/train_lm.py
@@ -149,65 +149,101 @@ def FormatMetaparameters(metaparameters, num_train_sets):
    out = []
    # final output metaparameters in string
    out_param = []
    # round all float metaparameters to 3rd decimal places

    # initial check for repeating values
    check_set = ([x for x in metaparameters[num_train_sets:] if metaparameters[num_train_sets:].count(x) > 1])
    if check_set != []:
        sys.exit("train_lm.py: metaparameters contain duplicated numbers. double check them.")

    # round all float metaparameters to the 3rd decimal place
    round_index = 3
    for param in metaparameters:
        assert (param > 0 and param < 1)
        x = round(param, round_index)
        out.append(x)

    size = len(metaparameters)

    # check1: whether 0.000 or 1.000 occur because of rounding
    # check range: whether 0.000 or 1.000 occur because of rounding
    # after this step, no 0.000 or 1.000 occurs unless the original number equals
    # 0 or 1 (if this happens, the parameters are invalid and it exits
    # with a warning, while this case can be very rare)
    # 0 or 1 (if this happens, the parameters are invalid and we exit, while
    # this case can be very rare)
    for i in range(0, size):
        if out[i] in [0.0, 1.0]:
            out[i] = RoundToProperDigit(metaparameters[i], round_index)
            if out[i] in [0.0, 1.0]:
                sys.exit("train_lm.py: there is a {0}. metaparameters should be in range (0, 1).".format(out[i]))

    # check2: check repeating values. if they occur, re-round them until they are different.
    # if identical values exist in the original input, exit with a warning (this can be very rare).
    # after this step, there should not be any repeating values in set out
    for i in range(0, size):
        # Note: the first #num_train_sets weights do not need to be checked (they can be the same)
        if i < num_train_sets:
            out_param.append(format(out[i]))
        else:
            x = out[i]
            for j in range(i + 1, size):
                y = out[j]
                # we will round to the next digit when we detect repeated values
                round_digit = round_index + 1
                round_max_digit = max(len(str(metaparameters[i])) - 2, len(str(metaparameters[j])) - 2)
                while x == y and round_digit <= round_max_digit:
                    x = round(metaparameters[i], round_digit)
                    y = round(metaparameters[j], round_digit)
                    round_digit += 1
                # actually validate_metaparameters.py will validate those parameters later.
                # besides the condition that those parameters should be in
                # range (0, 1), parameters D1, D2, D3, D4 for a certain xgram model
                # should satisfy D1 > D2 > D3 > D4.
                # here we only check the repeating values, not the order relationship
                if x == y:
                    sys.exit("train_lm.py: {0} and {1} are the same. metaparameters can not be exactly the same.".format(x, y))
                out[j] = y

            # out[i] is now valid and can be appended to the output set out_param
            out[i] = x
            out_param.append(format(out[i]))
                sys.exit("train_lm.py: {0}th parameter is {1}. metaparameters should be in range (0, 1)".format(i + 1, out[i]))

    # append the first #num_train_sets parameters to the output set
    # note: the first #num_train_sets parameters (not data-type weights) do not need to be checked
    assert num_train_sets >= 1 and num_train_sets < len(metaparameters)
    for param in out[:num_train_sets]:
        out_param.append(format(param))

    # check for repeating values in the rest of the parameters.
    # after this while loop, there should be no repeating values in set out
    (index1, index2) = FindRepeatingValue(out[num_train_sets:])
    while (index1 != -1):
        i = index1 + num_train_sets
        j = index2 + num_train_sets
        (updated_param1, updated_param2) = AdjustValues(out[i], out[j], i, j, metaparameters[i], metaparameters[j])
        out[i] = updated_param1
        out[j] = updated_param2
        (index1, index2) = FindRepeatingValue(out[num_train_sets:])

    # now the parameters are validated and can be appended to the output set
    for param in out[num_train_sets:]:
        out_param.append(format(param))

    return ','.join(out_param)
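For orientation, here is a minimal, hypothetical illustration of what the final join produces, ignoring the collision and 0/1 handling above (the values are made up, not taken from a real metaparameter file):

# Simplified sketch of the output format only, not the actual train_lm.py logic:
# round each metaparameter to 3 decimal places and join with commas.
metaparams = [0.5, 0.25, 0.81234, 0.40513, 0.22101, 0.15072]  # made-up values
print(','.join(format(round(p, 3)) for p in metaparams))
# -> 0.5,0.25,0.812,0.405,0.221,0.151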

# this function checks whether there are repeating values in parameters (excluding
# the first #number numbers, which are not data-type weights) and, if found, returns
# the indices of the first pair of repeating values
def FindRepeatingValue(parameters):
Contributor commented:

I think the work for excluding the first numbers is not done inside the function. You may need to move that line of comment to the caller.

    #assert (number >= 1) and (number < len(parameters))
    for i in range(0, len(parameters)):
        for j in range(i + 1, len(parameters)):
            if parameters[i] == parameters[j]:
                return(i, j)
    return (-1, -1)
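In line with the reviewer's remark, the exclusion of the first #num_train_sets weights happens at the call site by slicing, not inside the function. A small self-contained sketch of that calling pattern (values are made up; find_repeating_value mirrors the logic above):

def find_repeating_value(parameters):
    # same pairwise scan as FindRepeatingValue above
    for i in range(0, len(parameters)):
        for j in range(i + 1, len(parameters)):
            if parameters[i] == parameters[j]:
                return (i, j)
    return (-1, -1)

num_train_sets = 2
out = [0.5, 0.5, 0.812, 0.405, 0.812]                   # made-up rounded values
(i1, i2) = find_repeating_value(out[num_train_sets:])   # the slice skips the weights
print(i1, i2)   # 0 2; the caller adds num_train_sets back to get indices 2 and 4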

# this function rerounds repeating values until they are different and returns the values after rerounding
# if they are indeed the same (collision), exit with a warning (this case can be very rare)
def AdjustValues(number1, number2, index1, index2, ref_number1, ref_number2):
Owner @danpovey commented on Aug 28, 2016:
I think if two values are the same (collision), it would be better to just print them with full precision rather than die.
The reason is that collisions are only disallowed in certain circumstances: for the same order of discounting parameter, we cannot have d1 == d2 or d2 == d3. But otherwise, collisions are quite possible. Making this cause the script to die is a bad idea.
And we cannot have two data-directory weights be identical [IIRC].

Author commented:

I see. So we only let the script die when there are 1) identical data-directory weights or 2) identical data-type weights for the same order or 3) exact 0 or 1? I think any of the three cases indicates something must be wrong and thus we'd better let the script exit with error when any of them happens. What do you think?

Ke

Owner @danpovey commented:
You could do that-- I just thought it might be simpler to ignore it at this stage. Up to you.

Also, it's only necessary to avoid collisions within the groups of discounting constants for a particular order. But avoiding other collisions where possible may be harmless as long as the program doesn't die.

Author commented:
I agree on making this stage simpler. I'll just print out the three kinds of true collisions I mentioned in the last email. Even if any of the three collisions indeed happens, it can be caught by the validation script later.
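One hypothetical way to follow this plan (the helper name and the fatal/non-fatal split are assumptions, not code from this PR): report a collision with full precision, and only abort in the cases the thread identifies as truly invalid.

import sys

def report_collision(index1, index2, value, fatal=False):
    # Hypothetical helper: repr() keeps the full precision of the value.
    # Only the truly invalid collisions (identical data-directory weights,
    # or identical discounting constants within the same order) would be
    # reported with fatal=True.
    msg = ("train_lm.py: parameters {0} and {1} are both {2!r}"
           .format(index1, index2, value))
    if fatal:
        sys.exit(msg)
    sys.stderr.write(msg + "\n")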

    x = number1
    y = number2
    # set a maximum number for iterations
    round_max_digit = 20
    # round to the next decimal place if a collision is found
    round_position = max(len(str(x)) - 2, len(str(y)) - 2) + 1
    while x == y and round_position <= round_max_digit:
        x = round(ref_number1, round_position)
        y = round(ref_number2, round_position)
        round_position += 1
    # actually validate_metaparameters.py will validate those parameters later.
    # besides the condition that those parameters should be in range (0, 1),
    # parameters D1, D2, D3, D4 for a certain xgram model should satisfy
    # D1 > D2 > D3 > D4. here we only check the repeating values, not the order relationship.
    if x == y:
        sys.exit("train_lm.py: the {0}th and {1}th parameters are both {2}. metaparameters can not be exactly the same.".format(index1, index2, y))
Owner commented:
this could cause a crash for valid metaparameters if they are the same for different orders. The relation D1>D2>D3>D4 is only within a particular order, there can be repeats for different orders. You might need to reorganize the code a bit.


    return (x, y)
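To address the point above, the repeat check could be limited to the groups where repeats are actually disallowed. A hypothetical sketch, not part of this PR, assuming the layout the comments describe: num_train_sets data-directory weights first, then consecutive groups of four discounting constants (D1..D4) per order.

def groups_to_check(metaparameters, num_train_sets):
    # group 1: the data-directory weights (no two of these may be identical)
    groups = [list(metaparameters[:num_train_sets])]
    # remaining values: assumed chunks of four discounting constants per order;
    # only repeats inside one chunk would be flagged.
    rest = list(metaparameters[num_train_sets:])
    for start in range(0, len(rest), 4):
        groups.append(rest[start:start + 4])
    return groups

def has_repeat(group):
    return len(set(group)) != len(group)

# a value that appears in two different orders is fine and is not flagged:
params = [0.5, 0.25, 0.8, 0.6, 0.4, 0.2, 0.8, 0.7, 0.3, 0.1]   # made-up values
print([has_repeat(g) for g in groups_to_check(params, 2)])     # [False, False, False]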

# this function rerounds a number until it is not zero or one
def RoundToProperDigit(number, round_index):
    round_index += 1
    x = round(number, round_index)
    round_max_digit = len(str(number)) - 2
    while x in [0.0, 1.0] and round_index <= round_max_digit:
        x = round(number, round_index)
        round_index += 1
    # since round_max_digit can be smaller than the real number of decimal places,
    # we need to double check x. if x still equals 0 while the original number
    # is not zero, we just round with a safe number of digits, e.g. 20.
    if x == 0 and number > 0:
        x = round(number, 20)

    return x
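A small worked example of why this re-rounding is needed (the value is made up): a tiny but valid parameter collapses to 0.0 at three decimal places, so more digits have to be kept.

print(round(0.0004, 3))   # 0.0    -> would wrongly look like an out-of-range parameter
print(round(0.0004, 4))   # 0.0004 -> RoundToProperDigit keeps adding decimal
                          #           places until the value is not 0.0 or 1.0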

def ParseMetaparameters(encoded_str, ngram_order, num_train_sets):