Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update trainer.py #20473

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Update trainer.py #20473

wants to merge 1 commit into from

Conversation

TheMGGdev
Copy link

Cleaning each time you compile the model the previous trainer metrics also erased the compilation of other models with are a cominbation of that model. Given problems in archiquectures shuch as GAN. As explain in issue:

Cleaning each time you compile the model the previous trainer metrics also erased the compilation of other models with are a cominbation of that model. Given problems in archiquectures shuch as GAN. As explain in issue:
Copy link

google-cla bot commented Nov 8, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@TheMGGdev
Copy link
Author

The issue is: #20474

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.94%. Comparing base (8409e18) to head (848fdde).

❗ There is a different number of reports uploaded between BASE (8409e18) and HEAD (848fdde). Click for more details.

HEAD has 6 uploads less than BASE
Flag BASE (8409e18) HEAD (848fdde)
keras 4 1
keras-torch 1 0
keras-tensorflow 1 0
keras-jax 1 0
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #20473       +/-   ##
===========================================
- Coverage   82.07%   59.94%   -22.14%     
===========================================
  Files         515      515               
  Lines       47416    47415        -1     
  Branches     7439     7439               
===========================================
- Hits        38919    28421    -10498     
- Misses       6691    17246    +10555     
+ Partials     1806     1748       -58     
Flag Coverage Δ
keras 59.94% <ø> (-22.00%) ⬇️
keras-jax ?
keras-numpy 59.94% <ø> (-0.03%) ⬇️
keras-tensorflow ?
keras-torch ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@fchollet
Copy link
Collaborator

fchollet commented Nov 9, 2024

Thanks for the PR!

@hertschuh, @jeffcarp, do you remember why this line was inserted? I believe it was intended to fix a bug, however removing it does not seem to break any test.

If we do remove the line, then we need a test for the use case it was breaking.

@hertschuh
Copy link
Collaborator

Thanks for the PR!

@hertschuh, @jeffcarp, do you remember why this line was inserted? I believe it was intended to fix a bug, however removing it does not seem to break any test.

If we do remove the line, then we need a test for the use case it was breaking.

I don't have any context on this.

This is the PR that added this line to fix some bug: #20197

cc @james77777778

@james77777778
Copy link
Contributor

@fchollet @hertschuh
I'm looking into it and will report back soon.

@james77777778
Copy link
Contributor

james77777778 commented Nov 10, 2024

@fchollet
I believe the change of this PR should be fine. The drawback is that there might be some unused metrics in the sub-trainers.

Previously, that line was necessary to prevent an error caused by the nested Trainer arch.
Here’s a reproducible snippet:

import keras

train_images = keras.random.uniform(shape=(32, 784))
train_labels = keras.random.randint(shape=(32, 1), minval=0, maxval=9)

trainer1 = keras.Sequential(
    [keras.layers.Input(shape=(784,)), keras.layers.Dense(1, activation="relu")]
)
trainer1.compile(loss="mse", metrics=["mse"])  # Not calling `build` for `trainer1`'s `CompileLoss`

# Create another model.
inputs = keras.Input(shape=(784,))
x = trainer1(inputs)
outputs = keras.layers.Dense(10)(x)
trainer2 = keras.Model(inputs=inputs, outputs=outputs)
trainer2.compile(loss="binary_crossentropy", metrics=["accuracy"])
trainer2.fit(
    train_images, keras.utils.to_categorical(train_labels, 10), epochs=2
)
# `fit` might fail because `trainer1`'s `CompileLoss` is not built

However, there is logic that skips the metrics for the sub-trainer:

if isinstance(layer, Trainer):
# All Trainer-related metrics in sublayers should be ignored
# because a new Trainer has been instantiated.
continue

This allows us to keep the metrics in the sub-trainer without encountering the unbuilt issue.

We might want to consider adding a similar test as shown above to prevent future breakages.

@fchollet
Copy link
Collaborator

Thank you, @james77777778 and @hertschuh !

@TheMGGdev are you able to add a basic, minimal unit test for your use case, so we avoid breaking it in the future?

We should also include a test based on @james77777778's snippet above.

@TheMGGdev
Copy link
Author

TheMGGdev commented Nov 12, 2024

I don't know if this is exactly what you are asking for. Here is a more simplified example of the issue 20474. In Keras 3.6 it gives error but in Keras 3 it works. In the issue I explain better the error and where I think the error is. This is the code with the prints for debugging:

import numpy as np

from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential([
            Input(shape = (100, )),
            Dense(100, activation = "sigmoid"),
        ],)
model_2 = Sequential([
            Input(shape = (100, )),
            Dense(80, activation = "sigmoid"),
        ],)

model_1.compile(loss = 'binary_crossentropy', optimizer = Adam(), metrics = ['accuracy'])

###Print for debugging/show the error 
print('---Debugging/show the error after compiled model_1 and before compiled combined---')
print('model_1.compiled ->', model_1.compiled)
print('model_1.metrics ->', model_1.metrics)
print('model_1._loss_tracker ->', model_1._loss_tracker)
###

combined = Model(Input(shape=(100,)), model_2(model_1(Input(shape=(100,)))))
combined.compile(loss = 'binary_crossentropy', optimizer = Adam())

###Print for debugging/show the error
print('---Debugging/show the error after compiled model_1 and combined---')
print('model_1.compiled ->', model_1.compiled)
print('model_1.metrics ->', model_1.metrics)
print('model_1._loss_tracker ->', model_1._loss_tracker)
###

model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))

And this is the code without the prints for the test:

import numpy as np

from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential([
            Input(shape = (100, )),
            Dense(100, activation = "sigmoid"),
        ],)
model_2 = Sequential([
            Input(shape = (100, )),
            Dense(80, activation = "sigmoid"),
        ],)

model_1.compile(loss = 'binary_crossentropy', optimizer = Adam(), metrics = ['accuracy'])

combined = Model(Input(shape=(100,)), model_2(model_1(Input(shape=(100,)))))
combined.compile(loss = 'binary_crossentropy', optimizer = Adam())

model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))

This is the output in Keras 3.6:
aux

@TheMGGdev
Copy link
Author

I think that the solution for both PR is to clean only the layers of the models that don´t have compiled == True, so the one that have not been compiled already. This way it should work for both cases

@TheMGGdev
Copy link
Author

Any update in this pull request?

@fchollet
Copy link
Collaborator

Are you able to provide a unit test?

@TheMGGdev
Copy link
Author

What exactly are you asking for, I guess what I already sent doesn't work, does it?

@fchollet
Copy link
Collaborator

The code snippet you posted (2nd one) can be turned into an appropriate unit test I think.

@mattdangerw
Copy link
Member

@TheMGGdev Francois is just asking to adapt the code above into a unit tests--an addition to a *_test.py that will run automatically against every future code change. That's a general expectation for us with bug fixes so we don't break ourselves in the future. See this recent PR as a arbitrary example #20550

Are you able to add a unit test here?

@Surya2k1
Copy link
Contributor

Surya2k1 commented Dec 3, 2024

And this is the code without the prints for the test:

It seems the code snippet breaks in 3.7.0 with the proposed change.

3.7.0
2024-12-03 14:35:41.341850: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 89, in <module>
    multi_compile_test()
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 87, in multi_compile_test
    combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 598, in train_on_batch
    logs = self.train_function(data())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 224, in function
    outputs = one_step_on_data(data)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\.venv\Lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 110, in one_step_on_data
    outputs = self.distribute_strategy.run(step_function, args=(data,))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 56, in train_step
    y_pred = self(x, training=True)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\utils\traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
        ^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\ops\function.py", line 179, in _run_through_graph
    output_tensors.append(tensor_dict[id(x)])
                          ~~~~~~~~~~~^^^^^^^
KeyError: 'Exception encountered when calling Functional.call().\n\n\x1b[1m2425813381776\x1b[0m\n\nArguments received by Functional.call():\n  • inputs=tf.Tensor(shape=(64, 100), dtype=float64)\n  • training=True\n  • mask=None'

@gbaned
Copy link
Collaborator

gbaned commented Jan 8, 2025

Hi @TheMGGdev Any update on this PR? Please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Assigned Reviewer
Development

Successfully merging this pull request may close these issues.

8 participants