Description
Hi there,
I have been trying to run the MaxCNN model and was wondering about this section of the code:
```python
def forward(self, x):
    # Allocate the per-frame ConvNet output buffer on GPU if the input is on cuda:0, else on CPU
    if x.get_device() == 0:
        tmp = torch.zeros(x.shape[0], x.shape[1], 128, 4, 4).cuda()
    else:
        tmp = torch.zeros(x.shape[0], x.shape[1], 128, 4, 4).cpu()
    # Run the shared ConvNet stack on each of the 7 frames independently
    for i in range(7):
        tmp[:, i] = self.pool1(F.relu(self.conv7(self.pool1(F.relu(self.conv6(F.relu(self.conv5(self.pool1(F.relu(self.conv4(F.relu(self.conv3(F.relu(self.conv2(F.relu(self.conv1(x[:, i])))))))))))))))))
    # Flatten each frame's 128x4x4 feature map into a 2048x1 column, then max-pool
    x = tmp.reshape(x.shape[0], x.shape[1], 4 * 128 * 4, 1)
    x = self.pool(x)
    x = x.view(x.shape[0], -1)
    x = self.fc2(self.fc(x))
    x = self.max(x)
    return x
```
In particular, the `self.pool` layer appears to apply a max-pool kernel of size `(n_window, 1)` (i.e. (7, 1)) over the reshaped tensor of shape `(x.shape[0], x.shape[1], 2048, 1)`. The result is that the kernel of height 7 slides down the 2048 rows of each frame's flattened feature vector and takes the maximum of each 7-row receptive field. I was expecting the max-pooling to occur across the frames (i.e. the temporal dimension, `x.shape[1]`), but that isn't what happens here. I wonder whether this is an error in translating from the original code: the paper describes this model as one that "performs max-pooling over ConvNet outputs across time frames", and the kernel size does seem intended to match the number of frames (`n_window`), but the kernel is then applied along the feature dimension rather than across frames.
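For concreteness, here is a minimal shape check of what I'm describing, assuming `self.pool` is `nn.MaxPool2d((n_window, 1))` (that's my reading of the kernel, not a quote from the repo), with a random tensor standing in for the ConvNet outputs:

```python
import torch
import torch.nn as nn

n_window = 7
pool = nn.MaxPool2d((n_window, 1))  # assumed definition of self.pool

# Stand-in for tmp after the reshape: (batch, n_window, 4*128*4, 1)
tmp = torch.randn(2, n_window, 2048, 1)

out = pool(tmp)
print(out.shape)  # torch.Size([2, 7, 292, 1]): pooled *within* each frame's 2048 features

# What I expected instead: a max over the frame (temporal) dimension
expected = tmp.max(dim=1).values
print(expected.shape)  # torch.Size([2, 2048, 1]): one max per feature, taken across the 7 frames
```

So each frame keeps 292 pooled values of its own, rather than the 7 frames being collapsed into a single 2048-feature vector.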
I note that the first FC layer is defined as follows:
```python
self.fc = nn.Linear(n_window * int(4 * 4 * 128 / n_window), 512)
```
which takes into account that the 2048 rows aren't evenly divisible by the max-pool kernel height of 7, but it seems un-neat, which makes me wonder whether this is what was originally intended.
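To spell out the divisibility arithmetic behind that `in_features` value (again assuming the (7, 1) pooling kernel):

```python
n_window = 7
# MaxPool2d((7, 1)) with its default stride of 7 produces floor(2048 / 7) = 292
# outputs per frame, so int(4*4*128 / n_window) matches the pooled size per frame.
pooled_per_frame = (4 * 4 * 128) // n_window   # 292
fc_in = n_window * pooled_per_frame            # 7 * 292 = 2044, not 2048
print(fc_in)  # 2044 — the in_features of self.fc above
```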
Many thanks!