Problem Localization Approach for Maintaining Inference Accuracy During Training

1. System Environment

Hardware Environment (Ascend/GPU/CPU): Ascend

Software Environment:

– MindSpore version: 1.8.0

– Execution mode: Static graph (GRAPH)

– Python version: 3.7.5

– Operating system platform: Linux

2. Problem

The network trains and evaluates simultaneously, running inference once per epoch. However, the inference accuracy remains constant across epochs.

3. Diagnosis Approach and Process

  1. A checkpoint (ckpt) is saved every epoch. Loading consecutive ckpts shows that all parameters change between epochs, indicating that the parameters are being updated normally.
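The check in step 1 can be sketched as follows. This is a minimal, framework-agnostic illustration: the two dicts stand in for checkpoints loaded from ckpt files, and the helper name and tolerance are assumptions, not part of the original post.

```python
import numpy as np

def params_changed(ckpt_a, ckpt_b, atol=1e-8):
    """Return the names of parameters that differ between two checkpoints.

    Each checkpoint is a dict mapping parameter name -> numpy array,
    which is roughly the shape a loaded ckpt gives you.
    """
    changed = []
    for name, value in ckpt_a.items():
        if name in ckpt_b and not np.allclose(value, ckpt_b[name], atol=atol):
            changed.append(name)
    return changed

# Toy stand-ins for the checkpoints saved after epoch 1 and epoch 2.
epoch1 = {"dense.weight": np.zeros((2, 2)), "dense.bias": np.zeros(2)}
epoch2 = {"dense.weight": np.ones((2, 2)), "dense.bias": np.zeros(2)}

print(params_changed(epoch1, epoch2))  # ['dense.weight']
```

If every parameter name appears in the returned list, the optimizer is updating the whole network; an empty list would instead point to a frozen network or a broken optimizer step.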

  2. Suspect that the issue may be with eval. Run inference with a ckpt converted from a PyTorch pth file; the accuracy is similar, suggesting that the forward pass and inference logic are unlikely to be the problem.

  3. Suspect that the functional programming style may not be fully supported. Switch to the model.train format for testing; the loss trend is similar, ruling out this possibility.

  4. Print the forward outputs during eval. This is a three-class classification network, and every prediction is class 2, so the training is very likely not taking effect.
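The symptom in step 4 (every sample predicted as the same class) is easy to detect programmatically. Below is a small sketch, assuming the eval forward pass yields an (N, C) array of logits; the function name and the toy logits are illustrative, not from the original network.

```python
import numpy as np

def prediction_collapse(logits):
    """Given an (N, C) array of logits, report whether every sample
    gets the same predicted class -- a common symptom that training
    is not actually taking effect."""
    preds = logits.argmax(axis=1)
    collapsed = len(set(preds.tolist())) == 1
    return collapsed, preds

# Toy logits from a 3-class head where class 2 always wins.
logits = np.array([[0.1, 0.2, 1.5],
                   [0.0, 0.3, 2.0],
                   [0.4, 0.1, 1.1]])
collapsed, preds = prediction_collapse(logits)
print(collapsed, preds.tolist())  # True [2, 2, 2]
```

A collapsed output distribution like this explains a constant accuracy: the model always scores whatever fraction of the eval set happens to belong to the dominant class.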

  5. Align with the PyTorch script: first check the loss function, optimizer, and learning rate, which are consistent. Then compare the parameter initialization, whose logic looks the same in both scripts. Printing the parameter names inside the initialization method reveals that the original intention was to exclude the pre-trained BERT model's parameters from re-initialization. However, although the code filters cells by type, it then fetches parameters directly via self.get_parameters(), which should be cell.get_parameters().

Finally, the problematic code was found to be:

```python
for cell in self.cells():
    if type(cell) != BertModel:
        for param in self.get_parameters():
            xxxxx
```

self.get_parameters() retrieves the parameters of the entire network, not just the current cell's parameters, so the BERT parameters are re-initialized despite the type filter; the fix is to call cell.get_parameters() instead.
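The scoping difference can be demonstrated with toy stand-ins for the framework classes. The Cell and BertModel classes below are simplified assumptions that only mimic the cells()/get_parameters() traversal, with `net` playing the role of `self` in the snippet above.

```python
class Cell:
    """Minimal stand-in for a framework Cell: owns parameters and child cells."""
    def __init__(self, params=(), children=()):
        self._params = list(params)
        self._children = list(children)

    def cells(self):
        # Immediate child cells.
        return iter(self._children)

    def get_parameters(self):
        # Parameters of this cell AND all of its descendants.
        yield from self._params
        for child in self._children:
            yield from child.get_parameters()

class BertModel(Cell):
    pass

bert = BertModel(params=["bert.weight"])
classifier = Cell(params=["head.weight"])
net = Cell(children=[bert, classifier])

# Buggy version: the type filter runs, but parameters are still collected
# from the whole network, so BERT parameters leak into initialization.
buggy = [p for cell in net.cells() if type(cell) != BertModel
         for p in net.get_parameters()]

# Fixed version: collect only the current (non-BERT) cell's parameters.
fixed = [p for cell in net.cells() if type(cell) != BertModel
         for p in cell.get_parameters()]

print(buggy)  # ['bert.weight', 'head.weight']
print(fixed)  # ['head.weight']
```

With the fix, the pre-trained BERT weights survive initialization untouched, so the network trains from a useful starting point instead of collapsing to a single output class.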