3. Migration Tutorial

This document outlines the process for integrating MLSDK into your PyTorch application and migrating to the MN-Core series.

3.1. Migration Process

When migrating your code, expand what runs on MN-Core 2 incrementally. For example, if your model already runs on a GPU or another backend, start by migrating inference with the trained model, and verify that it works before moving on to training.

Here, we’ll examine a specific migration workflow using the MNCoreClassifier model, as introduced in Machine Learning Tutorial.

Listing 3.1 /opt/pfn/pfcomp/codegen/MLSDK/examples/mnist_common.py
class MNCoreClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(1024, 256)
        self.linear2 = torch.nn.Linear(256, 10)

    def forward(self, x, t, **args):
        x_reshaped = x.reshape(x.size(0), -1)
        x1 = self.linear1(x_reshaped)
        x2 = torch.nn.functional.relu(x1)
        y = self.linear2(x2)
        loss = torch.nn.functional.cross_entropy(y, t)
        if self.training:
            return {"loss": loss}
        else:
            return {"y": y, "loss": loss}

The mnist.py script in the /opt/pfn/pfcomp/codegen/MLSDK/examples/ directory runs both training and inference for the MNCoreClassifier model on MN-Core 2. The corresponding PyTorch-only implementations of these two processes are the mnist_train.py and mnist_infer.py scripts.

We’ll proceed with the following sequence:

  1. Verify that the original, pre-migration program runs correctly

  2. Test execution with pfvm:cpu

  3. Test execution with mncore2:auto

3.1.1. Inference Process

Listing 3.2 /opt/pfn/pfcomp/codegen/MLSDK/examples/mnist_infer.py
def main(checkpoint_path: str, outdir: str, option_json_path: Optional[Path], device_str: str) -> None:
    batch_size = 64
    eval_batch_size = 125

    _, eval_loader = mnist_loaders(batch_size, eval_batch_size)

    checkpoint = torch.load(checkpoint_path)

    model_with_loss_fn = MNCoreClassifier()
    model_with_loss_fn.load_state_dict(checkpoint["model_state_dict"])
    model_with_loss_fn.eval()

    def eval_step(inp: Mapping[str, torch.Tensor]) -> Mapping[str, torch.Tensor]:
        x = inp["x"]
        t = inp["t"]
        output = model_with_loss_fn(x, t)
        y = output["y"]
        _, predicted = torch.max(y, 1)
        correct = (predicted == t).sum()
        return {"correct": correct}

    correct = 0
    for sample in eval_loader:
        correct += eval_step(sample)["correct"]
    print(
        f"Correct: {correct} / {len(eval_loader.dataset)}. "
        f"Accuracy: {correct / len(eval_loader.dataset)}"
    )
    assert 0.95 < correct / len(eval_loader.dataset)

Verifying the mnist_infer.py Implementation

First, ensure the inference process runs correctly in PyTorch by following the instructions in Example: Inference MNIST. Your trained checkpoint should be saved at /tmp/mlsdk_mnist/checkpoint.pt if you have already executed Example: MNIST on MN-Core 2.

Your verification is complete when the output matches the inference results obtained during training.

Testing with pfvm:cpu

Next, modify the mnist.py script to compile and call the eval_step function, as demonstrated in this example. Upon executing the modified script, specify --device as pfvm:cpu to use the PFVM runtime for processing.

You can also pass compilation options via --option_json. Below is an example JSON configuration that instructs the compiler to generate Compiled ONNX output. With an unmodified mnist.py, <codegen_dir> defaults to /tmp/mlsdk_mnist_infer/eval_step.

Listing 3.3 Example of --out_onnx configuration
{
    "args": [
        "--out_onnx=<codegen_dir>/pfvm.onnx"
    ]
}
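
As a minimal sketch, the same option file could be generated from Python with the standard json module. The helper name write_option_json and the output file name option.json are hypothetical and not part of MLSDK; only the "args" key and the --out_onnx flag come from the listing above.

```python
import json
import tempfile
from pathlib import Path


def write_option_json(path: Path, codegen_dir: str) -> dict:
    """Write a compile-option file equivalent to the listing above.

    Hypothetical helper for illustration; it is not an MLSDK API.
    """
    options = {"args": [f"--out_onnx={codegen_dir}/pfvm.onnx"]}
    path.write_text(json.dumps(options, indent=4) + "\n")
    return options


if __name__ == "__main__":
    # The codegen directory below is the default mentioned in the text.
    out = Path(tempfile.gettempdir()) / "option.json"
    write_option_json(out, "/tmp/mlsdk_mnist_infer/eval_step")
    print(out.read_text())
```

The resulting file path is what you would then pass to --option_json.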

If execution completes successfully, you should find two ONNX files in the codegen_dir directory: model.onnx (Exported ONNX) and pfvm.onnx (Compiled ONNX). These ONNX files can be visualized using Netron, which is integrated into the Codegen Dashboard.

If the execution terminates abnormally, check how your script differs from mnist.py and consult Common Errors and Solutions for troubleshooting. If the execution completes normally but produces abnormal results, visualizing these models can help you identify where the processing steps diverge.

First, let’s examine Fig. 3.1, which visualizes the model.onnx file.

mnist_infer_model.png

Fig. 3.1 Visualizing model.onnx (Exported ONNX)

This visualization clearly shows that eval_step’s input/output variables (x, t, correct) correspond to the ONNX’s input/output parameters. Additionally, we observe that the output of Transpose serves as the right input for Gemm, while the unused branch of torch.max (ReduceMax) remains intact. As such, at the Exported ONNX stage, the computation graph closely mirrors the original PyTorch operations.

The PFVM-compiled version of this computation graph is pfvm.onnx, which we will now examine in Fig. 3.2.

mnist_infer_pfvm.png

Fig. 3.2 Visualizing pfvm.onnx (Compiled ONNX)

PFVM has applied several optimizations to the original computation graph, resulting in a streamlined graph. Compared to the original PyTorch implementation, these optimizations can improve both memory usage and execution speed, including when the PFVM backend targets GPUs (pfvm:cuda) and not only CPUs or MN-Core 2. The optimizations visible in this graph are:

  • Constantization of the shape input parameter in Reshape

  • Operator Fusion

    • Eliminates the Transpose operation on the right input of Gemm by setting transB=1

    • Consolidates consecutive Cast and ReduceSum operations into a single ChainerCastReduceSum operator

  • Eliminates redundant operations surrounding ReduceMax when the result is not used
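
To illustrate the Gemm fusion listed above, the following pure-Python sketch checks that a Transpose followed by Gemm gives the same result as a single Gemm that reads its right input as if transposed (transB=1). The gemm function here is a simplified stand-in written for this example (alpha = beta = 1, bias broadcast over rows); it is not PFVM's implementation, and the small matrices are made up.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def gemm(a: Matrix, b: Matrix, c: List[float], trans_b: int = 0) -> Matrix:
    """Simplified ONNX-style Gemm: Y = A @ B' + C with alpha = beta = 1."""
    if trans_b:
        # Read B row by row, i.e. as (out, in), without materializing B^T.
        return [[sum(x * w for x, w in zip(row, b_row)) + bias
                 for b_row, bias in zip(b, c)] for row in a]
    b_cols = transpose(b)
    return [[sum(x * w for x, w in zip(row, col)) + bias
             for col, bias in zip(b_cols, c)] for row in a]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # (batch=2, in=3)
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # (out=2, in=3), as torch.nn.Linear stores it
bias = [0.1, 0.2]

exported = gemm(x, transpose(weight), bias)   # Transpose, then Gemm
compiled = gemm(x, weight, bias, trans_b=1)   # single Gemm with transB=1
assert exported == compiled
```

Because torch.nn.Linear keeps its weight as (out_features, in_features), reading it with transB=1 avoids a separate Transpose node.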

Note

Many of PFVM’s custom ONNX operators have name prefixes of MNCore or Chainer.

Comparing this graph with the implementation of eval_step, and checking for any discrepancies in how ONNX represents these operations, fulfills the purpose of the visualization.

Verification with mncore2:auto

Finally, let’s verify that eval_step functions correctly by specifying mncore2:auto as the device via the --device option.

If the execution completes successfully, you should find l3ir_stripped.onnx.zst in the codegen_dir directory. Decompress this file (using zstd -d) and visualize it in codegen-dashboard to compare with Fig. 3.3.

mnist_infer_l3ir.png

Fig. 3.3 Visualizing l3ir_stripped.onnx (MNGraph)

Comparing with Fig. 3.2, we can see that l3ir_stripped.onnx essentially represents pfvm.onnx with additional custom operators.

List of custom operators added in this example:

  • MNCoreUpload / MNCoreDownload: Transfers an MNValue from LM to DRAM (Upload) or from DRAM to LM (Download)

  • MNCoreLayoutSwitch: Converts the layout of MNValue

  • Identity: Moves MNValue to the opposite LM (there are two types: LM0 and LM1)

  • MNCoreRefillPadding: Writes values (e.g., kZero, kInf) into padding areas within the layout

Additionally, MNGraph contains information about which operators to execute in what order. This information is consolidated in l3ir.txt within codegen_dir, and for this example it contains the following content:

Constant() -> (val_1_fx2onnx)
  out(0):val_1_fx2onnx onnx_type=Tensor(dtype=INT64 shape=2) num_lw=2 padded_shape=8 layout=PadLayout{(2)/((8_L1B:1); B@[PE,W,MAB,L2B])} layout_kind=MNCore dtype=Int gene=[] loc_kind=IMM loc=IMM)
MNCoreDownload(t) -> (t_Download_1)
   in(0):t onnx_type=Tensor(dtype=INT64 shape=125) num_lw=2 padded_shape=128 layout=PadLayout{(125)/((8_L2B:1, 8_L1B:1, 1:1, 2_W:1); B@[PE,MAB])} layout_kind=MNCore dtype=Int gene=[Nr] loc=DRAM addr=0)
  out(0):t_Download_1 onnx_type=Tensor(dtype=INT64 shape=125) num_lw=2 padded_shape=128 layout=PadLayout{(125)/((8_L2B:1, 8_L1B:1, 1:1, 2_W:1); B@[PE,MAB])} layout_kind=MNCore dtype=Int gene=[Nr] loc=LM0 addr=0)
MNCoreLayoutSwitch(t_Download_1) -> (t_LayoutSwitch_0)
   in(0):t_Download_1 onnx_type=Tensor(dtype=INT64 shape=125) num_lw=2 padded_shape=128 layout=PadLayout{(125)/((8_L2B:1, 8_L1B:1, 1:1, 2_W:1); B@[PE,MAB])} layout_kind=MNCore dtype=Int gene=[Nr] loc_kind=LM loc=LM0 addr=0)
  out(0):t_LayoutSwitch_0 onnx_type=Tensor(dtype=INT64 shape=125) num_lw=2 padded_shape=128 layout=PadLayout{(125)/((8_L2B:1, 8_L1B:1, 2:1); B@[PE,W,MAB])} layout_kind=MNCore dtype=Int gene=[Nr] pad_type=Dirty loc_kind=LM loc=LM0 addr=4)
MNCoreUpload(t_LayoutSwitch_0) -> (t_LayoutSwitch_0_Upload_0)
   in(0):t_LayoutSwitch_0 onnx_type=Tensor(dtype=INT64 shape=125) num_lw=2 padded_shape=128 layout=PadLayout{(125)/((8_L2B:1, 8_L1B:1, 2:1); B@[PE,W,MAB])} layout_kind=MNCore dtype=Int gene=[Nr] pad_type=Dirty loc=LM0 addr=4)
  out(0):t_LayoutSwitch_0_Upload_0 onnx_type=Tensor(dtype=INT64 shape=125) num_lw=2 padded_shape=128 layout=PadLayout{(125)/((8_L2B:1, 8_L1B:1, 2:1); B@[PE,W,MAB])} layout_kind=MNCore dtype=Int gene=[Nr] pad_type=Dirty loc=DRAM addr=526869888)
MNCoreDownload(x) -> (x_Download_0)
   in(0):x onnx_type=Tensor(dtype=FLOAT32 shape=125,1,32,32) num_lw=8 padded_shape=128,1,32,32 layout=PadLayout{(125,1,32,32)/((8_L2B:1, 8_L1B:1, 2:1), (), (16_MAB:1, 2:4), (2:2, 4_W:1, 4_PE:1))} layout_kind=MNCore dtype=Half gene=[N,,,] pad_type=Zero loc=DRAM addr=1024)
  out(0):x_Download_0 onnx_type=Tensor(dtype=FLOAT32 shape=125,1,32,32) num_lw=8 padded_shape=128,1,32,32 layout=PadLayout{(125,1,32,32)/((8_L2B:1, 8_L1B:1, 2:1), (), (16_MAB:1, 2:4), (2:2, 4_W:1, 4_PE:1))} layout_kind=MNCore dtype=Half gene=[N,,,] pad_type=Zero loc=LM0 addr=0)
Reshape(x_Download_0, val_1_fx2onnx) -> (view_fx2onnx)
   in(0):x_Download_0 onnx_type=Tensor(dtype=FLOAT32 shape=125,1,32,32) num_lw=8 padded_shape=128,1,32,32 layout=PadLayout{(125,1,32,32)/((8_L2B:1, 8_L1B:1, 2:1), (), (16_MAB:1, 2:4), (2:2, 4_W:1, 4_PE:1))} layout_kind=MNCore dtype=Half gene=[N,,,] pad_type=Zero loc_kind=LM loc=LM0 addr=0)
   in(1):val_1_fx2onnx onnx_type=Tensor(dtype=INT64 shape=2) num_lw=2 padded_shape=8 layout=PadLayout{(2)/((8_L1B:1); B@[PE,W,MAB,L2B])} layout_kind=MNCore dtype=Int gene=[] loc_kind=IMM loc=IMM)
  out(0):view_fx2onnx onnx_type=Tensor(dtype=FLOAT32 shape=125,1024) num_lw=8 padded_shape=128,1024 layout=PadLayout{(125,1024)/((8_L2B:1, 8_L1B:1, 2:1), (16_MAB:1, 4:2, 4_W:1, 4_PE:1))} layout_kind=MNCore dtype=Half gene=[N,C] pad_type=Zero loc_kind=LM loc=LM0 addr=0 parent=x_Download_0)
Gemm(view_fx2onnx, attr_0, attr_1, transB) -> (addmm_fx2onnx)
   in(0):view_fx2onnx onnx_type=Tensor(dtype=FLOAT32 shape=125,1024) num_lw=8 padded_shape=128,1024 layout=PadLayout{(125,1024)/((8_L2B:1, 8_L1B:1, 2:1), (16_MAB:1, 4:2, 4_W:1, 4_PE:1))} layout_kind=MNCore dtype=Half gene=[N,C] pad_type=Zero loc_kind=LM loc=LM0 addr=0 parent=x_Download_0)
   in(1):attr_0 onnx_type=Tensor(dtype=FLOAT32 shape=256,1024) num_lw=1024 padded_shape=256,1024 layout=PadLayout{(256,1024)/((16:64, 4_W:1, 4_PE:1), (16_MAB:1, 4:16, 4:1, 4:4); B@[L1B,L2B])} layout_kind=MNCore dtype=Half gene=[WC,WC] loc_kind=DRAM loc=DRAM addr=9216)
   in(2):attr_1 onnx_type=Tensor(dtype=FLOAT32 shape=256) num_lw=2 padded_shape=256 layout=PadLayout{(256)/((16_MAB:1, 2:1, 2_W:1, 4_PE:1); B@[L1B,L2B])} layout_kind=MNCore dtype=Float gene=[WC] loc_kind=DRAM loc=DRAM addr=25600)
  out(0):addmm_fx2onnx onnx_type=Tensor(dtype=FLOAT32 shape=125,256) num_lw=2 padded_shape=128,256 layout=PadLayout{(125,256)/((8_L2B:1, 8_L1B:1, 2:1), (16_MAB:1, 4_W:1, 4_PE:1))} layout_kind=MNCore dtype=Half gene=[N,C] pad_type=Dirty loc_kind=LM loc=LM1 addr=0)
...

The full contents of l3ir.txt are too long to reproduce here, so we explain each operator based on the snippet above. In MNGraph, each operator is called an MNNode, and its inputs and outputs are called MNValues. For example, in the notation Constant() -> (val_1_fx2onnx), Constant is the MNNode and val_1_fx2onnx is the MNValue. The in(...): and out(...): lines give detailed descriptions of the corresponding MNValues.

  1. Constant() -> (val_1_fx2onnx): Creates a constant tensor to be input to the Reshape operator. Since Constant has no dependencies, it is typically scheduled first.

  2. MNCoreDownload(t) -> (t_Download_1): Moves input tensor t from DRAM to LM

  3. MNCoreLayoutSwitch(t_Download_1) -> (t_LayoutSwitch_0): Changes the layout of tensor t

  4. MNCoreUpload(t_LayoutSwitch_0) -> (t_LayoutSwitch_0_Upload_0): Moves the reformatted tensor t back to DRAM

  5. MNCoreDownload(x) -> (x_Download_0): Moves input tensor x from DRAM to LM

  6. Reshape(x_Download_0, val_1_fx2onnx) -> (view_fx2onnx): Reshapes tensor x

  7. Gemm(view_fx2onnx, attr_0, attr_1, transB) -> (addmm_fx2onnx): Performs matrix multiplication using the newly reshaped tensor x as input
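
For larger graphs, it can help to summarize l3ir.txt programmatically. The sketch below is a minimal parser that assumes only the line shapes visible in the snippet above (a node header of the form Op(inputs) -> (outputs), followed by indented in(...)/out(...) lines containing loc=...). The actual file format may contain constructs not handled here, and SAMPLE is abridged from the snippet with most fields trimmed.

```python
import re
from typing import Dict, List

# Node header lines look like: "Gemm(a, b) -> (out)".
NODE_RE = re.compile(r"^(\w+)\((.*?)\)\s*->\s*\((.*?)\)\s*$")
# Value lines look like: "  out(0):name key=value ...".
VALUE_RE = re.compile(r"^\s+(in|out)\((\d+)\):(\S+)\s")
# "loc=" only; "loc_kind=" does not match because of the underscore.
LOC_RE = re.compile(r"\bloc=(\w+)")


def split_names(text: str) -> List[str]:
    return [name.strip() for name in text.split(",") if name.strip()]


def parse_l3ir(text: str) -> List[Dict]:
    """Return one dict per MNNode with its op, inputs, outputs and value locations."""
    nodes: List[Dict] = []
    for line in text.splitlines():
        node = NODE_RE.match(line)
        if node:
            nodes.append({"op": node.group(1),
                          "inputs": split_names(node.group(2)),
                          "outputs": split_names(node.group(3)),
                          "locs": {}})
            continue
        value = VALUE_RE.match(line)
        loc = LOC_RE.search(line)
        if value and loc and nodes:
            nodes[-1]["locs"][value.group(3)] = loc.group(1)
    return nodes


# Abridged from the snippet above; most fields are trimmed for brevity.
SAMPLE = """\
MNCoreDownload(t) -> (t_Download_1)
   in(0):t onnx_type=Tensor(dtype=INT64 shape=125) loc=DRAM addr=0)
  out(0):t_Download_1 onnx_type=Tensor(dtype=INT64 shape=125) loc=LM0 addr=0)
Gemm(view_fx2onnx, attr_0, attr_1, transB) -> (addmm_fx2onnx)
  out(0):addmm_fx2onnx onnx_type=Tensor(dtype=FLOAT32 shape=125,256) loc_kind=LM loc=LM1 addr=0)
"""

for n in parse_l3ir(SAMPLE):
    print(n["op"], n["inputs"], "->", n["outputs"], n["locs"])
```

A summary like this makes it easier to see where each MNValue lives (DRAM, LM0, LM1) as the schedule progresses.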

Beyond what can be explained here, the l3ir.txt file is capable of representing most of MNGraph’s information. As the number of graph nodes increases, direct visualization of ONNX becomes increasingly challenging, making the MNGraph log file crucial for verifying its correctness.

Once you’ve confirmed that execution with mncore2:auto works correctly, you can proceed to explore more advanced scheduling options. The following JSON example demonstrates how to specify the --scheduler compilation option:

Listing 3.4 Example of --scheduler configuration
{
    "args": [
        "--scheduler=spill_opt"
    ]
}

After applying this option and rerunning, you can check its effect by visualizing l3ir_stripped.onnx again. The most direct metric, however, is the vsm_cycles value in the report.json file under codegen_dir, which is the total number of cycles required to execute the entire VSM. Dividing vsm_cycles by core_freq (expressed in MHz) from the same report.json yields the execution time in microseconds.

In our test, with core_freq at 750.0 MHz, the default scheduler (reuse_consecutive) took 7500 cycles (0.010 msec, 6.63 TFLOPS = 1.69%), while the spill_opt scheduler reduced this to 6932 cycles (0.009 msec, 7.17 TFLOPS = 1.82%). Because MNCoreClassifier inference has a relatively small flops value of 66,273,875 according to report.json, it does not come close to fully using MN-Core 2’s performance. In practical scenarios, however, we can expect significantly greater improvements.
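
The arithmetic behind these figures can be sketched as follows. The function names are illustrative only; the inputs are the vsm_cycles, core_freq and flops values quoted above, passed in directly, because the layout of report.json is not described here.

```python
def vsm_time_usec(vsm_cycles: int, core_freq_mhz: float) -> float:
    """Cycles divided by a frequency in MHz gives microseconds."""
    return vsm_cycles / core_freq_mhz


def achieved_tflops(flops: int, time_usec: float) -> float:
    """Floating-point operations per second, expressed in TFLOPS."""
    return flops / (time_usec * 1e-6) / 1e12


FLOPS = 66_273_875
CORE_FREQ_MHZ = 750.0

for scheduler, cycles in [("reuse_consecutive", 7500), ("spill_opt", 6932)]:
    usec = vsm_time_usec(cycles, CORE_FREQ_MHZ)
    tflops = achieved_tflops(FLOPS, usec)
    print(f"{scheduler}: {usec / 1000:.3f} msec, {tflops:.2f} TFLOPS")
```

For example, 7500 cycles at 750.0 MHz is 10 microseconds, which matches the 0.010 msec quoted above.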

For details about optimization settings such as schedulers, please refer to Compile Options and Preset Options.

3.1.2. Training Process

Listing 3.5 /opt/pfn/pfcomp/codegen/MLSDK/examples/mnist_train.py
def main(outdir: str, option_json_path: Optional[Path], device_str: str) -> None:
    batch_size = 64
    eval_batch_size = 125

    train_loader, _ = mnist_loaders(batch_size, eval_batch_size)

    model_with_loss_fn = MNCoreClassifier()
    model_with_loss_fn.train()

    optimizer = torch.optim.SGD(model_with_loss_fn.parameters(), 0.1, 0.9, 0.0)

    def train_step(inp: Mapping[str, torch.Tensor]) -> Mapping[str, torch.Tensor]:
        x = inp["x"]
        t = inp["t"]
        optimizer.zero_grad()
        output = model_with_loss_fn(x, t)
        loss = output["loss"]
        loss.backward()
        optimizer.step()
        return {"loss": loss}

    for epoch in range(10):
        loss = 0.0
        for i, sample in enumerate(train_loader):
            curr_loss = train_step(sample)["loss"]
            loss += (curr_loss - loss) / (i + 1)
            if i % 100 == 0:
                print(f"epoch {epoch}, iter {i:4}, loss {loss}")
        print(f"epoch {epoch}, loss {loss}")

    os.makedirs(outdir, exist_ok=True)
    torch.save(
        {
            "model_state_dict": model_with_loss_fn.state_dict(),
            "optim_state_dict": optimizer.state_dict(),
        },
        storage.path(outdir) / "checkpoint.pt",
    )
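
The training loop above tracks the epoch loss with an incremental mean, loss += (curr_loss - loss) / (i + 1), instead of summing and dividing at the end. As a small sketch, using plain floats in place of tensors and made-up loss values, the update can be checked against the arithmetic mean:

```python
def running_mean(values):
    """Incremental mean, using the same update rule as the training loop."""
    mean = 0.0
    for i, value in enumerate(values):
        mean += (value - mean) / (i + 1)
    return mean


# Illustrative values only; these are not real training output.
losses = [2.30, 1.75, 1.20, 0.95]
assert abs(running_mean(losses) - sum(losses) / len(losses)) < 1e-12
```

This is why the value printed every 100 iterations is the mean loss over the epoch so far, not the loss of the latest batch.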

Verifying mnist_train.py Operation

First, verify that the training process runs correctly in PyTorch by referring to Example: Training MNIST. The training results will be saved in <outdir>/checkpoint.pt, allowing you to verify the results using mnist_infer.py.

If the Accuracy value exceeds 0.95, the operation verification is complete.

Testing on pfvm:cpu

Next, modify mnist.py to compile and call the train_step function, similar to Inference Process. Specify --device as pfvm:cpu and also configure the compilation option --out_onnx.

If the execution completes successfully, you should find the Compiled ONNX file in the codegen_dir directory. Fig. 3.4 shows its visualization. The relationship with Exported ONNX was explained above, so we omit it here.

mnist_train_pfvm.png

Fig. 3.4 Visualizing pfvm.onnx (Compiled ONNX)

Compared to standalone inference processing that only performs forward passes, the graph becomes significantly larger when backward propagation and optimizer processing are added. Additionally, backward propagation and optimizer processing often aren’t included in the original program implementation, making it difficult to properly map each node. Therefore, we recommend first verifying that the forward processing works correctly before adding backward and optimizer processing.

Even when training appears to complete normally and the loss decreases, the resulting checkpoint may show that the Accuracy has not improved sufficiently. A likely cause is that torch.Tensor objects held by the model or optimizer are not registered in the Context, so changes made on the device are not reflected even after calling Context.synchronize. This mechanism is explained in Registering Parameters with the Context and Registering Optimizer Buffers with the Context.

Verification with mncore2:auto

Finally, let’s verify that the train_step function runs on MN-Core 2 by specifying mncore2:auto via the --device option. The final Accuracy may differ from the value obtained with pfvm:cpu because the implementations of individual operators differ. As long as the result stays above the acceptance threshold, this difference is not a problem.

While the MNCoreClassifier itself isn’t particularly large as a model, the number of operators within the MNGraph easily exceeds 100, making it difficult to visualize the ONNX representation and properly map it to the original implementation. For examining the contents of the MNGraph, we recommend checking the l3ir.txt file instead.

3.2. Advanced Topics

  • Advanced Features describes features not covered in add.py and mnist.py

  • Sample programs that use MLSDK can be found in the Gallery