2. Machine Learning Tutorial
This tutorial introduces MLSDK’s machine learning features, using a full-fledged PyTorch program as an example.
2.1. Prerequisites
You are expected to have completed Getting Started and to have set up all necessary environments. Additionally, since this tutorial downloads the MNIST dataset (https://docs.pytorch.org/vision/main/generated/torchvision.datasets.MNIST.html) within the sample program, an internet connection is required. For environments without internet access (such as DevKit), the program can be modified as follows to reference a dataset located at a path of your choice:
train_dataset = datasets.MNIST(
- "/tmp", train=True, transform=transform, download=True
+ "<path-to-MNIST>/", train=True, transform=transform, download=False
)
...
eval_dataset = datasets.MNIST(
- "/tmp",
+ "<path-to-MNIST>/",
train=False,
transform=transform,
- download=True,
+ download=False,
)
2.2. Running the Sample Program
First, let’s run the mnist.py program similarly to how we handled add.py in Getting Started.
For running instructions, please refer to Example: MNIST on MN-Core 2.
If you are running mnist.py in a different environment or notebook from add.py, be sure to load both codegen_preload.sh and codegen_pythonpath.sh.
If the output ends with the following results, it indicates successful execution.
Note that the Accuracy value may vary depending on the MLSDK version; the run is considered acceptable if it exceeds 0.95.
Correct: 9609 / 10000. Accuracy: 0.9609
Additionally, you should see training logs similar to the following appearing intermittently in the output.
epoch 0, iter 0, loss 2.3125
epoch 0, iter 100, loss 0.6226431969368814
...
epoch 9, iter 900, loss 0.10909322893182918
epoch 9, loss 0.11064393848594248
The exact values may differ from run to run; the logs are healthy as long as the loss decreases gradually.
Since the mnist.py program performs both model training and inference sequentially, the log output follows roughly the following sequence:
1. Downloading the MNIST dataset (only on the first run)
2. Compiling the processing for one training iteration (train_step)
3. Training for 10 epochs
4. Compiling the processing for one inference iteration (eval_step)
5. Classifying (inference) the 10,000 test cases
2.3. Sample Program Explanation
2.3.1. Specifying drop_last
22 train_loader = torch.utils.data.DataLoader(
23 train_dataset,
24 batch_size=batch_size,
25 shuffle=True,
26 drop_last=True,
27 collate_fn=list_to_dict,
28 )
When the drop_last flag is specified, the DataLoader discards the final incomplete batch that remains after dividing the dataset into batch_size-sized chunks.
Furthermore, when using MN-Core 2, you must specify the drop_last flag when creating the train_loader.
Context.compile treats the entire processing pipeline as a static computational graph, so it compiles based on the assumed batch size from the sample input.
Therefore, passing an input smaller than batch_size to the CompiledFunction may result in a dimensionality-mismatch error. Specifying drop_last prevents this.
Note
Dimensions with variable sizes—such as batch axis sizes—are referred to as Dynamic shape dimensions.
Currently, the drop_last specification is required because MLSDK does not support Dynamic shape dimensions, but this will become unnecessary once support is completed in the future.
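The effect of drop_last can be sketched without torch. The helper below is purely illustrative (batches is not an MLSDK or PyTorch function); it mimics how a DataLoader splits a dataset into batches, using the 60,000 MNIST training samples and a hypothetical batch_size of 64:

```python
def batches(n_samples: int, batch_size: int, drop_last: bool) -> list[int]:
    """Return the batch sizes a DataLoader would produce (sketch of drop_last)."""
    full, rem = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if rem and not drop_last:
        sizes.append(rem)
    return sizes

# 60,000 MNIST training samples with a hypothetical batch_size of 64:
print(batches(60_000, 64, drop_last=True)[-1])   # 64: every batch has the compiled shape
print(batches(60_000, 64, drop_last=False)[-1])  # 32: a final partial batch would
                                                 # mismatch the compiled static graph
```

With drop_last=True, every batch matches the shape baked into the compiled graph; without it, the last batch of 32 samples would trigger the dimensionality mismatch described above.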
2.3.2. MNCoreClassifier
MNCoreClassifier implements the training model as a multilayer perceptron.
The parameters described below refer to the weight and bias tensors corresponding to torch.nn.Linear.
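The tutorial does not reproduce the class body here, but a minimal sketch of such a model-with-loss module might look as follows. The layer names linear1/linear2 and the output keys "y"/"loss" follow the rest of this tutorial; the layer sizes and the class name MLPClassifierSketch are assumptions for illustration, not the mnist.py code:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer MLP with an attached loss, in the spirit of
# MNCoreClassifier. Layer sizes are assumed; linear1/linear2 and the
# "y"/"loss" keys match the names used elsewhere in this tutorial.
class MLPClassifierSketch(nn.Module):
    def __init__(self, in_dim: int = 784, hidden: int = 512, n_classes: int = 10):
        super().__init__()
        self.linear1 = nn.Linear(in_dim, hidden)
        self.linear2 = nn.Linear(hidden, n_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> dict:
        y = self.linear2(torch.relu(self.linear1(x)))
        return {"y": y, "loss": self.loss_fn(y, t)}

model = MLPClassifierSketch()
out = model(torch.randn(32, 784), torch.randint(0, 10, (32,)))
print(out["y"].shape)  # torch.Size([32, 10])
```

Returning a dict of tensors keeps the module compatible with the Mapping[str, torch.Tensor] convention used by train_step and eval_step below.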
2.3.3. Registering Parameters with the Context
36 set_tensor_name_in_module(model_with_loss_fn, "model_with_loss_fn")
37 for p in model_with_loss_fn.parameters():
38 context.register_param(p)
mlsdk.set_tensor_name_in_module() assigns names to each tensor in the model that the Context uses for identification.
These names are referenced internally by mlsdk.Context.register_param() or mlsdk.Context.register_buffer(), so set_tensor_name_in_module must be called before invoking these APIs.
In this example, the parameters are assigned the following names (these can also be retrieved with mlsdk.get_tensor_name()):
model_with_loss_fn@linear1/weight
model_with_loss_fn@linear1/bias
model_with_loss_fn@linear2/weight
model_with_loss_fn@linear2/bias
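The naming scheme can be illustrated with a small standalone helper. tensor_names below is hypothetical (MLSDK derives the real names internally via set_tensor_name_in_module); it only shows how a "root@submodule/leaf" name is built from PyTorch's dotted parameter paths:

```python
def tensor_names(root_name: str, named_params: list[str]) -> list[str]:
    """Build MLSDK-style names "<root>@<path>/<leaf>" from dotted parameter paths."""
    names = []
    for dotted in named_params:
        path, leaf = dotted.rsplit(".", 1)          # "linear1.weight" -> ("linear1", "weight")
        names.append(f"{root_name}@{path.replace('.', '/')}/{leaf}")
    return names

names = tensor_names(
    "model_with_loss_fn",
    ["linear1.weight", "linear1.bias", "linear2.weight", "linear2.bias"],
)
print(names[0])  # model_with_loss_fn@linear1/weight
```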
Note
The names are assigned by setting the FX2ONNX_EXPORTER_TENSOR_NAME_ATTR attribute on each parameter tensor using setattr.
36 set_tensor_name_in_module(model_with_loss_fn, "model_with_loss_fn")
37 for p in model_with_loss_fn.parameters():
38 context.register_param(p)
register_param should be applied to parameters that will be updated during training.
If register_param is not called, training will appear to proceed normally, but only the parameters on the device are being updated; the parameters on the host are not updated until synchronization occurs via Context.synchronize.
Therefore, registering parameters with the Context is essential for any model training program.
Similar to parameters, buffers also require register_buffer.
When working with multiple models within the same Context, each model requires both set_tensor_name_in_module and register_param as well as register_buffer.
An example demonstrating the handling of multiple models can be found in Example: Inference With Multiple Models.
2.3.4. Registering Optimizer Buffers with the Context
40 optimizer = MNCoreSGD(model_with_loss_fn.parameters(), 0.1, 0.9, 0.0)
41 set_buffer_name_in_optimizer(optimizer, "optimizer")
42 context.register_optimizer_buffers(optimizer)
mlsdk.MNCoreSGD is a reimplementation of torch.optim.SGD adapted for MLSDK.
While it shares basic options like learning rate (lr), full compatibility cannot be guaranteed.
Although this example uses SGD, other available optimizers include mlsdk.MNCoreAdam and mlsdk.MNCoreAdamW.
40 optimizer = MNCoreSGD(model_with_loss_fn.parameters(), 0.1, 0.9, 0.0)
41 set_buffer_name_in_optimizer(optimizer, "optimizer")
42 context.register_optimizer_buffers(optimizer)
mlsdk.set_buffer_name_in_optimizer() assigns names to each tensor representing the optimizer’s internal state, enabling the Context to identify them.
These names are referenced internally by mlsdk.Context.register_optimizer_buffers().
By registering these buffers, the optimizer’s internal state_dict gets updated during Context.synchronize, allowing you to include it in training checkpoints.
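The internal state being registered here is, for SGD, the momentum buffer. Assuming the positional arguments of MNCoreSGD(params, 0.1, 0.9, 0.0) map to (lr, momentum, weight_decay) as in torch.optim.SGD, a single update step and the resulting optimizer state can be sketched with the stock PyTorch optimizer:

```python
import torch

# Stand-in for MNCoreSGD(params, 0.1, 0.9, 0.0), assuming the positional
# arguments are (lr, momentum, weight_decay) as in torch.optim.SGD.
w = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9, weight_decay=0.0)

loss = (w ** 2).sum()   # gradient of w^2 is 2w, i.e. 2.0 for each element
loss.backward()
opt.step()

print(w.data)  # each element: 1.0 - 0.1 * 2.0 = 0.8
# opt.state_dict()["state"] now holds the momentum buffer: the kind of
# internal state that register_optimizer_buffers keeps in sync.
```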
2.3.5. Functionizing Training and Inference Operations
44 def train_step(inp: Mapping[str, torch.Tensor]) -> Mapping[str, torch.Tensor]:
45 x = inp["x"]
46 t = inp["t"]
47 optimizer.zero_grad()
48 output = model_with_loss_fn(x, t)
49 loss = output["loss"]
50 loss.backward()
51 optimizer.step()
52 return {"loss": loss}
The core training loop, the Forward → Backward → Optimize operations (lines 47-51), is encapsulated as a function that can be passed to Context.compile().
While the arguments and return values must be converted to Mapping[str, torch.Tensor], organizing the operations into a function generally requires no changes to the processing logic itself.
87 def eval_step(inp: Mapping[str, torch.Tensor]) -> Mapping[str, torch.Tensor]:
88 x = inp["x"]
89 t = inp["t"]
90 output = model_with_loss_fn(x, t)
91 y = output["y"]
92 _, predicted = torch.max(y, 1)
93 correct = (predicted == t).sum()
94 return {"correct": correct}
For inference as well, the Forward → Max + Sum operations are encapsulated in a function, just like train_step.
This way, not only the model and optimizer operations but also the post-processing of the results is consolidated and executed on MN-Core 2.
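The post-processing in eval_step can be run in isolation with plain PyTorch: torch.max over the class axis yields the predicted labels, and summing the comparison counts the correct cases (the logits and targets below are made-up values for illustration):

```python
import torch

# Logits for 3 samples over 2 classes, and their true labels.
y = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
t = torch.tensor([1, 0, 0])

_, predicted = torch.max(y, 1)     # argmax over the class axis
correct = (predicted == t).sum()   # count of matching predictions
print(int(correct))  # 2
```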
2.3.6. Specifying Compilation Options
54 compile_options = {}
55 if option_json_path is not None:
56 compile_options["option_json"] = str(option_json_path)
The codegen backend of MLSDK supports numerous environment variables and command-line options that can be specified during compilation. Among these, Preset Options provides optimized combinations of these settings.
Each Preset Option is stored in JSON format under /opt/pfn/pfcomp/codegen/preset_options/, ranging from debugging configurations in debug.json to advanced optimization settings in O4.json.
Important note: Some optimization options may alter the operation order, which could potentially change the final computed results.
mnist.py allows specifying Preset Options as follows:
$ cd /opt/pfn/pfcomp/codegen/examples/
$ ./exec_with_env.sh python3 mnist.py --option_json /opt/pfn/pfcomp/codegen/preset_options/O1.json
To examine the actual performance improvements, refer to Codegen Dashboard. For a complete list of compilation options, see Compile Options.
2.3.7. Synchronizing Training Results
66 for epoch in range(10):
67 loss = 0.0
68 for i, sample in enumerate(train_loader):
69 curr_loss = compiled_train_step(sample)["loss"].item()
70 loss += (curr_loss - loss) / (i + 1)
71 if i % 100 == 0:
72 print(f"epoch {epoch}, iter {i:4}, loss {loss}")
73 print(f"epoch {epoch}, loss {loss}")
74
75 context.synchronize()
In this example, the Context.synchronize method is called after completing 10 epochs of training.
All parameters registered in the Context are synchronized at this point, so saving the state_dict should occur after synchronization.
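The update on line 70, loss += (curr_loss - loss) / (i + 1), is an incremental running mean: after iteration i, loss equals the average of all curr_loss values seen so far, without storing them. A standalone sketch with made-up loss values:

```python
# Incremental running mean, as used in the epoch loop above.
vals = [2.0, 1.0, 0.5, 0.5]
loss = 0.0
for i, curr in enumerate(vals):
    loss += (curr - loss) / (i + 1)
print(round(loss, 6))  # 1.0, i.e. sum(vals) / len(vals)
```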
2.3.8. Saving Training Results
77 torch.save(
78 {
79 "model_state_dict": model_with_loss_fn.state_dict(),
80 "optim_state_dict": optimizer.state_dict(),
81 },
82 storage.path(outdir) / "checkpoint.pt",
83 )
The trained model and optimizer’s state_dict are saved using torch.save here.
The default save location is /tmp/mlsdk_mnist/checkpoint.pt unless an explicit --outdir is specified.
The saved content can be restored using torch.load, enabling the utilization of training results across different devices.
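A round-trip of this checkpoint format can be sketched with a stand-in linear model and stock SGD optimizer in place of model_with_loss_fn and MNCoreSGD (the dict keys follow the snippet above; the temporary path is illustrative):

```python
import pathlib
import tempfile
import torch

# Stand-ins for model_with_loss_fn and the MLSDK optimizer.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

path = pathlib.Path(tempfile.mkdtemp()) / "checkpoint.pt"
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optim_state_dict": optimizer.state_dict(),
    },
    path,
)

# Restoring, e.g. in another process or on another device:
checkpoint = torch.load(path)
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model_state_dict"])
print(torch.equal(restored.weight, model.weight))  # True
```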
2.4. Advanced Topics
Migration Tutorial provides guidance on adapting your program to run on MN-Core series devices.
Advanced Features covers features not covered in add.py and mnist.py.
Examples demonstrating MLSDK usage are available in the Gallery.