rlaunch --cpu=4 --memory=15000 --gpu=1 -- python3 verify_correctness.py
rlaunch --cpu=4 --memory=15000 --gpu=1 -- python3 run_resnet50_perf.py
Compare the results with the reference to verify the performance change.
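As a rough illustration of that comparison, the sketch below checks a measured wall time per iteration against a stored reference value and flags regressions beyond a tolerance. The file name reference_perf.json, its format, and the 5% threshold are assumptions for illustration only; they are not produced or used by the test scripts.

# Hypothetical regression check: compare measured wall time per iteration
# against a stored reference value (file name and format are assumptions).
import json

def check_perf(measured_ms, reference_file="reference_perf.json", tolerance=0.05):
    with open(reference_file) as f:
        reference_ms = json.load(f)["wall_time_per_iter_ms"]
    regression = (measured_ms - reference_ms) / reference_ms
    print(f"measured {measured_ms:.1f} ms, reference {reference_ms:.1f} ms, delta {regression:+.1%}")
    return regression <= tolerance

if __name__ == "__main__":
    ok = check_perf(283.0)  # e.g. the 283 ms per iter reported in the sample output below
    print("OK" if ok else "performance regression detected")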
cd python_module/megengine/examples/cifar10/resnet_example
rlaunch --cpu=4 --memory=15000 --gpu=1 -- MGE_DISABLE_TRACE=1 python3 main.py --mode train --backend megengine-dynamic
Be sure to run a few epochs to verify that the CPU/GPU memory usage is reasonable and that the result tends to converge. A complete run takes around 2 hours.
A ResNet18 model pre-trained on the CIFAR-10 dataset is used.
The test set contains
Sample output:
Running fwd static ...
Success
Running fwd dynamic ...
Success
Running train static ...
Success
Running train dynamic ...
Failed!!!
import megengine operator
[INFO] load /home/zhangfan/.local/lib/python3.6/site-packages/megengine/examples/cifar10/resnet_example/checkpoint/pytorch_init.pth done
calculated loss: [2.3731833, 34.4626]
expect: [ 2.3731833 34.460594 ]
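The Success/Failed verdicts above boil down to an element-wise tolerance check between the calculated and expected losses. A minimal sketch of such a check is shown below; the rtol/atol values are illustrative and are not necessarily the ones used by verify_correctness.py.

# Minimal sketch of a loss comparison with tolerances; the tolerance values
# here are illustrative, not taken from verify_correctness.py.
import numpy as np

calculated = np.array([2.3731833, 34.4626])   # values from the sample output above
expected = np.array([2.3731833, 34.460594])

if np.allclose(calculated, expected, rtol=1e-5, atol=1e-6):
    print("Success")
else:
    print("Failed!!!")
    print("calculated loss:", calculated)
    print("expect:", expected)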
The test cases run ResNet 50 training with batch size = 64.
Run python3 resnet50_perf.py --help for valid options.
Example commands:
python3 run_resnet50_perf.py
rlaunch --cpu=16 --memory=100384 --gpu=8 -- python3 run_resnet50_perf.py
**************************************
Run ResNet 50 performance test with batch size = 64
**************************************
Run static graph with default opt level
Finish with GPU Usage 6710MiB
Wall time per iter 283 ms
Run status: finished
**************************************
Run static graph with conv fastrun
Finish with GPU Usage 6540MiB
Wall time per iter 265 ms
Run status: finished
**************************************
Run static graph with conv fastrun and JIT
Finish with GPU Usage 6540MiB
Wall time per iter 267 ms
Run status: finished
**************************************
Run static graph with JIT, conv fastrun and without running step
Finish with GPU Usage 6540MiB
Wall time per iter 223 ms
Run status: finished
**************************************
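The "Wall time per iter" figures above follow the usual measurement pattern: run a few warm-up iterations (so that kernel selection and graph compilation are excluded), then time a fixed number of iterations and average. A generic sketch of that pattern is given below; train_step and the iteration counts are placeholders, not the actual internals of run_resnet50_perf.py.

# Generic wall-time-per-iteration measurement: warm up, then time and average.
# `train_step` is a placeholder for one training iteration (fwd + bwd + update).
import time

def measure_wall_time(train_step, warmup_iters=10, timed_iters=50):
    for _ in range(warmup_iters):
        train_step()
    start = time.perf_counter()
    for _ in range(timed_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return elapsed / timed_iters * 1000.0  # milliseconds per iteration

if __name__ == "__main__":
    ms = measure_wall_time(lambda: time.sleep(0.01))  # dummy step for demonstration
    print(f"Wall time per iter {ms:.0f} ms")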
You can pass --run-debug-tool to the script run_resnet50_perf.py; opr-level profiling and valgrind will then be invoked.
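For example (the flag spelling is taken from this document; consult --help for the authoritative option list):
python3 run_resnet50_perf.py --run-debug-tool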
Please compare the same job with and without the profiler. The timing statistics reported by the profiler do not include the profiler's own overhead.
Refer to the main function in megengine.utils.profile_analyze.
The valgrind massif tool can also be used. The script prints a memory usage summary on screen, such as:
GB
1.836^ #
| @@#::::::@:::
| @@@ #::::::@:::
| ::::::::::::@:::::::::@:@@@ #::::::@:::
| ::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| ::@@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @:@@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| ::::@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :::: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@:@@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@ :@@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
0 +----------------------------------------------------------------------->Gi
0 19.39
You can change the "--run-iter" value to adjust the number of iterations to profile.
The detailed profiling is printed to massif.out.ms_print.
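The script drives valgrind internally; if you want to reproduce the massif run manually, the usual valgrind workflow is roughly (the exact arguments passed to resnet50_perf.py by the automated run are defined in run_resnet50_perf.py):
valgrind --tool=massif --massif-out-file=massif.out python3 resnet50_perf.py
ms_print massif.out > massif.out.ms_print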
The dumped profiling file prof.json can be parsed by megengine/utils/profile_analyze.py.
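If you want to look at prof.json directly rather than through profile_analyze.py, it is presumably plain JSON; the exploratory sketch below only prints the top-level structure and deliberately assumes nothing about the schema.

# Exploratory look at the dumped profile. The schema is defined by the MegEngine
# profiler, so this only summarizes the top-level structure instead of assuming fields.
import json

with open("prof.json") as f:
    prof = json.load(f)

if isinstance(prof, dict):
    for key, value in prof.items():
        summary = f"{len(value)} entries" if isinstance(value, (list, dict)) else repr(value)
        print(f"{key}: {type(value).__name__} -> {summary}")
else:
    print(f"top-level {type(prof).__name__} with {len(prof)} entries")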
The following information is printed from the profiler:
----------------- --------
total device time 0.318062
total host time 0.275643
----------------- --------
╒════════════════════╤══════════════╤═══════════════════════════╤═══════════════╤═════════╤══════════╤═════════════╤═════════════╤══════════════╕
│ device self time │ cumulative │ operator info │ computation │ FLOPS │ memory │ bandwidth │ in_shapes │ out_shapes │
╞════════════════════╪══════════════╪═══════════════════════════╪═══════════════╪═════════╪══════════╪═════════════╪═════════════╪══════════════╡
│ #0 │ 0.114 │ Elemwise │ 6.53 │ 57.40 │ 51.63 │ 454.02 │ None │ None │
│ 0.114 │ 35.8% │ 1481 │ GFLO │ GFLOPS │ GiB │ GiB/s │ │ │
│ 35.8% │ │ N/A │ │ │ │ │ │ │
├────────────────────┼──────────────┼───────────────────────────┼───────────────┼─────────┼──────────┼─────────────┼─────────────┼──────────────┤
│ #1 │ 0.176 │ ConvolutionBackwardFilter │ 523.15 │ 8.35 │ 5.28 │ 84.24 │ None │ None │
│ 0.0627 │ 55.5% │ 53 │ GFLO │ TFLOPS │ GiB │ GiB/s │ │ │
│ 19.7% │ │ N/A │ │ │ │ │ │ │
├────────────────────┼──────────────┼───────────────────────────┼───────────────┼─────────┼──────────┼─────────────┼─────────────┼──────────────┤
│ #2 │ 0.221 │ ConvolutionBackwardData │ 508.05 │ 11.31 │ 5.05 │ 112.42 │ None │ None │
│ 0.0449 │ 69.6% │ 52 │ GFLO │ TFLOPS │ GiB │ GiB/s │ │ │
│ 14.1% │ │ N/A │ │ │ │ │ │ │
├────────────────────┼──────────────┼───────────────────────────┼───────────────┼─────────┼──────────┼─────────────┼─────────────┼──────────────┤
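The percentages are relative to the total device time reported above: for example, the Elemwise self time of 0.114 is 0.114 / 0.318062 ≈ 35.8% of the total device time, and the cumulative column adds up self times in rank order (0.114 + 0.0627 ≈ 0.176, i.e. 55.5%; adding 0.0449 gives ≈ 0.221, i.e. 69.6%).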
Please read megengine/utils/profile_analyze.py for more usage details.