rlaunch --cpu=4 --memory=15000 --gpu=1 -- python3 verify_correctness.py
rlaunch --cpu=4 --memory=15000 --gpu=1 -- python3 run_resnet50_perf.py
Compare the results with the reference to verify the performance change.
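As a rough illustration of that comparison, the sketch below checks a measured wall time per iteration against a stored reference value and flags regressions beyond a tolerance. The file name reference_perf.json, its format, and the 5% threshold are assumptions for illustration only; they are not produced or used by the test scripts.

# Hypothetical regression check: compare measured wall time per iteration
# against a stored reference value (file name and format are assumptions).
import json

def check_perf(measured_ms, reference_file="reference_perf.json", tolerance=0.05):
    with open(reference_file) as f:
        reference_ms = json.load(f)["wall_time_per_iter_ms"]
    regression = (measured_ms - reference_ms) / reference_ms
    print(f"measured {measured_ms:.1f} ms, reference {reference_ms:.1f} ms, delta {regression:+.1%}")
    return regression <= tolerance

if __name__ == "__main__":
    ok = check_perf(283.0)  # e.g. the 283 ms per iter reported in the sample output below
    print("OK" if ok else "performance regression detected")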
cd python_module/megengine/examples/cifar10/resnet_example
rlaunch --cpu=4 --memory=15000 --gpu=1 -- MGE_DISABLE_TRACE=1 python3 main.py --mode train --backend megengine-dynamic
Be sure to run a few epochs to verify that the CPU/GPU memory usage is reasonable and that the result tends to converge. A complete run takes around 2 hours.
A ResNet18 model pre-trained on the CIFAR-10 dataset is used.
The test set contains
Sample output:
Running fwd static ...
Success
Running fwd dynamic ...
Success
Running train static ...
Success
Running train dynamic ...
Failed!!!
import megengine operator
[INFO] load /home/zhangfan/.local/lib/python3.6/site-packages/megengine/examples/cifar10/resnet_example/checkpoint/pytorch_init.pth done
calculated loss: [2.3731833, 34.4626]
expect: [ 2.3731833 34.460594 ]
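The Success/Failed verdicts above boil down to an element-wise tolerance check between the calculated and expected losses. A minimal sketch of such a check is shown below; the rtol/atol values are illustrative and are not necessarily the ones used by verify_correctness.py.

# Minimal sketch of a loss comparison with tolerances; the tolerance values
# here are illustrative, not taken from verify_correctness.py.
import numpy as np

calculated = np.array([2.3731833, 34.4626])   # values from the sample output above
expected = np.array([2.3731833, 34.460594])

if np.allclose(calculated, expected, rtol=1e-5, atol=1e-6):
    print("Success")
else:
    print("Failed!!!")
    print("calculated loss:", calculated)
    print("expect:", expected)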
The test cases run ResNet 50 training with batch size = 64.
Run python3 resnet50_perf.py --help for valid options.
Example commands:
python3 run_resnet50_perf.py
rlaunch --cpu=16 --memory=100384 --gpu=8 -- python3 run_resnet50_perf.py
**************************************
Run ResNet 50 performance test with batch size = 64
**************************************
Run static graph with default opt level
Finish with GPU Usage 6710MiB
Wall time per iter 283 ms
Run status: finished
**************************************
Run static graph with conv fastrun
Finish with GPU Usage 6540MiB
Wall time per iter 265 ms
Run status: finished
**************************************
Run static graph with conv fastrun and JIT
Finish with GPU Usage 6540MiB
Wall time per iter 267 ms
Run status: finished
**************************************
Run static graph with JIT, conv fastrun and without running step
Finish with GPU Usage 6540MiB
Wall time per iter 223 ms
Run status: finished
**************************************
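The "Wall time per iter" figures above follow the usual measurement pattern: run a few warm-up iterations (so that kernel selection and graph compilation are excluded), then time a fixed number of iterations and average. A generic sketch of that pattern is given below; train_step and the iteration counts are placeholders, not the actual internals of run_resnet50_perf.py.

# Generic wall-time-per-iteration measurement: warm up, then time and average.
# `train_step` is a placeholder for one training iteration (fwd + bwd + update).
import time

def measure_wall_time(train_step, warmup_iters=10, timed_iters=50):
    for _ in range(warmup_iters):
        train_step()
    start = time.perf_counter()
    for _ in range(timed_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return elapsed / timed_iters * 1000.0  # milliseconds per iteration

if __name__ == "__main__":
    ms = measure_wall_time(lambda: time.sleep(0.01))  # dummy step for demonstration
    print(f"Wall time per iter {ms:.0f} ms")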
You can pass --run-debug-tool to the script run_resnet50_perf.py; opr-level profiling and valgrind will then be invoked.
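For example (the flag spelling is taken from this document; consult --help for the authoritative option list):
python3 run_resnet50_perf.py --run-debug-tool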
Please compare the same job with and without the profiler. The timing statistics reported by the profiler do not include the profiler's own overhead.
Refer to the main function in megengine.utils.profile_analyze.
The valgrind massif tool can also be used. The script prints a memory usage summary on screen, such as:
GB
1.836^ #
| @@#::::::@:::
| @@@ #::::::@:::
| ::::::::::::@:::::::::@:@@@ #::::::@:::
| ::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| ::@@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @:@@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| ::::@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :::: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| :@@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@:@@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
| @@ :@@: :: :@ @@::@@@:: @@::::: :::::: @ ::: ::: @:@@@ #::::::@:::
0 +----------------------------------------------------------------------->Gi
0 19.39
You can change the "--run-iter" value to adjust the number of iterations to profile.
The detailed profiling is printed to massif.out.ms_print.
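The script drives valgrind internally; if you want to reproduce the massif run manually, the usual valgrind workflow is roughly (the exact arguments passed to resnet50_perf.py by the automated run are defined in run_resnet50_perf.py):
valgrind --tool=massif --massif-out-file=massif.out python3 resnet50_perf.py
ms_print massif.out > massif.out.ms_print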
The dumped profiling file prof.json can be parsed by megengine/utils/profile_analyze.py.
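If you want to look at prof.json directly rather than through profile_analyze.py, it is presumably plain JSON; the exploratory sketch below only prints the top-level structure and deliberately assumes nothing about the schema.

# Exploratory look at the dumped profile. The schema is defined by the MegEngine
# profiler, so this only summarizes the top-level structure instead of assuming fields.
import json

with open("prof.json") as f:
    prof = json.load(f)

if isinstance(prof, dict):
    for key, value in prof.items():
        summary = f"{len(value)} entries" if isinstance(value, (list, dict)) else repr(value)
        print(f"{key}: {type(value).__name__} -> {summary}")
else:
    print(f"top-level {type(prof).__name__} with {len(prof)} entries")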
The following information is printed from the profiler:
----------------- --------
total device time 0.318062
total host time 0.275643
----------------- --------
╒════════════════════╤══════════════╤═══════════════════════════╤═══════════════╤═════════╤══════════╤═════════════╤═════════════╤══════════════╕
│ device self time │ cumulative │ operator info │ computation │ FLOPS │ memory │ bandwidth │ in_shapes │ out_shapes │
╞════════════════════╪══════════════╪═══════════════════════════╪═══════════════╪═════════╪══════════╪═════════════╪═════════════╪══════════════╡
│ #0 │ 0.114 │ Elemwise │ 6.53 │ 57.40 │ 51.63 │ 454.02 │ None │ None │
│ 0.114 │ 35.8% │ 1481 │ GFLO │ GFLOPS │ GiB │ GiB/s │ │ │
│ 35.8% │ │ N/A │ │ │ │ │ │ │
├────────────────────┼──────────────┼───────────────────────────┼───────────────┼─────────┼──────────┼─────────────┼─────────────┼──────────────┤
│ #1 │ 0.176 │ ConvolutionBackwardFilter │ 523.15 │ 8.35 │ 5.28 │ 84.24 │ None │ None │
│ 0.0627 │ 55.5% │ 53 │ GFLO │ TFLOPS │ GiB │ GiB/s │ │ │
│ 19.7% │ │ N/A │ │ │ │ │ │ │
├────────────────────┼──────────────┼───────────────────────────┼───────────────┼─────────┼──────────┼─────────────┼─────────────┼──────────────┤
│ #2 │ 0.221 │ ConvolutionBackwardData │ 508.05 │ 11.31 │ 5.05 │ 112.42 │ None │ None │
│ 0.0449 │ 69.6% │ 52 │ GFLO │ TFLOPS │ GiB │ GiB/s │ │ │
│ 14.1% │ │ N/A │ │ │ │ │ │ │
├────────────────────┼──────────────┼───────────────────────────┼───────────────┼─────────┼──────────┼─────────────┼─────────────┼──────────────┤
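The percentages are relative to the total device time reported above: for example, the Elemwise self time of 0.114 is 0.114 / 0.318062 ≈ 35.8% of the total device time, and the cumulative column adds up self times in rank order (0.114 + 0.0627 ≈ 0.176, i.e. 55.5%; adding 0.0449 gives ≈ 0.221, i.e. 69.6%).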
Please read megengine/utils/profile_analyze.py for more usage details.