diff --git a/lite/example/cpp_example/mge/README.md b/lite/example/cpp_example/mge/README.md
deleted file mode 100644
index f41115d2..00000000
--- a/lite/example/cpp_example/mge/README.md
+++ /dev/null
@@ -1,166 +0,0 @@
-# Example
-
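-This example directory contains a set of examples that call the lite interfaces to run inference. They mainly
-demonstrate how different lite interfaces are used to run inference in different situations. All examples here
-use shufflenet for the demonstration.
-
-## Building and running the examples with bazel
-
-* Follow the README.md in the root directory to set up the megvii3 bazel build environment, then build the CPU version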
-```bash
-    ./bazel build //brain/megbrain/lite:lite_examples --cpu="k8" \
-        --compiler="gcc9" -c opt
-```
-* To run, specify the name of the example, the model to run, and the input data for the model
- * List all example names
-```
-    bazel-bin/brain/megbrain/lite/lite_examples
-```
- * Run an example; the command below runs basic_load_from_memory
-```
-    bazel-bin/brain/megbrain/lite/lite_examples \
-        basic_load_from_memory \
-        path-to-megbrain/lite/test/resource/lite/shufflenet.mge \
-        path-to-megbrain/lite/test/resource/lite/input_data.npy
-```
-
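-## Basic usage
-
-* **Implemented in basic.cpp, including basic_load_from_path and
- basic_load_from_memory**
-
-* This example uses lite to perform basic inference. The model is loaded with the default configuration.
-Before forward, the input data is copied into the input tensor; after forward completes, the data is
-copied from the output tensor into user memory. Both the input tensor and the output tensor are obtained
-from the Network by name, and the layout of each can be read directly from the corresponding
-tensor. **The layout of the output tensor is only correct when read after forward has completed.**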
-
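-## Using user-specified memory for input and output
-
-* **Implemented in reset_io.cpp, which contains two examples: reset_input and
-reset_input_output.**
-
-* This example shows an input tensor backed by user-specified memory (which already holds the input
-data). The output tensor can also use user-specified memory, so that after the Network finishes forward the results
-are stored in the specified output memory. This avoids unnecessary memory copies.
-
-* This is done mainly through the tensor's reset interface, which re-specifies the tensor's memory and the corresponding
-layout; if no layout is given, the tensor's original layout is used.
-
-* **Because the memory is allocated by the user, the user must know the layouts of the input and output tensors in advance and
-allocate memory accordingly. In addition, the lifetime of memory set on a tensor through reset is not managed by the tensor;
-it is managed by the user.**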
-
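-## Using device memory for input and output
-
-* **Implemented in device_io.cpp, in the two examples device_input and device_input_output.**
-
-* This example configures the model to run on a device (CUDA) and uses device memory allocated in advance by the user
-as the model's input and output. The Network must be told at construction time that the input and output are on the device; otherwise they default
-to the CPU. Everything else is the same as in **Using user-specified memory for input and output**.
-
-* The tensor's is_host() interface tells whether the tensor is on the device side or the host side.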
-
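-## Allocating pinned host memory as input
-
-* **Implemented in device_io.cpp, in the function pinned_host_input.**
-
-* In this example the model runs on a device (CUDA) but the input and output are on the CPU. To speed up the host2device
-copy, the memory of the CPU input tensor is allocated in advance as CUDA pinned memory. Currently, when the
-output tensor is not on the device, it is pinned host memory by default.
-
-* To allocate pinned host memory, specify the device, the layout, and the is_host_pinned parameter when constructing the tensor;
-the memory allocated this way is pinned host memory.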
-
-    ```C
-     bool is_pinned_host = true;
-     auto tensor_pinned_input =
-             Tensor(LiteDeviceType::LITE_CUDA, input_layout, is_pinned_host);
-    ```
-
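-## User-specified memory allocator
-
-* **Implemented in user_allocator.cpp, in the function config_user_allocator.**
-
-* This example uses a user-defined CPU memory allocator to show how to set a custom Allocator. A user-defined
-allocator must inherit from the Allocator base class in lite and implement the allocate and free interfaces. So far this has been
-verified on CPU only; other devices remain to be tested.
-
-* The interface for setting a custom memory allocator is the following method of Network: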
-    ```C
-    Network& set_memory_allocator(std::shared_ptr<Allocator> user_allocator);
-    ```
-
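-## Sharing the same model weights across multiple Networks
-
-* **Implemented in network_share_weights.cpp, in the function network_share_same_weights.**
-
-* Users often want several Networks to share one copy of the weights. Because the weights in a model are read-only, sharing them reduces
-the model's runtime memory usage. This example shows how to do this in lite: first create a new Network,
-for which the user can specify a new Config, NetworkIO, and other settings, so that the newly created Network can serve a different
-purpose.
-
-* The interface for loading a new Network from an existing Network is the following method of Network: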
-    ```C
-        static void shared_weight_with_network(
-            std::shared_ptr<Network> dst_network,
-            const std::shared_ptr<Network> src_network);
-    ```
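-    * dst_network: the newly loaded Network
-    * src_network: the existing Network that has already been loaded
-
-## CPU core binding
-
-* **Implemented in cpu_affinity.cpp, in the function cpu_affinity.**
-
-* This example configures the model to run on multiple CPU threads and then uses the Network's
-set_runtime_thread_affinity to set an affinity callback. The callback receives the id of the current thread, and the user can
-decide how to bind cores based on that id. With n threads in total, the thread whose id is n-1 is the main thread.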
-
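-## Registering user-defined decryption algorithms and keys
-
-* **Implemented in user_cryption.cpp, in the functions register_cryption_method and update_aes_key.**
-
-* These two examples use lite's interfaces for registering a custom decryption algorithm and for updating one, so that a model
-can be loaded with a user-defined decryption algorithm. The first example defines a custom decryption method (which effectively does nothing:
-it XORs the model with the key twice and returns it, which is the same as returning the original model) and registers it with lite. Later, when a Network is created,
-bare_model_cryption_name in its config specifies the name of the decryption algorithm. The second example shows how to update
-the key.
-lite currently predefines several decryption algorithms:
-    * AES_default: its key consists of 32 unsigned chars, 0 to 31 by default
-    * RC4_default: its key consists of 8 unsigned chars made up of a hash key and an enc_key, with the hash
-      key first and the enc_key after it.
-    * SIMPLE_FAST_RC4_default: its key has the same composition as RC4_default.
-The rough naming rule is: the uppercase part at the front is the name of the algorithm, and the lowercase part after the '_' denotes the decryption key.
-The interfaces are: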
-    ```C
-    bool register_decryption_and_key(std::string decrypt_name,
-                                    const DecryptionFunc& func,
-                                    const std::vector<uint8_t>& key);
-    bool update_decryption_or_key(std::string decrypt_name,
-                                    const DecryptionFunc& func,
-                                    const std::vector<uint8_t>& key);
-    ```
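-The register interface requires all three arguments to be valid. For update, decrypt_name must be an existing decryption algorithm,
-and whichever of func and key is non-empty is used to update the decrypt_name algorithm.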
-
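-## Asynchronous execution mode
-
-* **Implemented in basic.cpp, in the function async_forward.**
-
-* Registering an asynchronous callback through the interface switches the Network's forward mode to asynchronous execution.
-Asynchronous execution is currently supported only on CPU and on CUDA 10.0 and above. During inference in asynchronous mode,
-the main thread can do other work while the worker threads are computing, which avoids long waits, but
-there is no benefit on some single-core processors.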
-
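-## Pure C examples
-
-* **Implemented in lite_c_interface.cpp, in the functions basic_c_interface,
-device_io_c_interface, and async_c_interface**
-
-* Lite wraps its C++ interface and exposes a pure C interface. Users who do not depend on the Lite source code
-should integrate through the pure C interface.
-* Every pure C interface returns an int. A non-zero value means an error occurred, and
-LITE_get_last_error must be called to get the error message.
-* Every pure C get function requires defining a corresponding object first and passing a pointer to it into the interface;
-Lite writes the result to the address the pointer refers to.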
diff --git a/lite/pylite/pylite.md b/lite/pylite/pylite.md
deleted file mode 100755
index 12ab761e..00000000
--- a/lite/pylite/pylite.md
+++ /dev/null
@@ -1,218 +0,0 @@
-# PyLite
-
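-The Python interface of Lite offers a more convenient and flexible way to run model inference with Lite. It can be used in any environment that meets the following conditions:
-
-* The OS can install [Python3](https://www.python.org/downloads/)
-* The platform is one that supports inference builds according to [BUILD_README](../../scripts/cmake-build/BUILD_README.md)
-
-## Installation
-### Installing from the whl package
-Details of the prebuilt Lite whl packages currently released:
-
-* Prebuilt packages are provided for Linux-x64 (with CUDA), windows-x64 (with CUDA), and macos-x64 (CPU only)
-* They can be installed directly with pip3. For other OS-ARCH combinations, build from source if needed; see [BUILD_README](../../scripts/cmake-build/BUILD_README.md)
-* For how the prebuilt packages are built, see [BUILD_PYTHON_WHL_README.md](../../scripts/whl/BUILD_PYTHON_WHL_README.md)
-
-Open-source version: the prebuilt packages are released together with MegEngine releases and share the MegEngine version number. To install: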
-
-```shell
-python3 -m pip install --upgrade pip
-python3 -m pip install megengine -f https://megengine.org.cn/whl/mge.html
-```
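-After installation, use it with import megenginelite.
-
-### Development and debugging
-
-In development mode, you can build the lite shared library with CMake, again following [BUILD_README](../../scripts/cmake-build/BUILD_README.md):
-
-* On Windows, the resulting dll is lite_shared_whl.dll
-* On non-Windows platforms, the resulting so is liblite_shared_whl.so
-
-* Steps to build the libraries above:
-    * Clone the code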
-    ```shell
-    git clone git@github.com:MegEngine/MegEngine.git  # open-source version
-    ```
-    * Prepare for the build
-    ```shell
-    cd MegEngine  # open-source version
-    bash ./third_party/prepare.sh
-    ```
-    * Build the HOST version:
-    ```shell
-    ./scripts/cmake-build/host_build.sh
-    ```
-    * Build the HOST version (with CUDA):
-    ```shell
-    ./scripts/cmake-build/host_build.sh -c
-    ```
-    * Build for Android:
-
-    ```shell
-    scripts/cmake-build/cross_build_android_arm_inference.sh
-    ```
-
-    * For other OS-ARCH combinations, see [BUILD_README](../../scripts/cmake-build/BUILD_README.md)
-    * After the build finishes, the library can be found under build_dir; here its directory is assumed to be LITE_LIB_PATH=path_of_lite_shared_whl
-    * Start using megenginelite
-    ```shell
-    export LITE_LIB_PATH=path_of_lite_shared_whl
-    export PYTHONPATH=lite/pylite:$PYTHONPATH
-    # megenginelite can now be imported and used
-    ```
-
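-## Using megenginelite in python3
-The python3 interface of Lite is a wrapper around its C/C++ interface, and both use the same model format. megenginelite provides two data interfaces: LiteTensor and LiteNetwork.
-
-### LiteTensor
-LiteTensor provides the interfaces for operating on data, including:
-* fill_zero: sets the tensor's memory to all zeros
-* share_memory_with: shares memory with another LiteTensor
-* copy_from: copies data from another LiteTensor into this tensor's own memory
-* reshape: changes the shape of the LiteTensor; the data in memory is unchanged
-* slice: slices the data in the LiteTensor; start, end, and step must be given for each dimension.
-* set_data_by_share: after the call, the LiteTensor's memory is shared from the memory of the input array; the input array must be a numpy ndarray, and the tensor must be on the CPU
-* set_data_by_copy: the LiteTensor copies data from the input data, which may be a list or a numpy ndarray; the amount of data must not exceed the tensor's capacity, and the tensor must be on the CPU
-* to_numpy: copies the data in the LiteTensor into a numpy array and returns it to the user; for a non-contiguous LiteTensor, such as one produced by slice, the data is copied into a contiguous numpy array. This interface is mainly for debugging and has a performance cost.
-
-#### Usage examples
-* Example of setting data on a LiteTensor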
-```python
-def test_tensor_set_data():
-    layout = LiteLayout([2, 16], "int8")
-    tensor = LiteTensor(layout)
-    assert tensor.nbytes == 2 * 16
-
-    data = [i for i in range(32)]
-    tensor.set_data_by_copy(data)
-    real_data = tensor.to_numpy()
-    for i in range(32):
-        assert real_data[i // 16][i % 16] == i
-
-    arr = np.ones([2, 16], "int8")
-    tensor.set_data_by_copy(arr)
-    real_data = tensor.to_numpy()
-    for i in range(32):
-        assert real_data[i // 16][i % 16] == 1
-
-    for i in range(32):
-        arr[i // 16][i % 16] = i
-    tensor.set_data_by_share(arr)
-    real_data = tensor.to_numpy()
-    for i in range(32):
-        assert real_data[i // 16][i % 16] == i
-
-    arr[0][8] = 100
-    arr[1][3] = 20
-    real_data = tensor.to_numpy()
-    assert real_data[0][8] == 100
-    assert real_data[1][3] == 20
-```
-* Example of sharing tensor memory
-```python
-def test_tensor_share_memory_with():
-    layout = LiteLayout([4, 32], "int16")
-    tensor = LiteTensor(layout)
-    assert tensor.nbytes == 4 * 32 * 2
-
-    arr = np.ones([4, 32], "int16")
-    for i in range(128):
-        arr[i // 32][i % 32] = i
-    tensor.set_data_by_share(arr)
-    real_data = tensor.to_numpy()
-    for i in range(128):
-        assert real_data[i // 32][i % 32] == i
-
-    tensor2 = LiteTensor(layout)
-    tensor2.share_memory_with(tensor)
-    real_data = tensor.to_numpy()
-    real_data2 = tensor2.to_numpy()
-    for i in range(128):
-        assert real_data[i // 32][i % 32] == i
-        assert real_data2[i // 32][i % 32] == i
-
-    arr[1][18] = 5
-    arr[3][7] = 345
-    real_data = tensor2.to_numpy()
-    assert real_data[1][18] == 5
-    assert real_data[3][7] == 345
-```
-For more usage, see test/test_tensor.py in pylite
-### LiteNetwork
-LiteNetwork mainly provides model loading and execution. For the models used, see the section on models in the lite readme
-* Example of loading and running a basic model on CPU
-```python
-def test_network_basic():
-    source_dir = os.getenv("LITE_TEST_RESOURCE")
-    input_data_path = os.path.join(source_dir, "input_data.npy")
-    # read input to input_data
-    input_data = np.load(input_data_path)
-    model_path = os.path.join(source_dir, "shufflenet.mge")
-
-    network = LiteNetwork()
-    network.load(model_path)
-
-    input_name = network.get_input_name(0)
-    input_tensor = network.get_io_tensor(input_name)
-    output_name = network.get_output_name(0)
-    output_tensor = network.get_io_tensor(output_name)
-
-    assert input_tensor.layout.shapes[0] == 1
-    assert input_tensor.layout.shapes[1] == 3
-    assert input_tensor.layout.shapes[2] == 224
-    assert input_tensor.layout.shapes[3] == 224
-    assert input_tensor.layout.data_type == LiteDataType.LITE_FLOAT
-    assert input_tensor.layout.ndim == 4
-
-    # copy input data to input_tensor of the network
-    input_tensor.set_data_by_copy(input_data)
-    for i in range(3):
-        network.forward()
-        network.wait()
-
-    output_data = output_tensor.to_numpy()
-    print('shufflenet output max={}, sum={}'.format(output_data.max(), output_data.sum()))
-```
-* Using device memory as model input on CUDA; config and IO information must be set when constructing the network
-```python
-def test_network_device_IO():
-    source_dir = os.getenv("LITE_TEST_RESOURCE")
-    input_data_path = os.path.join(source_dir, "input_data.npy")
-    model_path = os.path.join(source_dir, "shufflenet.mge")
-    # read input to input_data
-    input_data = np.load(input_data_path)
-    input_layout = LiteLayout([1, 3, 224, 224])
-    host_input_data = LiteTensor(layout=input_layout)
-    host_input_data.set_data_by_share(input_data)
-    dev_input_data = LiteTensor(layout=input_layout, device_type=LiteDeviceType.LITE_CUDA)
-    dev_input_data.copy_from(host_input_data)
-
-    # construct LiteOption
-    options = LiteOptions()
-    options.weight_preprocess = 1
-    options.var_sanity_check_first_run = 0
-    net_config = LiteConfig(device_type=LiteDeviceType.LITE_CUDA, option=options)
-
-    # construct LiteIO, is_host=False means the input tensor will use device memory
-    input_io = LiteIO("data", is_host=False)
-    ios = LiteNetworkIO()
-    ios.add_input(input_io)
-
-    network = LiteNetwork(config=net_config, io=ios)
-    network.load(model_path)
-
-    input_name = network.get_input_name(0)
-    dev_input_tensor = network.get_io_tensor(input_name)
-    output_name = network.get_output_name(0)
-    output_tensor = network.get_io_tensor(output_name)
-
-    # copy input data to input_tensor of the network
-    dev_input_tensor.share_memory_with(dev_input_data)
-    for i in range(3):
-        network.forward()
-        network.wait()
-
-    output_data = output_tensor.to_numpy()
-    print('shufflenet output max={}, sum={}'.format(output_data.max(), output_data.sum()))
-```
-For more usage, see test/test_network.py and test/test_network_cuda.py in pylite