inference 2018 6

Long & Short Plans for August

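Strengthen TRT subgraph support; cover more layers
Clean up and improve existing code
- Unify libpaddle_inference_api and libpaddle_fluid
- Add CI for the inference demos
- Optimize the transpiler, moving it from the Python side to the C++ side
- Clean up and optimize existing code (including the Anakin CMake files, etc.)
- Improve unit tests (re-test the unit tests that were commented out)
Native performance optimization:
- Best practices for model performance optimization: more business models will keep requesting optimization before launch, on both CPU and GPU, so we need to summarize what we learn; improving one model can optimize multiple ops
  - Make inference more convenient: a model can be used for inference directly after training
  - Systematically optimize native ops
- Core models: keep the performance gap relative to Anakin under 50%, expand model optimization coverage, and do some overall performance optimization
Longer-term goals:
- Goal-oriented development: the existing machine environments also need a complete performance optimization plan, with focus ordered from near term to long term
- A framework similar to TF Serving, to solve build problems once and for all and speed up deployment
- Use batching to improve/even out performance
- INT8 inference.
- JIT optimizations for AVX2 and AVX512, and later for the VNNI instruction set.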
- lookup table + sequence pool fusion
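Intel collaboration:
- Clear division of work with the MKLDNN team; going forward we hope more people can focus on the native inference framework and its optimization
- MKLDNN keeps optimizing according to the Model RoadMap
- Stance on supporting ngraph and openvino
- Poor performance on the 5117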
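
As a rough illustration of the "lookup table + sequence pool fusion" item above: the unfused path first gathers one embedding row per id and then pools them, while a fused kernel accumulates each looked-up row directly into the output without materializing the gathered matrix. The sketch below is plain Python rather than Paddle code; the function names and the choice of sum pooling are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def lookup_then_pool(table: Sequence[Vector], ids: Sequence[int]) -> Vector:
    """Unfused: materialize the gathered rows, then sum-pool them."""
    gathered = [table[i] for i in ids]  # intermediate of shape [len(ids), dim]
    dim = len(table[0])
    return [sum(row[d] for row in gathered) for d in range(dim)]


def fused_lookup_pool(table: Sequence[Vector], ids: Sequence[int]) -> Vector:
    """Fused: add each looked-up row straight into the output buffer."""
    out = [0.0] * len(table[0])
    for i in ids:
        for d, value in enumerate(table[i]):
            out[d] += value
    return out
```

Both functions return the same pooled vector; the fused version simply avoids the intermediate `gathered` list, which is the memory traffic a fused op would save.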

6/27

Need discussion this week

Progress this week

MKLDNN

Merged some ops

@intel-team

[merge] added cycling the cifar and flowers datasets: https://github.com/PaddlePaddle/Paddle/pull/11640
[merge] MKLDNN elementwis_add with default broadcast operations: https://github.com/PaddlePaddle/Paddle/pull/11544#pullrequestreview-131558634
[merge] bnorm+relu fuse for mkldnn (inference) : https://github.com/PaddlePaddle/Paddle/pull/11434

@luotao

Tested the CPU multi-threading performance of Parallel Executor (with @zhaochengduo); it is now on par with V2: https://github.com/PaddlePaddle/Paddle/issues/11620
- Training: Fluid is on par with V2 (0.11.0); inference: Fluid is faster than V2 (0.11.0)
- Added a usage example for clone(for_test): https://github.com/PaddlePaddle/Paddle/pull/11661

@tangjian

[merged] enable dynamic load mklml lib on fluid: https://github.com/PaddlePaddle/Paddle/pull/11596

CPU core model optimization

OCR
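- Batch inference on the openblas build is now on par with production (slightly slower, by no more than 4%). @yangjingyuan has submitted it for testing.
- The MKL build that uses dlopen: @yangjingyuan will test it this Thursday.
Abacus:
- Collected Abacus's requirements for MKL, @intel-huying @intel-brain
- Evaluation: upgrading the CPU machines from 5117 to 6148 nearly doubles the speed: http://agroup.baidu.com/share/office/852ea4ea08a14dcca47f1b8cb19952e2
[WIP] Trying -Wl,-rpath,$ORIGIN to fix the user-reported "libmklml_intel.so couldn't be found" issues: https://github.com/PaddlePaddle/Paddle/issues/11002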

High-level API

MERGED demo/vis inference @tangjian @chunwei @luotao
- https://github.com/PaddlePaddle/Paddle/pull/11708 @chunwei
MERGED feature/analysis to support sub-graph for TRT engine
- https://github.com/PaddlePaddle/Paddle/pull/11538
MERGED bugfix/add_inference_lib_to_release
- https://github.com/PaddlePaddle/Paddle/pull/11455
MERGED inference high level api fix grammer
- https://github.com/PaddlePaddle/Paddle/pull/11718
MERGED add anakin release
- https://github.com/PaddlePaddle/Paddle/pull/11747

release-note

Added Chinese and English documentation for the inference high-level API
- https://github.com/PaddlePaddle/Paddle/pull/11718
- https://github.com/PaddlePaddle/Paddle/pull/11731
The inference lib now packages the Anakin lib
- https://github.com/PaddlePaddle/Paddle/pull/11747

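Plan for next week

Add an FAQ to the documentation for easy user reference
Add a demo for the TRT engine
Discuss with Intel the performance metrics of openvino, ngraph, etc. on different machines @chengsi @intel
Discuss Paddle serving with the Fengchao team
Send the email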

6/20

Need discussion this week

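Migrate some transpilers to the data flow graph framework
Performance comparison of MKLDNN vs. MKLML/Openblas
Preparation for the Intel meeting on 7/2.
- Attendees: 6-8 people from their side, including the Poland manager marcin, the ngraph contact baojun, LiuBrain, Jason, etc.
- Duration: 2 hours
- Content: planning for the two quarters of the second half of the year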

Progress this week

MKLDNN

@intel-team

layout support
- [merge] MKLDNN layout: activation operator: https://github.com/PaddlePaddle/Paddle/pull/11124
- [merge] MKLDNN layout: Gaussian random operator: https://github.com/PaddlePaddle/Paddle/pull/11523
- [merge] MKLDNN layout: sum operator: https://github.com/PaddlePaddle/Paddle/pull/11102
[review] bnorm+relu fuse for mkldnn (inference): https://github.com/PaddlePaddle/Paddle/pull/11434
[review] MKLDNN elementwis_add with default broadcast operations: https://github.com/PaddlePaddle/Paddle/pull/11544

@tangjian

Adjusted the CPU memory size used when initializing mkldnn: https://github.com/PaddlePaddle/Paddle/pull/11525
Fixed incorrect use of the global use_mkldnn flag in CPU-only environments: https://github.com/PaddlePaddle/Paddle/pull/11395

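CPU core model optimization

OCR:
- Load MKL through dynamic_loader to isolate it from the existing services (first version): https://github.com/PaddlePaddle/Paddle/compare/develop...jianhang-liu:dynamic_load_mklml @intel-liubrain @tangjian @luotao
- Switched the openblas build to batch inference to reduce latency, to serve as the baseline version: @OCR-yangjingyuan
Abacus:
- Understand Abacus's requirements for CPU optimization, @qiaolongfei @intel-China; meeting on Friday.
NLP:
- QA (hejiajia_91) hit an MKLDNN build failure during testing; @tangjian fixed it: https://github.com/PaddlePaddle/Paddle/pull/11571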
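
The switch to batch inference for OCR noted above can be pictured as grouping pending inputs into fixed-size batches, so that each model call spreads its fixed overhead across several inputs. The sketch below shows only that grouping step; `make_batches`, `run_batched`, and the `predict_batch` callable are hypothetical names and not part of the Paddle API.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def make_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batched(
    items: Sequence[T],
    batch_size: int,
    predict_batch: Callable[[List[T]], List[R]],
) -> List[R]:
    """Call predict_batch once per batch and concatenate results in order."""
    results: List[R] = []
    for batch in make_batches(items, batch_size):
        results.extend(predict_batch(batch))
    return results
```

The last batch may be smaller than `batch_size`; a real service would also need to decide how long to wait for a batch to fill, which this sketch leaves out.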

High-level API

@chunwei

Subgraph for TensorRT runs end to end for the first time; the graph-based analysis optimization framework has basically taken shape
- OPEN feature/analysis to support subgraph for TRT
  - https://github.com/PaddlePaddle/Paddle/pull/11538
- OPEN feature/trt add softmax converter
  - https://github.com/PaddlePaddle/Paddle/pull/11526
- MERGED bugfix/trt-op
  - https://github.com/PaddlePaddle/Paddle/pull/11487
- MERGED feature/pass manager
  - https://github.com/PaddlePaddle/Paddle/pull/11440
Unified output buffer management for the high-level API
- MERGED inference/unify output buffer management
  - https://github.com/PaddlePaddle/Paddle/pull/11569
One of the 3 image demos is done; the others are largely similar
- OPEN WIP demo/mobilenet inference
  - https://github.com/PaddlePaddle/Paddle/pull/11510
bugfix
- MERGED bugfix/add_inference_lib_to_release
  - https://github.com/PaddlePaddle/Paddle/pull/11482
- MERGED bugfix/anakin-ci
  - https://github.com/PaddlePaddle/Paddle/pull/11473

release-note

Important updates
- The inference library's high-level API unifies and simplifies output buffer memory management
  - https://github.com/PaddlePaddle/Paddle/pull/11569

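Plan for next week

mklml performance testing on parallel_executor @luotao @zhaochengduo
Use dynamic_loader to resolve incompatibilities among multiple MKL libraries @tangjian @luotao @intel-huying @intel-liubrain
@tangjian: use dynamic_loader to fix the LD PATH problem; related issues: #11452 #9034
- Besides the mklml .so, iomp.so and mkldnn.so also need to be handled
- The v2 version must be covered as well, because the whole of paddle should work without adding LD. @chunwei
Merge the high-level API documentation
Add more tests for TRT subgraph; make the analysis framework more stable
Overall code cleanup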
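
The dynamic_loader items in the plan above rely on dlopen-style loading: a library is opened at runtime from an explicit path, so it does not have to be on the loader search path, and with local symbol visibility its symbols are not exposed to other libraries in the same process. Paddle's dynamic_loader is C++; the Python sketch below only illustrates the idea with `ctypes`, and `load_isolated` and any candidate paths passed to it are hypothetical.

```python
import ctypes
import os
from typing import Iterable, Optional


def load_isolated(candidates: Iterable[str]) -> Optional[ctypes.CDLL]:
    """Try each path in order; return the first library that loads, else None."""
    # RTLD_LOCAL keeps the library's symbols out of the global namespace;
    # RTLD_NOW resolves all symbols at load time. Both exist on Unix only,
    # so fall back to 0 where the constants are missing.
    mode = getattr(os, "RTLD_LOCAL", 0) | getattr(os, "RTLD_NOW", 0)
    for path in candidates:
        try:
            return ctypes.CDLL(path, mode=mode)
        except OSError:
            continue  # not found or not loadable: try the next candidate
    return None
```

A caller would pass the directories it controls, for example the directory of the paddle shared object followed by a system fallback, and treat a `None` result as "library unavailable" instead of failing at process start.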

6/13

Need discussion this week

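MKL dynamic link @luotao @tangjian (at risk)
Performance of parallel_executor vs. V2 (at risk)
NLP text inference: can we push it to production? Should we take on another NLP task for testing?
Do we need to implement a JIT (just-in-time) version for CPU?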
Large Sparse Matrix support in MKLDNN

Paddle serving @luotao @tangjian @liuyiqun @yanchunwei

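Overall progress

High-level API 90%
- DONE finalize the high-level interface
- DONE native implementation
- DOING Anakin
Subgraph 35%
- DOING framework
- DOING TRT support
MKLDNN 30%
CPU core model optimization 70%
- DOING OCR CPU
- DOING sentiment classification CPU
Documentation && CI 40%
- DONE initial docs for the old interface
- TODO docs for the new interface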

Progress this week

MKLDNN

ResNet50 on the flowers dataset, batch_size=128, on an 8180 CPU: matches the TensorFlow numbers at 76 images per second: http://agroup.baidu.com/paddlepaddle/view/office/975392 @intel-team
[Merge] MKLDNN layout support (MKLDNN layer support): https://github.com/PaddlePaddle/Paddle/pull/11040 @intel-team
- [Merge] Support for pool operator: https://github.com/PaddlePaddle/Paddle/pull/11101
- [Merge] Support for batch norm operator: https://github.com/PaddlePaddle/Paddle/pull/11098
- [Merge] Support for convolution operator: https://github.com/PaddlePaddle/Paddle/pull/11099
- [Review] Support for activation operator: https://github.com/PaddlePaddle/Paddle/pull/11124
- [Review] Support for sum operator: https://github.com/PaddlePaddle/Paddle/pull/11102
Added a flag for global control of use_mkldnn:
- [Merge] https://github.com/PaddlePaddle/Paddle/pull/11319 @luotao
- [Doing] https://github.com/PaddlePaddle/Paddle/pull/11395 @tangjian

High-level API

@chunwei

OPEN Feature/pass manager
- https://github.com/PaddlePaddle/Paddle/pull/11440
OPEN bugfix/trt op with kernel
- https://github.com/PaddlePaddle/Paddle/pull/11408
OPEN doc/inference api
- https://github.com/PaddlePaddle/Paddle/pull/11332
OPEN feature/anakin ci
- Found that Anakin's protobuf version does not match, so the anakin lib currently cannot be packaged into the inference library
- https://github.com/PaddlePaddle/Paddle/pull/11330
CLOSED loose threshold of TRT for CI in different model
- https://github.com/PaddlePaddle/Paddle/pull/11305

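CPU core model optimization

MKLML multi-threaded speedup (using ParallelDo): with 16 threads, the speedup is below 8x. http://agroup.baidu.com/paddlepaddle/md/article/964808 @luotao @tangjian
OCR: after switching to the fluid mkl library @luotao @OCR-yangjingyuan
- License plate recognition service (switched from the fluid openblas build to the fluid mkl build): accuracy is normal and speed is on par with production.
- Other services (e.g. driver's license / vehicle license, which use other inference libraries) lose accuracy (2 points worse on one metric). Discussed with @intel-huying and @intel-liubrain: as with the earlier hang, the MKL symbol table is not restricted, so it affects the other inference libraries. Solution: try dlopen, i.e. dynamic_loader.
NLP: anakin has started reproducing our numbers; production uses v3, v4, 5117 and other CPUs, while we tested on V2 CPUs.
- [WIP] In the open-source sequence labeling task, inference is rather slow; under investigation. Contact: @焦振宇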
- [Merged] add initial memory flag in MB for infer
- [Merged] Infer multi-threads API Demo and UT
- [WIP] scope thread safe

Documentation && CI version

@luotao

Added the whl download link for cuda9.0_cudnn7_avx_mkl; the so link is still to be updated: https://github.com/PaddlePaddle/Paddle/pull/11417 @chunwei
OPEN bugfix/add_inference_lib_to_release
- https://github.com/PaddlePaddle/Paddle/pull/11455

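Plan for next week

mklml performance testing on parallel_executor @luotao @zhaochengduo
Use dynamic_loader to resolve incompatibilities among multiple MKL libraries @luotao @tangjian @intel-huying @intel-liubrain
machine_translation: duplicate unique key name problem
Get the manual subgraph running on mnist @chunwei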

release-note

6/6

Need discussion this week

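Time of the next MKLDNN meeting: Wednesday after 2 pm?
When auto-deploying the inference lib, build against the MKL static library. (Intel has reproduced the problem, but the mklml library cannot be fixed in the short term.)
Multi-threaded inference example: how to share a global model across threads. The current example is in unit-test form and needs to be strengthened. @NLP-dongdaxiang and others have asked about this.
Problem with the use_mkldnn flag: https://github.com/PaddlePaddle/Paddle/issues/10765
TODO 4 image demos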
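
For the question above about sharing a global model across threads, one common pattern is to load the parameters once, treat them as read-only, and give each thread its own lightweight predictor that shares those parameters but owns its scratch state. The sketch below illustrates that pattern in plain Python; `SharedModel`, `Predictor`, and `run_in_threads` are hypothetical names and do not reflect the actual Paddle inference API.

```python
import threading
from typing import List


class SharedModel:
    """Parameters loaded once and only read afterwards, so sharing is safe."""

    def __init__(self, weights: List[float]) -> None:
        self.weights = tuple(weights)  # immutable view of the parameters


class Predictor:
    """Per-thread handle: shares the model, owns its own scratch buffer."""

    def __init__(self, model: SharedModel) -> None:
        self.model = model
        self.scratch: List[float] = []

    def clone(self) -> "Predictor":
        return Predictor(self.model)  # same weights, fresh scratch

    def run(self, inputs: List[float]) -> float:
        self.scratch = [w * x for w, x in zip(self.model.weights, inputs)]
        return sum(self.scratch)


def run_in_threads(base: Predictor, batches: List[List[float]]) -> List[float]:
    """Run one input per thread, each thread using its own clone of base."""
    results: List[float] = [0.0] * len(batches)

    def worker(index: int) -> None:
        predictor = base.clone()
        results[index] = predictor.run(batches[index])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(batches))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each worker writes only to its own slot in `results` and its own `scratch`, no lock is needed here; any mutable state shared between predictors would need explicit synchronization.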

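Overall progress

High-level API 80%
- DONE finalize the high-level interface
- DONE native implementation
- DOING Anakin
Subgraph 35%
- DOING framework
- DOING TRT support
MKLDNN 30%
CPU core model optimization 70%
- DOING OCR CPU
- DOING sentiment classification CPU
Documentation && CI 40%
- DONE initial docs for the old interface
- TODO docs for the new interface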

Progress this week

MKLDNN

[Merge] add ParallelDo CPU multi-thread training example for benchmark/fluid, fix test and flower dataset error and refine the codes (adds a benchmark script for CPU multi-threaded training using ParallelDo): @luotao
[Merge] rename Mkldnn to MKLDNN (updates the MKLDNN naming convention): https://github.com/PaddlePaddle/Paddle/pull/11147 @tangjian
layout related PR: @intel-team
- [review] MKLDNN layout support (MKLDNN layer support): https://github.com/PaddlePaddle/Paddle/pull/11040
- Support for sum operator: https://github.com/PaddlePaddle/Paddle/pull/11102
- Support for pool operator: https://github.com/PaddlePaddle/Paddle/pull/11101
- Support for convolution operator: https://github.com/PaddlePaddle/Paddle/pull/11099
- Support for batch norm operator: https://github.com/PaddlePaddle/Paddle/pull/11098
- Support for activation operator: https://github.com/PaddlePaddle/Paddle/pull/11124

High-level API

rewrite unittest of trt_activation_op (rewrote the tensorrt activation op and its unit test): https://github.com/PaddlePaddle/Paddle/pull/11222 @luotao
Added demos and stability improvements to the high-level API; the multi-threaded Clone demo is still missing (bug fixing) @chunwei
merged, simplify inference api
merged, inference API little fix
merged, feature/simple inference demo
merged, Feature/anakin embed

@tangjian

@chunwei

sub-graph related, tensorrt_engine_op ready, the manual subgraph runs end to end

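CPU core model optimization

OCR: @luotao
- Compatibility problem between the latest MKLML library and the older MKL static library (production service hangs): Intel has reproduced the problem and reported it to the MKL team.
- Provided @yangjingyuan with a build made with the paddlepaddle/paddle_manylinux_devel:cuda8.0_cudnn5 image plus the MKL static library.
NLP: @tangjian
- Large initial memory problem (large memory when infer issue)
- Random aborts under multi-threading (abort in multi-threads inference on CPU)
- TODO:
  - Specify the initial memory in terms of physical memory.
  - Add a UT and a demo for the multi-threaded API.
  - Verify the correctness of the high-level API.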

Documentation && CI version

@luotao

[Merge] add build and install document of fluid inference library (adds documentation for automatically downloading, building, and installing the inference library): https://github.com/PaddlePaddle/Paddle/pull/11090

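Plan for next week

Automatically deploy on CI the paddle inference library built with the MKL static library. @luotao
MKLDNN @luotao @tangjian
- Performance testing of mkldnn/mklml/openblas on ResNet50 with the flowers dataset on 6148 machines.
- Add a global use_mkldnn environment variable
- Work with the Intel team to complete the series of MKLDNN layout changes.
- Align on the 7.5 goals and notify the MKLDNN team by email
Official release of the high-level API @chunwei
- Add usage documentation, deprecate the old interface docs, and fully promote the high-level API
- move contrib/inference to fluid/inference
- Subgraph MLP benchmark
- Get the automatic subgraph framework running end to end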

release-note
