[cm4mlops] development plan #1023

Open
22 of 48 tasks
gfursin opened this issue Nov 23, 2023 · 0 comments

gfursin (Contributor) commented Nov 23, 2023

Feedback from the MLCommons Task Force on Automation and Reproducibility: extend CM workflows to support the following MLC projects:

  • check how to add network and multi-node code to MLPerf inference and CM automation (collaboration with MLC Network TF)

    • extend MLPerf inference with Flask code, glue it to our reference client/server code (Python first, C++ later) and wrap it with CM
    • address suggestions from Nvidia
      • --network-server=IP1,IP2...
      • --network-client
  • continue improving unified CM interface to run MLPerf inference implementations from different vendors

    • Optimized MLPerf inference implementations
      • Intel submissions (see Intel docs)
        • Support installation of conda packages in CM
      • Qualcomm submission
        • Add CM scripts to preprocess, calibrate and compile QAIC models for ResNet50, RetinaNet and Bert
        • Test in AWS
        • Test on Thundercomm RB6
          • Automatic model installation from a host device
        • Automatic detection and usage of quantization parameters
      • Nvidia submission
      • Google submission
      • NeuralMagic submission
    • Add the possibility to run any MLPerf implementation, including the reference one
    • Add the possibility to change the target device (e.g. GeForce instead of A100)
    • Expose batch sizes from all existing MLPerf inference reference implementations in the edge category (when applicable) in a unified way for ONNX, PyTorch and TensorFlow via the CM interface. Report implementations with a hardwired batch size.
    • Request from Miro: improve MLPerf inference docs for various backends
  • Develop universal CM-MLPerf docker to run any implementation with local data set and model (similar to Nvidia and Intel but with a unified CM interface)

  • Prototype new universal CM workflow to run any app on any target (with C++/Android/SSH)

  • Add support for any ONNX+loadgen model testing with tuning (prototyped already)

  • Improve CM docs (basic CM message and tutorials/notes for "users" and "developers")

  • Update/improve a list of all reusable, portable and tech-agnostic CM-MLOps scripts

  • Improve CM logging (stdout and stderr)

  • Visualize CM script dependencies

  • Check other suggestions from student teams from SCC'23

  • Start adding FAQ/notes from Discord/GitHub discussions about CM-MLPerf

  • prototype/reuse above universal CM workflow with ABTF for

    • inference
      • support different targets (host, remote embedded, Android)
      • get all info about target
      • add Python and C++ code for loadgen with different backends (PyTorch, ONNX, TF, TFLite, QAIC)
      • add object detection with COCO and trained model from Rod (without accuracy for now)
      • connect with training CM workflow
    • training (https://github.com/mlcommons/abtf-ssd-pytorch)
      • present CM-MLPerf at Croissant TF and discuss possible collaboration (doc)
      • add CM script to get Croissant
      • add datasets via Croissant
      • train and save model in CM cache to be loaded to inference
      • test with Rod
    • present prototype progress in next ABTF meeting (Grigori)
  • unify experiment and visualization

    • prepare high-level meta to run the whole experiment
    • aggregate and visualize results
    • if an MLPerf run is very short, we need to calibrate it, for example by multiplying N*10, similar to what I did in CK
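
The calibration idea in the last item could look roughly like the sketch below. It is only an illustration, not the CK or CM implementation: it assumes that N is the number of repetitions of a run, and the names `calibrate_repetitions`, `run_once`, `min_seconds`, `factor` and `max_repetitions` are all hypothetical. The repetition count is multiplied by 10 until one timed batch lasts long enough to be measured reliably.

```python
import time
from typing import Callable, Tuple


def calibrate_repetitions(
    run_once: Callable[[], None],
    min_seconds: float = 1.0,
    factor: int = 10,
    max_repetitions: int = 1_000_000,
    timer: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Grow the repetition count by `factor` until one timed batch is long enough.

    Hypothetical helper: returns (repetitions, elapsed_seconds) for the last batch.
    """
    n = 1
    while True:
        start = timer()
        for _ in range(n):
            run_once()
        elapsed = timer() - start
        # Stop when the batch is long enough to measure, or when growing
        # again would exceed the safety cap.
        if elapsed >= min_seconds or n * factor > max_repetitions:
            return n, elapsed
        n *= factor
```

The per-run time is then `elapsed / n`. The `timer` argument is there only so the loop can be checked with a fake clock; a real workflow would more likely rely on the minimum-duration settings of the benchmark harness itself.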

gfursin changed the title from "[cm-mlc] development plan until the end of 2023" to "[cm-mlc] development plan" on Jan 16, 2024
gfursin removed the mil label on Mar 20, 2024
gfursin changed the title from "[cm-mlc] development plan" to "[cm4mlops] development plan" on Oct 2, 2024
arjunsuresh removed their assignment on Nov 11, 2024