Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is especially effective on 4th Gen Intel® Xeon® Scalable processors (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced software optimizations and a unique compression-aware runtime (introduced with the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
- Accelerated end-to-end Transformer-based applications such as Stable Diffusion, GPT-J-6B, BLOOM-176B, T5, and SetFit
pip install intel-extension-for-transformers
For more installation methods, please refer to the Installation page.
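As a quick sanity check after installation, the snippet below simply imports the package and the optimization APIs used in the quick-start example that follows; the import paths mirror that example and may differ across versions.

```python
# Sanity check: confirm the toolkit and its optimization APIs import cleanly
import intel_extension_for_transformers
from intel_extension_for_transformers.optimization import QuantizationConfig
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

print("intel-extension-for-transformers imported successfully")
```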
# Prepare the SST-2 dataset and tokenize it for the model
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_datasets = raw_datasets.map(
    lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128),
    batched=True,
)
from intel_extension_for_transformers.optimization import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

# Load the fine-tuned SST-2 model and attach human-readable label mappings
config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", config=config)
model.config.label2id = {'NEGATIVE': 0, 'POSITIVE': 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(
    model=model,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer,
)
# Quantize with an accuracy-aware config that tracks eval_loss (lower is better)
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)
# Run inference with the quantized model
inputs = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**inputs).logits.argmax().item()
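After quantization, the predicted class index can be mapped back to a label via the `id2label` mapping set earlier. The sketch below builds on the variables from the quick-start snippet (`model`, `output`, `trainer`) and assumes `NLPTrainer` inherits `save_model` from `transformers.Trainer`; the output directory name is illustrative.

```python
# Map the predicted class index back to its label (id2label was set above)
predicted_label = model.config.id2label[output]
print(f"Predicted sentiment: {predicted_label}")  # e.g. 'POSITIVE'

# Optionally persist the tuned model; save_model comes from transformers.Trainer
trainer.save_model("./quantized-sst2-model")  # illustrative output path
```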
For more quick-start samples, please refer to the Get Started page. For more validated examples, please refer to the Support Model Matrix.
| **OVERVIEW** | | | |
|---|---|---|---|
| Model Compression | Neural Engine | Kernel Libraries | Examples |
| **MODEL COMPRESSION** | | | |
| Quantization | Pruning | Distillation | Orchestration |
| Neural Architecture Search | Export | Metrics/Objectives | Pipeline |
| **NEURAL ENGINE** | | | |
| Model Compilation | Custom Pattern | Deployment | Profiling |
| **KERNEL LIBRARIES** | | | |
| Sparse GEMM Kernels | Custom INT8 Kernels | Profiling | Benchmark |
| **ALGORITHMS** | | | |
| Length Adaptive | Data Augmentation | | |
| **TUTORIALS AND RESULTS** | | | |
| Tutorials | Supported Models | Model Performance | Kernel Performance |
- Blog published on Medium: MLefficiency — Optimizing transformer models for efficiency (Dec 2022)
- NeurIPS'2022: Fast DistilBERT on CPUs (Nov 2022)
- NeurIPS'2022: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM (Nov 2022)
- Blog published by Cohere: Top NLP Papers—November 2022 (Nov 2022)
- Blog published by Alibaba: Deep learning inference optimization for Address Purification (Aug 2022)
- NeurIPS'2021: Prune Once for All: Sparse Pre-Trained Language Models (Nov 2021)