From 28f0e701c203c10ec0e1c91679c13dcbb695c24d Mon Sep 17 00:00:00 2001
From: sekyonda <127536312+sekyondaMeta@users.noreply.github.com>
Date: Tue, 15 Aug 2023 15:15:32 -0400
Subject: [PATCH 1/5] Adding Performance Checklist to site

Adding performance checklist to site. Plan to add this checklist to FAQs page in the future as well. For now, it can be accessed through the performance page.
---
 docs/performance_checklist.md | 40 +++++++++++++++++++++++++++++++++++
 docs/performance_guide.md     |  6 ++++++
 2 files changed, 46 insertions(+)
 create mode 100644 docs/performance_checklist.md

diff --git a/docs/performance_checklist.md b/docs/performance_checklist.md
new file mode 100644
index 0000000000..c2fc44c6d7
--- /dev/null
+++ b/docs/performance_checklist.md
@@ -0,0 +1,40 @@
+# Model Inference Optimization Checklist
+
+This checklist describes some steps that should be completed when diagnosing model inference performance issues. Some of these suggestions are only applicable to NLP models (e.g., ensuring the input is not over-padded and sequence bucketing), but the general principles are useful for other models too.
+
+## General System Optimizations
+
+- Check the versions of PyTorch, Nvidia driver, and other components and update to the latest compatible releases. Oftentimes known performance bugs have already been fixed.
+
+- Collect system-level activity logs to understand the overall resource utilizations. It’s useful to know how the model inference pipeline is using the system resources at a high level, as the first step of optimization. Even simple CLI tools such as nvidia-smi and htop would be helpful.
+
+- Start with the target that has the highest impact on performance. It should be obvious from the system activity logs where the biggest bottleneck is – look beyond model inference, as pre/post processing can be expensive and can affect the end-to-end throughput just as much.
+
+- Quantify and mitigate the influence of slow I/O such as disk and network on end-to-end performance. While optimizing I/O is out of scope for this checklist, look for techniques that use async, concurrency, pipelining, etc. to effectively “hide” the cost of I/O.
+
+- For model inference on input sequences of dynamic length (e.g., transformers for NLP), make sure the tokenizer is not over-padding the input. If a transformer was trained with padding to a constant length (e.g., 512) and deployed with the same padding, it would run unnecessarily slowly (by orders of magnitude) on short sequences.
+
+- Vision models with input in JPEG format often benefit from faster JPEG decoding, on CPU with libraries such as libjpeg-turbo and Pillow-SIMD, and on GPU with torchvision.io.decode_jpeg and Nvidia DALI.
+As this [example](https://colab.research.google.com/drive/1NMaLS8PG0eYhbd8IxQAajXgXNIZ_AvHo?usp=sharing) shows, Nvidia DALI is about 20% faster than torchvision, even on an old K80 GPU.
+
+## Model Inference Optimizations
+
+Start model inference optimization only after other factors, the “low-hanging fruit”, have been extensively evaluated and addressed.
+
+- Use fp16 for GPU inference. The speed will most likely more than double on newer GPUs with tensor cores, with negligible accuracy degradation. Technically fp16 is a type of quantization but since it seldom suffers from loss of accuracy for inference it should always be explored. As shown in this [article](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#abstract), use of fp16 offers a speed-up in large neural network applications.
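A minimal sketch, separate from the patch content, of what the fp16 item above can look like in practice; the ResNet-50 model and the input shape are placeholders:

```python
# Minimal fp16 inference sketch: run an eval-mode model on the GPU under autocast
# so matmuls and convolutions execute in half precision on tensor cores.
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval().cuda()  # placeholder model
batch = torch.randn(8, 3, 224, 224, device="cuda")   # placeholder input

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(batch)

# Alternative: convert the weights outright with model.half() and feed batch.half().
```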
+
+- Use model quantization (i.e., int8) for CPU inference. Explore different quantization options: dynamic quantization, static quantization, and quantization aware training, as well as tools such as Intel Neural Compressor that provide more sophisticated quantization methods.
+
+- Balance throughput and latency with smart batching. While meeting the latency SLA try larger batch sizes to increase the throughput.
+
+- Try [torchscript](https://pytorch.org/docs/stable/jit.html), [inference_mode](https://pytorch.org/docs/stable/generated/torch.inference_mode.html), and [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html). Torschscript provides tools that incrementally transition models from purely python to a torchscript program which can be run independently of python. This gives us the ability to run models in environments where python may be at a disadvantage in terms of performance.
+
+- Try optimized inference engines such as onnxruntime, tensorRT, lightseq, ctranslate-2, etc. These engines often provide additional optimizations such as operator fusion, in addition to model quantization.
+
+- Try model distillation. This is more involved and often requires training data, but the potential gain can be large. For example, MiniLM achieves 99% of the accuracy of the original BERT base model while being 2X faster.
+
+- Try task parallelism. Python’s GIL could affect effective multithreading, even for external native code. For a system with 32 vCPUs, two inference sessions each with 16 threads often have higher throughput than a single inference session with 32 threads. When testing multiple sessions, it is important to set torch.num_threads properly to avoid CPU contention.
+
+- For batch processing on sequences with different lengths, sequence bucketing could potentially improve the throughput by 2X. In this case, a simple implementation of sequence bucketing is to sort all inputs by sequence length before feeding them to the model, as this reduces unnecessary padding when batching the sequences.
+
+While this checklist is not exhaustive, going through the items will likely help you squeeze more performance out of your model inference pipeline.

diff --git a/docs/performance_guide.md b/docs/performance_guide.md
index d72c31a4f2..6804f82d26 100644
--- a/docs/performance_guide.md
+++ b/docs/performance_guide.md
@@ -1,6 +1,8 @@
# [Performance Guide](#performance-guide)
In case you're interested in optimizing the memory usage, latency or throughput of a PyTorch model served with TorchServe, this is the guide for you.
+We have also created a quick checklist here for extra things to try outside of what is covered on this page. You can find the checklist [here](performance_checklist.md).
+
## Optimizing PyTorch
There are many tricks to optimize PyTorch models for production including but not limited to distillation, quantization, fusion, pruning, setting environment variables and we encourage you to benchmark and see what works best for you.
@@ -92,3 +94,7 @@ Visit this [link]( https://github.com/pytorch/kineto/tree/main/tb_plugin) to lea

TorchServe on the Animated Drawings App

For some insight into fine tuning TorchServe performance in an application, take a look at this [article](https://pytorch.org/blog/torchserve-performance-tuning/). The case study shown here uses the Animated Drawings App from Meta to improve TorchServe Performance.
+
+Performance Checklist
+
+We have also created a quick checklist here for extra things to try outside of what is covered on this page. You can find the checklist [here](performance_checklist.md).

From 1949ce4d4c2294193a471746ec9e2f5a6ca2a75f Mon Sep 17 00:00:00 2001
From: sekyonda <127536312+sekyondaMeta@users.noreply.github.com>
Date: Tue, 15 Aug 2023 15:23:25 -0400
Subject: [PATCH 2/5] Spelling update
---
 docs/performance_checklist.md           |  2 +-
 ts_scripts/spellcheck_conf/wordlist.txt | 12 ++++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/docs/performance_checklist.md b/docs/performance_checklist.md
index c2fc44c6d7..89fcf3a2fe 100644
--- a/docs/performance_checklist.md
+++ b/docs/performance_checklist.md
@@ -27,7 +27,7 @@ Start model inference optimization only after other factors, the “low-hanging

- Balance throughput and latency with smart batching. While meeting the latency SLA try larger batch sizes to increase the throughput.

-- Try [torchscript](https://pytorch.org/docs/stable/jit.html), [inference_mode](https://pytorch.org/docs/stable/generated/torch.inference_mode.html), and [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html). Torschscript provides tools that incrementally transition models from purely python to a torchscript program which can be run independently of python. This gives us the ability to run models in environments where python may be at a disadvantage in terms of performance.
+- Try [torchscript](https://pytorch.org/docs/stable/jit.html), [inference_mode](https://pytorch.org/docs/stable/generated/torch.inference_mode.html), and [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html). Torchscript provides tools that incrementally transition models from purely python to a torchscript program which can be run independently of python. This gives us the ability to run models in environments where python may be at a disadvantage in terms of performance.

- Try optimized inference engines such as onnxruntime, tensorRT, lightseq, ctranslate-2, etc. These engines often provide additional optimizations such as operator fusion, in addition to model quantization.
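The torchscript / inference_mode suggestion touched by the hunk above can look roughly like the following sketch, separate from the patch content; `TinyNet` is a placeholder module:

```python
# Rough sketch: script a model, optimize it for inference, and run it with
# autograd bookkeeping disabled via torch.inference_mode().
import torch

class TinyNet(torch.nn.Module):  # placeholder model
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 10)

    def forward(self, x):
        return self.fc(x)

model = TinyNet().eval()
scripted = torch.jit.script(model)                     # Python -> TorchScript
scripted = torch.jit.optimize_for_inference(scripted)  # freezes and fuses where possible

with torch.inference_mode():
    out = scripted(torch.randn(1, 128))
```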
diff --git a/ts_scripts/spellcheck_conf/wordlist.txt b/ts_scripts/spellcheck_conf/wordlist.txt
index 902439747a..91ea9dc273 100644
--- a/ts_scripts/spellcheck_conf/wordlist.txt
+++ b/ts_scripts/spellcheck_conf/wordlist.txt
@@ -1068,3 +1068,15 @@ chatGPT
baseimage
cuDNN
Xformer
+MiniLM
+SIMD
+SLA
+htop
+jpeg
+libjpeg
+lightseq
+multithreading
+onnxruntime
+pipelining
+tensorRT
+utilizations

From 31369682582f73efb64bd0f67de128d192b65375 Mon Sep 17 00:00:00 2001
From: sekyonda <127536312+sekyondaMeta@users.noreply.github.com>
Date: Tue, 15 Aug 2023 15:25:06 -0400
Subject: [PATCH 3/5] Update wordlist.txt
---
 ts_scripts/spellcheck_conf/wordlist.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ts_scripts/spellcheck_conf/wordlist.txt b/ts_scripts/spellcheck_conf/wordlist.txt
index 91ea9dc273..1f9cec26d9 100644
--- a/ts_scripts/spellcheck_conf/wordlist.txt
+++ b/ts_scripts/spellcheck_conf/wordlist.txt
@@ -1080,3 +1080,4 @@ onnxruntime
pipelining
tensorRT
utilizations
+ctranslate

From 3279a38b9118e559ad3826f1ccf9b7fb073c8400 Mon Sep 17 00:00:00 2001
From: sekyonda <127536312+sekyondaMeta@users.noreply.github.com>
Date: Wed, 16 Aug 2023 13:32:21 -0400
Subject: [PATCH 4/5] Update performance_checklist.md
---
 docs/performance_checklist.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/docs/performance_checklist.md b/docs/performance_checklist.md
index 89fcf3a2fe..d984ed37e4 100644
--- a/docs/performance_checklist.md
+++ b/docs/performance_checklist.md
@@ -23,17 +23,15 @@ Start model inference optimization only after other factors, the “low-hanging

- Use fp16 for GPU inference. The speed will most likely more than double on newer GPUs with tensor cores, with negligible accuracy degradation. Technically fp16 is a type of quantization but since it seldom suffers from loss of accuracy for inference it should always be explored. As shown in this [article](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#abstract), use of fp16 offers a speed-up in large neural network applications.

-- Use model quantization (i.e., int8) for CPU inference. Explore different quantization options: dynamic quantization, static quantization, and quantization aware training, as well as tools such as Intel Neural Compressor that provide more sophisticated quantization methods.
+- Use model quantization (i.e. int8) for CPU inference. Explore different quantization options: dynamic quantization, static quantization, and quantization aware training, as well as tools such as Intel Neural Compressor that provide more sophisticated quantization methods. It is worth noting that quantization comes with some loss in accuracy and might not always offer a significant speed-up on some hardware, so it might not always be the right approach.

- Balance throughput and latency with smart batching. While meeting the latency SLA try larger batch sizes to increase the throughput.

-- Try [torchscript](https://pytorch.org/docs/stable/jit.html), [inference_mode](https://pytorch.org/docs/stable/generated/torch.inference_mode.html), and [optimize_for_inference](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html). Torchscript provides tools that incrementally transition models from purely python to a torchscript program which can be run independently of python. This gives us the ability to run models in environments where python may be at a disadvantage in terms of performance.
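A minimal sketch, separate from the patch content, of the dynamic int8 quantization option described in the updated bullet above, including a rough check of the accuracy cost; the two-layer model is a placeholder:

```python
# Rough sketch: dynamically quantize Linear layers to int8 for CPU inference and
# compare outputs against the fp32 model to gauge the accuracy impact.
import torch
from torch.ao.quantization import quantize_dynamic

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 256)
max_diff = (model(x) - quantized(x)).abs().max().item()
print("max abs difference vs fp32:", max_diff)
```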
- Try optimized inference engines such as onnxruntime, tensorRT, lightseq, ctranslate-2, etc. These engines often provide additional optimizations such as operator fusion, in addition to model quantization.

- Try model distillation. This is more involved and often requires training data, but the potential gain can be large. For example, MiniLM achieves 99% of the accuracy of the original BERT base model while being 2X faster.

-- Try task parallelism. Python’s GIL could affect effective multithreading, even for external native code. For a system with 32 vCPUs, two inference sessions each with 16 threads often have higher throughput than a single inference session with 32 threads. When testing multiple sessions, it is important to set torch.num_threads properly to avoid CPU contention.
+- If working on CPU, you can try core pinning. You can find more information on how to work with this [in this blog post](https://pytorch.org/tutorials/intermediate/torchserve_with_ipex#grokking-pytorch-intel-cpu-performance-from-first-principles).

- For batch processing on sequences with different lengths, sequence bucketing could potentially improve the throughput by 2X. In this case, a simple implementation of sequence bucketing is to sort all inputs by sequence length before feeding them to the model, as this reduces unnecessary padding when batching the sequences.

From 465bc2ba9e7644fdfa28466de5b053a40f2c7da4 Mon Sep 17 00:00:00 2001
From: sekyonda <127536312+sekyondaMeta@users.noreply.github.com>
Date: Thu, 24 Aug 2023 15:56:33 -0400
Subject: [PATCH 5/5] Update README.md
---
 kubernetes/AKS/README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kubernetes/AKS/README.md b/kubernetes/AKS/README.md
index 4948e10f14..99b6074fe4 100644
--- a/kubernetes/AKS/README.md
+++ b/kubernetes/AKS/README.md
@@ -291,7 +291,7 @@ az group delete --name myResourceGroup --yes --no-wait
```
## Troubleshooting
-
+
**Troubleshooting Azure Cli login**
@@ -299,11 +299,11 @@ az group delete --name myResourceGroup --yes --no-wait
Otherwise, open a browser page at https://aka.ms/devicelogin and enter the authorization code displayed in your terminal. If no web browser is available or the web browser fails to open, use device code flow with az login --use-device-code. Or you can login with your credential in command line, more details, see https://docs.microsoft.com/en-us/cli/azure/authenticate-azure-cli.
-
+
**Troubleshooting Azure resource for AKS cluster creation**
-
- * Check AKS available region, https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?products=kubernetes-service
+
+ * Check AKS available region, https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/
* Check AKS quota and VM size limitation, https://docs.microsoft.com/en-us/azure/aks/quotas-skus-regions
* Check whether your subscription has enough quota to create AKS cluster, https://docs.microsoft.com/en-us/azure/networking/check-usage-against-limits
-
+
**For more AKS troubleshooting, please visit https://docs.microsoft.com/en-us/azure/aks/troubleshooting**
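Closing with a small sketch, separate from the patches above, of the sequence-bucketing item from the checklist: sort inputs by length before batching so each batch carries minimal padding. The helper name and the `pad_id` default are illustrative assumptions:

```python
# Rough sketch of sequence bucketing: sort inputs by length, then batch, so the
# padding within each batch stays small; indices are yielded so results can be
# restored to the original input order.
from typing import Iterator, List, Tuple

def bucketed_batches(
    token_ids: List[List[int]], batch_size: int, pad_id: int = 0
) -> Iterator[Tuple[List[int], List[List[int]]]]:
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(token_ids[i]) for i in idx)
        padded = [token_ids[i] + [pad_id] * (max_len - len(token_ids[i])) for i in idx]
        yield idx, padded
```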