Commit

Merge remote-tracking branch 'upstream/master' into tj/plugin/template/test/compare-tensor-experimental_detectron_detection_prior_grid
t-jankowski committed May 22, 2024
2 parents 086837e + 0d95325 commit 6b31365
Showing 171 changed files with 5,329 additions and 809 deletions.
2 changes: 1 addition & 1 deletion .github/dockerfiles/docker_tag
@@ -1 +1 @@
pr-24395
pr-24598
@@ -0,0 +1,84 @@
.. {#openvino_docs_ops_internal_RMS}

RMS
===


.. meta::
   :description: Learn about RMS, a normalization operation.

**Versioned name**: *RMS*

**Category**: *Normalization*

**Short description**: Calculates Root Mean Square (RMS) normalization of the input tensor.

**Detailed description**

The *RMS* operation performs Root Mean Square (RMS) normalization on a given input ``data`` along the last dimension of the input.
See the `reference paper <https://arxiv.org/abs/1910.07467>`__.


.. code-block:: py

   (x / Sqrt(ReduceMean(x^2, -1) + eps)) * scale
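
For illustration, a minimal NumPy sketch of the same computation (not part of the OpenVINO
operation set; the ``rms_norm`` name and the sample shapes are chosen here only for the example):

.. code-block:: py

   import numpy as np

   def rms_norm(x, scale, eps=1e-6):
       # Mean of squares over the last dimension, kept for broadcasting.
       mean_square = np.mean(np.square(x), axis=-1, keepdims=True)
       # Normalize and apply the scale.
       return x / np.sqrt(mean_square + eps) * scale

   data = np.random.rand(12, 25, 512).astype(np.float32)
   scale = np.ones(512, dtype=np.float32)
   normalized = rms_norm(data, scale)  # same shape as data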


**Attributes**

* *epsilon*

  * **Description**: A very small value added to the mean of the squared input values for numerical stability. Ensures that division by zero does not occur for any normalized element.
  * **Range of values**: a positive floating-point number
  * **Type**: ``float``
  * **Required**: *yes*

* *output_type*

  * **Description**: The precision for output type conversion, applied after scaling. It is used to compress the output type to ``f16``.
  * **Range of values**: supported floating-point types: "f16", "undefined"
  * **Type**: ``string``
  * **Default value**: "undefined" (the output type is the same as the input type)
  * **Required**: *no*


**Inputs**

* **1**: ``data`` - Input data to be normalized. A tensor of type *T* and arbitrary shape. **Required.**

* **2**: ``scale`` - A tensor of type *T* containing the scale values. Its shape should be broadcastable to the shape of the ``data`` tensor. **Required.**


**Outputs**

* **1**: Output tensor of the same shape as the ``data`` input tensor and of the type specified by the *output_type* attribute.

**Types**

* *T*: any floating point type.

**Example**

.. code-block:: xml
   :force:

   <layer ... type="RMS"> <!-- normalization always over the last dimension [-1] -->
       <data eps="1e-6"/>
       <input>
           <port id="0">
               <dim>12</dim>
               <dim>25</dim>
               <dim>512</dim>
           </port>
           <port id="1">
               <dim>512</dim>
           </port>
       </input>
       <output>
           <port id="2">
               <dim>12</dim>
               <dim>25</dim>
               <dim>512</dim>
           </port>
       </output>
   </layer>
@@ -101,6 +101,11 @@ On platforms that natively support half-precision calculations (``bfloat16`` or
of ``f32`` to achieve better performance (see the `Execution Mode Hint <#execution-mode-hint>`__).
Thus, no special steps are required to run a model with ``bf16`` or ``f16`` inference precision.

.. important::

   The ``bf16`` floating-point precision has some limitations that may impact inference
   accuracy in LLMs. For more details, refer to this :ref:`article <limited_inference_precision>`.

Using half-precision provides the following performance benefits:

- ``bfloat16`` and ``float16`` data types enable Intel® Advanced Matrix Extension (AMX) on 4+ generation Intel® Xeon® Scalable Processors, resulting in significantly faster computations on the corresponding hardware compared to AVX512 or AVX2 instructions in many deep learning operation implementations.
@@ -18,18 +18,40 @@ of the weights, and it does not affect how the devices execute the model. This c
a lot of confusion where, for example, you couldn't execute a high-performance model on the GPU
by default, and the behavior between devices was different.

This guide focuses on how to control inference precision. Using lower precision is
important for performance because compute bandwidth tends to be higher for smaller data
types, and hardware often has special blocks for efficient multiply-accumulate operations
with smaller data types only (e.g. Intel Xᵉ Matrix Extensions (XMX) on GPU and Intel
Advanced Matrix Extensions (AMX) on CPU do not support ``f32``). Also, I/O operations
require less memory due to the smaller tensor byte size.


Execution Mode
##############

``ov::hint::execution_mode`` is a high-level hint to control whether the user wants to keep
the best accuracy (**ACCURACY mode**) or if the device can do some optimizations that
may lower the accuracy for performance reasons (**PERFORMANCE mode**).

* In **ACCURACY mode**, the device cannot convert floating point tensors to a smaller
  floating point type, so devices try to keep the accuracy metrics as close as possible to
  the original values obtained after training, given the device's real capabilities.
  This means that most devices will infer with ``f32`` precision if your device supports it.
* In **PERFORMANCE mode**, the device can convert to smaller data types and apply other
  optimizations that may have some impact on accuracy rates, although we still try to
  minimize accuracy loss and may use mixed precision execution in some cases.

If the model has been quantized using
:doc:`OpenVINO optimization tools <../../model-optimization-guide/quantizing-models-post-training>`
or any other method, the quantized operators will be executed with the target integer
precision if the device has hardware acceleration for that type. For example, quantized
``int8`` primitives are executed with ``int8`` precision for both **ACCURACY** and
**PERFORMANCE modes** if the device provides higher compute bandwidth for 8-bit data types
compared to any available floating-point type. On the other hand, devices without hardware
acceleration for the ``int8`` data type can keep such operators in floating point precision,
and the exact floating point type will be affected by ``execution_mode`` and
``inference_precision`` properties.

Code examples:
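
A minimal Python sketch of setting the execution mode hint (a sketch only, assuming the
``openvino`` Python package and a hypothetical ``model.xml`` path):

.. code-block:: py

   import openvino as ov
   import openvino.properties.hint as hints

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical model path

   # Keep accuracy as close to the original trained model as possible.
   compiled_accuracy = core.compile_model(
       model, "CPU", {hints.execution_mode: hints.ExecutionMode.ACCURACY})

   # Allow the device to trade some accuracy for performance.
   compiled_performance = core.compile_model(
       model, "CPU", {hints.execution_mode: hints.ExecutionMode.PERFORMANCE})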

@@ -53,11 +75,43 @@
Inference Precision
###################

``ov::hint::inference_precision`` is a lower-level property that allows you to specify the
exact precision you want, but it is less portable. For example, the CPU supports ``f32``
inference precision and ``bf16`` on some platforms, the GPU supports ``f32`` and ``f16``,
so if you want an application that uses multiple devices, you have to handle all these
combinations manually or let OpenVINO do it automatically by using the higher-level
``execution_mode`` property. Another thing is that ``inference_precision`` is also a hint,
so the value provided is not guaranteed to be used by the Runtime (mainly in cases where
the current device does not have the required hardware capabilities).

.. note::

   All devices only support floating-point data types (``f32``, ``f16``, ``bf16``) as a value
   for the ``inference_precision`` attribute, because quantization cannot be done at runtime.
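
For example, a minimal Python sketch of requesting a specific inference precision per device
(a sketch, assuming the ``openvino`` Python package and a hypothetical ``model.xml`` path;
the request is only a hint and may be ignored if the hardware lacks the capability):

.. code-block:: py

   import openvino as ov
   import openvino.properties.hint as hints

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical model path

   # Request bf16 on CPU and f16 on GPU; each value is only a hint.
   compiled_cpu = core.compile_model(model, "CPU", {hints.inference_precision: ov.Type.bf16})
   compiled_gpu = core.compile_model(model, "GPU", {hints.inference_precision: ov.Type.f16})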


.. _limited_inference_precision:

Limitation of the ``bf16`` inference precision
++++++++++++++++++++++++++++++++++++++++++++++

It is important to mention that inferring FP16 and FP32 LLMs with the ``bf16`` runtime
precision may result in an accuracy loss higher than the pre-determined threshold of 0.5%.
A higher accuracy drop may occur when inferring the original PyTorch **dolly-v2-12b**,
**dolly-v2-3b**, and **gpt-neox-20b** models with ``bf16``, and is caused by the limited
precision representation of ``bf16``.

To solve the issue, you might use an INT8 model and force the FP32 inference precision.
The accuracy of an INT8 model with FP32 is nearly the same as that of an FP16 model with
``f32``. Additionally, selective FP32 execution of ops in the CPU plugin, together with
NNCF ``bf16`` calibration, could potentially mitigate the accuracy loss.
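
As an illustrative sketch, forcing ``f32`` inference precision for an INT8 model could look
like the following (assuming the ``openvino`` Python package and a hypothetical
``int8_model.xml`` path):

.. code-block:: py

   import openvino as ov
   import openvino.properties.hint as hints

   core = ov.Core()

   # Device-level setting: applies to all models subsequently compiled for CPU.
   core.set_property("CPU", {hints.inference_precision: ov.Type.f32})

   # Quantized int8 primitives still execute in int8 where the hardware
   # accelerates them; the f32 setting affects the floating-point operations.
   compiled = core.compile_model(core.read_model("int8_model.xml"), "CPU")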

However, the solutions mentioned above would, unfortunately, also result in a significant
performance drop during large batch size inference on machines with Intel AMX-BF16 (SPR).
In such cases, the fused multiply-add (FMA) instructions are used instead of AMX. Also,
in a compute-bound case, such as LLM batch inference/serving, these workarounds
would drastically reduce the throughput by more than 60%.



Additional Resources
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -29,7 +29,7 @@ pytest-metadata==1.11.0
py>=1.9.0
pytz==2022.7
pyyaml==6.0.1
requests==2.31.0
requests==2.32.0
six==1.15.0
snowballstemmer==2.1.0
soupsieve==2.2.1
3 changes: 1 addition & 2 deletions docs/sphinx_setup/_static/css/custom.css
@@ -31,8 +31,7 @@ a#wap_dns {display: none;}
background-repeat: no-repeat;
background-image: url("data:image/svg+xml;charset=utf8,%3Csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 64 64' aria-labelledby='openvino github' aria-describedby='openvino github' role='img' xmlns:xlink='http://www.w3.org/1999/xlink'%3E%3Ctitle%3EGithub%3C/title%3E%3Cdesc%3EA solid styled icon from Orion Icon Library.%3C/desc%3E%3Cpath d='M32 0a32.021 32.021 0 0 0-10.1 62.4c1.6.3 2.2-.7 2.2-1.5v-6c-8.9 1.9-10.8-3.8-10.8-3.8-1.5-3.7-3.6-4.7-3.6-4.7-2.9-2 .2-1.9.2-1.9 3.2.2 4.9 3.3 4.9 3.3 2.9 4.9 7.5 3.5 9.3 2.7a6.93 6.93 0 0 1 2-4.3c-7.1-.8-14.6-3.6-14.6-15.8a12.27 12.27 0 0 1 3.3-8.6 11.965 11.965 0 0 1 .3-8.5s2.7-.9 8.8 3.3a30.873 30.873 0 0 1 8-1.1 30.292 30.292 0 0 1 8 1.1c6.1-4.1 8.8-3.3 8.8-3.3a11.965 11.965 0 0 1 .3 8.5 12.1 12.1 0 0 1 3.3 8.6c0 12.3-7.5 15-14.6 15.8a7.746 7.746 0 0 1 2.2 5.9v8.8c0 .9.6 1.8 2.2 1.5A32.021 32.021 0 0 0 32 0z' fill='rgb(255, 255, 255)'%3E%3C/path%3E%3Cpath %3E%3C/path%3E%3C/svg%3E ");
}

svg path {
.fa-square-github path {
fill: none;
}

4 changes: 2 additions & 2 deletions samples/cpp/build_samples.sh
@@ -17,7 +17,7 @@ usage() {
exit 1
}

samples_type="$(basename "$(dirname "$(realpath "${BASH_SOURCE[0]}")")")"
samples_type="$(basename "$(dirname "$(realpath "${BASH_SOURCE:-$0}")")")"
samples_build_dir="$HOME/openvino_${samples_type}_samples_build"
sample_install_dir=""

@@ -55,7 +55,7 @@ error() {
}
trap 'error ${LINENO}' ERR

SAMPLES_SOURCE_DIR="$( cd "$( dirname "$(realpath "${BASH_SOURCE[0]}")" )" && pwd )"
SAMPLES_SOURCE_DIR="$( cd "$( dirname "$(realpath "${BASH_SOURCE:-$0}")" )" && pwd )"
printf "\nSetting environment variables for building samples...\n"

if [ -z "$INTEL_OPENVINO_DIR" ]; then
@@ -66,7 +66,7 @@ if [ -n "$selftest" ] ; then
echo "||"
echo "|| Test $image / '$opt'"
echo "||"
SCRIPT_DIR="$( cd "$( dirname "$(realpath "${BASH_SOURCE[0]}")" )" >/dev/null 2>&1 && pwd )"
SCRIPT_DIR="$( cd "$( dirname "$(realpath "${BASH_SOURCE:-$0}")" )" >/dev/null 2>&1 && pwd )"
docker run -it --rm \
--volume "${SCRIPT_DIR}":/scripts:ro,Z \
--volume yum-cache:/var/cache/yum \
2 changes: 1 addition & 1 deletion scripts/setupvars/setupvars.sh
@@ -10,7 +10,7 @@ abs_path () {
pwd -P
}

SCRIPT_DIR="$(abs_path "${BASH_SOURCE[0]}")" >/dev/null 2>&1
SCRIPT_DIR="$(abs_path "${BASH_SOURCE:-$0}")" >/dev/null 2>&1
INSTALLDIR="${SCRIPT_DIR}"
export INTEL_OPENVINO_DIR="$INSTALLDIR"

12 changes: 12 additions & 0 deletions src/bindings/c/docs/api_overview.md
@@ -181,8 +181,14 @@ typedef enum {
U1, //!< binary element type
U2, //!< u2 element type
U3, //!< u3 element type
U4, //!< u4 element type
U6, //!< u6 element type
U8, //!< u8 element type
U16, //!< u16 element type
@@ -193,6 +199,12 @@ typedef enum {
NF4, //!< nf4 element type
F8E4M3, //!< f8e4m3 element type
F8E5M3, //!< f8e5m2 element type
STRING, //!< string element type
} ov_element_type_e;
```

7 changes: 6 additions & 1 deletion src/bindings/c/include/openvino/c/ov_common.h
@@ -165,7 +165,8 @@ typedef enum {
/**
* @enum ov_element_type_e
* @ingroup ov_base_c_api
* @brief This enum contains codes for element type.
* @brief This enum contains codes for element type, which is aligned with ov::element::Type_t in
* src/core/include/openvino/core/type/element_type.hpp
*/
typedef enum {
UNDEFINED = 0U, //!< Undefined element type
@@ -181,14 +182,18 @@
I32, //!< i32 element type
I64, //!< i64 element type
U1, //!< binary element type
U2, //!< u2 element type
U3, //!< u3 element type
U4, //!< u4 element type
U6, //!< u6 element type
U8, //!< u8 element type
U16, //!< u16 element type
U32, //!< u32 element type
U64, //!< u64 element type
NF4, //!< nf4 element type
F8E4M3, //!< f8e4m3 element type
F8E5M3, //!< f8e5m2 element type
STRING, //!< string element type
} ov_element_type_e;

/**
1 change: 1 addition & 0 deletions src/bindings/c/src/common.h
@@ -209,5 +209,6 @@ struct mem_istream : virtual mem_stringbuf, std::istream {
};

char* str_to_char_array(const std::string& str);
ov_element_type_e find_ov_element_type_e(ov::element::Type type);
ov::element::Type get_element_type(ov_element_type_e type);
void dup_last_err_msg(const char* msg);
2 changes: 1 addition & 1 deletion src/bindings/c/src/ov_node.cpp
@@ -87,7 +87,7 @@ ov_status_e ov_port_get_element_type(const ov_output_const_port_t* port, ov_elem

try {
auto type = (ov::element::Type_t)port->object->get_element_type();
*tensor_type = (ov_element_type_e)type;
*tensor_type = find_ov_element_type_e(type);
}
CATCH_OV_EXCEPTIONS

8 changes: 6 additions & 2 deletions src/bindings/c/src/ov_tensor.cpp
@@ -19,16 +19,20 @@ const std::map<ov_element_type_e, ov::element::Type> element_type_map = {
{ov_element_type_e::I32, ov::element::i32},
{ov_element_type_e::I64, ov::element::i64},
{ov_element_type_e::U1, ov::element::u1},
{ov_element_type_e::U2, ov::element::u2},
{ov_element_type_e::U3, ov::element::u3},
{ov_element_type_e::U4, ov::element::u4},
{ov_element_type_e::U6, ov::element::u6},
{ov_element_type_e::U8, ov::element::u8},
{ov_element_type_e::U16, ov::element::u16},
{ov_element_type_e::U32, ov::element::u32},
{ov_element_type_e::U64, ov::element::u64},
{ov_element_type_e::NF4, ov::element::nf4},
{ov_element_type_e::F8E4M3, ov::element::f8e4m3},
{ov_element_type_e::F8E5M3, ov::element::f8e5m2}};
{ov_element_type_e::F8E5M3, ov::element::f8e5m2},
{ov_element_type_e::STRING, ov::element::string}};

inline ov_element_type_e find_ov_element_type_e(ov::element::Type type) {
ov_element_type_e find_ov_element_type_e(ov::element::Type type) {
for (auto iter = element_type_map.begin(); iter != element_type_map.end(); iter++) {
if (iter->second == type) {
return iter->first;
6 changes: 3 additions & 3 deletions src/bindings/js/docs/README.md
@@ -36,7 +36,7 @@
-DENABLE_WHEEL=OFF \
-DENABLE_PYTHON=OFF \
-DENABLE_INTEL_GPU=OFF \
-DCMAKE_INSTALL_PREFIX=../src/bindings/js/node/bin \
-DCMAKE_INSTALL_PREFIX="../src/bindings/js/node/bin" \
..
```
- Build the bindings:
@@ -58,11 +58,11 @@
- Run tests to make sure that **openvino-node** has been built successfully:
```bash
npm run test
```
```

## Usage

- Add the **openvino-node** package to your project by specifying it in **package.json**:
- Add the **openvino-node** package to your project by specifying it in **package.json**:
```json
"openvino-node": "file:*path-to-current-directory*"
```
