Fix application hang when network is lost during QoS0 publish loop #1006

Merged
merged 2 commits into from
Jun 24, 2020

@aggarg (Member) commented Jun 21, 2020

Description

mbedTLS uses the sockets API for network communication when running on a Linux platform. Application data passed to the socket send API is not transmitted immediately: the call copies the data into an internal buffer of the TCP stack and returns success to the application. The TCP stack transmits the data later and frees the internal buffer only when it receives the TCP ACK confirming receipt from the other end.

When the network connection is lost, the TCP stack cannot send any data over the network and stops receiving ACKs from the other end. As a result, if the application continues to send data, the TCP stack's internal buffers keep filling up, because no buffer is freed by received ACKs. Note that the socket send API continues to return success to the application even though the data is not actually being sent. When all the TCP internal buffers are full, the socket send API will:

  • Either block forever, if the socket is blocking.
  • Or return error if the socket is non-blocking or a send timeout is set using SO_SNDTIMEO.
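
The second case can be reproduced locally with a rough sketch like the one below. It assumes a Linux/POSIX environment and uses a local socket pair whose peer never reads as a stand-in for a lost network (a real TCP connection differs in buffer sizes and timing). The helper names `send_until_failure` and `demo_send_timeout` are hypothetical and not part of mbedTLS or the SDK.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Sends fixed-size chunks on fd until send() fails.
 * Returns the number of bytes the kernel accepted and stores the
 * errno of the failing call in *last_errno. */
static long send_until_failure(int fd, int *last_errno)
{
    char chunk[1024];
    long accepted = 0;

    memset(chunk, 'x', sizeof chunk);
    for (;;) {
        ssize_t n = send(fd, chunk, sizeof chunk, 0);
        if (n < 0) {
            *last_errno = errno;
            return accepted;
        }
        accepted += n;  /* success only means "copied into the buffer" */
    }
}

/* Hypothetical demo: the peer end is never read, so the send buffer
 * fills up. With SO_SNDTIMEO set, the blocking send() gives up after
 * the timeout instead of blocking forever. */
static long demo_send_timeout(int *last_errno)
{
    int sv[2];
    struct timeval tv = { .tv_sec = 0, .tv_usec = 200000 };  /* 200 ms */
    long accepted;
    int rc;

    rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    rc = setsockopt(sv[0], SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
    assert(rc == 0);
    (void)rc;

    accepted = send_until_failure(sv[0], last_errno);

    close(sv[0]);
    close(sv[1]);
    return accepted;
}
```

Calling `demo_send_timeout(&err)` should report that a number of bytes were accepted with success before the first failing call, which mirrors the gap between T2 and T3 in the diagram below.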

Look at the following diagram:

             --------------------------------------------------
             ^           ^               ^                   ^
             |           |               |                   |
             |           |               |                   |
             +           +               +                   +
            T0          T1              T2                  T3
          Start      Start QoS0    Network Lost           TCP Queue
        Connection  Publish Loop                            Full

In the above diagram, the network connection is lost at time T2 but the application finds out only at a later time T3 when the TCP internal buffers are full.

By default, the underlying socket in mbedTLS is blocking. As a result, an application that publishes QoS0 messages in a loop may hit the condition above and appear to hang. mbedTLS provides an API, mbedtls_net_set_nonblock, that sets the underlying socket to non-blocking mode, so the application is notified of the failed send instead of hanging forever.
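
As an illustration of the effect, here is a minimal sketch that uses plain POSIX calls rather than mbedTLS, so that it stays self-contained. It assumes that on POSIX systems switching a socket to non-blocking mode amounts to setting O_NONBLOCK with fcntl. The names `set_nonblocking`, `publish_loop`, and `demo_nonblocking_publish` are hypothetical, and a socket pair with an idle peer stands in for the broken connection.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Puts fd into non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) ? -1 : 0;
}

/* Hypothetical stand-in for a QoS0 publish loop. Sends up to
 * max_messages fixed-size messages and stops at the first failed
 * send(). Returns the number of messages accepted; *failed_errno
 * stays 0 if no send failed. Partial writes are counted as accepted
 * here; a real client would have to handle them. */
static int publish_loop(int fd, int max_messages, int *failed_errno)
{
    char msg[512];
    int sent = 0;

    memset(msg, 'm', sizeof msg);
    *failed_errno = 0;
    while (sent < max_messages) {
        if (send(fd, msg, sizeof msg, 0) < 0) {
            *failed_errno = errno;  /* buffer full: report, do not hang */
            break;
        }
        sent++;
    }
    return sent;
}

/* Runs the loop against a peer that never reads. */
static int demo_nonblocking_publish(int max_messages, int *failed_errno)
{
    int sv[2];
    int sent;
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    assert(rc == 0);
    rc = set_nonblocking(sv[0]);
    assert(rc == 0);
    (void)rc;

    sent = publish_loop(sv[0], max_messages, failed_errno);
    close(sv[0]);
    close(sv[1]);
    return sent;
}
```

With a blocking socket the same loop would stall inside send() once the buffer is full; in non-blocking mode it returns control to the caller with an error code that the application can act on.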

This change adds a config parameter AWS_IOT_MQTT_SOCKET_NON_BLOCKING which can be defined in the aws_iot_config.h file to set the underlying socket as non-blocking.
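
A sketch of how a compile-time switch of this kind could work is shown below. Only the macro name comes from this PR; the function names are hypothetical, the socket mode is set with POSIX fcntl instead of the mbedTLS call to keep the example self-contained, and the way the SDK actually consumes the macro may differ.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Would normally live in aws_iot_config.h; remove to keep blocking mode. */
#define AWS_IOT_MQTT_SOCKET_NON_BLOCKING

/* Hypothetical helper: selects the socket mode at compile time.
 * Returns 0 on success, -1 on error. */
static int configure_socket_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
#ifdef AWS_IOT_MQTT_SOCKET_NON_BLOCKING
    flags |= O_NONBLOCK;
#else
    flags &= ~O_NONBLOCK;
#endif
    return (fcntl(fd, F_SETFL, flags) < 0) ? -1 : 0;
}

/* Returns 1 if a freshly configured socket is non-blocking, 0 if it
 * is blocking, -1 on error. */
static int demo_configured_mode(void)
{
    int sv[2];
    int flags;
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    assert(rc == 0);
    (void)rc;
    if (configure_socket_mode(sv[0]) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    flags = fcntl(sv[0], F_GETFL, 0);
    close(sv[0]);
    close(sv[1]);
    if (flags < 0) {
        return -1;
    }
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```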

The application should use QoS1 to detect broken connections quickly, rather than relying on a failed send from the TCP stack, which depends on the number of internal buffers in the TCP stack, the network load, and so on. If the application must use QoS0 and only needs to detect a broken connection eventually, the newly added option AWS_IOT_MQTT_SOCKET_NON_BLOCKING can be used.
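
To show why QoS1 gives faster detection: every QoS1 PUBLISH is acknowledged with a PUBACK, so a missing acknowledgement within a deadline is a direct signal, independent of how much the TCP buffers can absorb. The sketch below illustrates that idea only. The `PendingPublish` type and all function names are hypothetical and not part of the SDK, which has its own handling for acknowledgements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one in-flight QoS1 publish. */
typedef struct {
    bool     awaiting_ack;
    uint16_t packet_id;
    uint32_t sent_at_ms;
} PendingPublish;

/* Call after a QoS1 PUBLISH has been handed to the socket. */
static void on_publish_sent(PendingPublish *p, uint16_t id, uint32_t now_ms)
{
    assert(p != NULL);
    p->awaiting_ack = true;
    p->packet_id = id;
    p->sent_at_ms = now_ms;
}

/* Call when a PUBACK arrives; clears the record if the id matches. */
static void on_puback(PendingPublish *p, uint16_t id)
{
    assert(p != NULL);
    if (p->awaiting_ack && p->packet_id == id) {
        p->awaiting_ack = false;
    }
}

/* True if an acknowledgement is overdue. Unsigned subtraction keeps
 * the comparison correct across wrap-around of the millisecond tick. */
static bool connection_suspect(const PendingPublish *p,
                               uint32_t now_ms, uint32_t timeout_ms)
{
    assert(p != NULL);
    return p->awaiting_ack &&
           (uint32_t)(now_ms - p->sent_at_ms) >= timeout_ms;
}
```

The detection delay here is bounded by the chosen timeout, whereas with QoS0 it depends on when the send buffers happen to fill.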

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Gaurav Aggarwal <aggarg@amazon.com>

@abhidixi11 (Contributor) left a comment

The quality check error is "Can't find spellcheck script, exiting."; maybe the path is incorrect, but I don't see any problem with this PR.

@nateglims (Member) commented

Hi, please ignore the failing checks. This target branch was mistakenly allowed in CI intended for the development branch.

@abhidixi11 abhidixi11 merged commit 237c571 into aws:master Jun 24, 2020
@aggarg aggarg deleted the qos0_publish_loop_hang branch June 26, 2020 00:02