Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disconnects and handling within library and samples #33

Closed
tim-nordell-nimbelink opened this issue Jul 27, 2016 · 6 comments
Closed

Disconnects and handling within library and samples #33

tim-nordell-nimbelink opened this issue Jul 27, 2016 · 6 comments

Comments

@tim-nordell-nimbelink
Copy link

I've been testing the disconnect path by pointing to a custom MQTT broker and killing the broker on the server while it's connected. With the shadow_sample, this results in the sample exiting without reconnecting despite the reconnect flag being set within the sample. (It calls
aws_iot_shadow_set_autoreconnect_status()). The output from the sample is as follows:

=======================================================================================

On Device: window state false
Update Shadow: {"state":{"reported":{"temperature":31.500000,"windowOpen":false}}, "clientToken":"c-sdk-client-id-12"}
ERROR: iot_tls_write L#274  failed
  ! mbedtls_ssl_write returned -0x50


*****************************************************************************************

ERROR: main L#261 An error occurred in the loop -1
Disconnecting
ERROR: iot_tls_write L#274  failed
  ! mbedtls_ssl_write returned -0x50


ERROR: main L#268 Disconnect error -1

The code path inside _aws_iot_mqtt_internal_yield() needs to handle the NETWORK_DISCONNECTED_ERROR error for the aws_iot_mqtt_internal_cycle_read() call, too, which is why the example application does not reconnect. The suggested change for this point is to do:

                }

                yieldRc = aws_iot_mqtt_internal_cycle_read(pClient, &timer, &packet_type);
-               if(SUCCESS != yieldRc) {
-                       break;
-               }

-               yieldRc = _aws_iot_mqtt_keep_alive(pClient);
+               if(SUCCESS == yieldRc)
+                       yieldRc = _aws_iot_mqtt_keep_alive(pClient);
+

which would pass down the NETWORK_DISCONNECTED_ERROR down from the aws_iot_mqtt_internal_cycle_read() call into the handling block for this.

After making this change, however, this exposes that the code below doesn't state transition to disconnected before the client set state call just below. To fix this, I've changed the above to:

diff --git a/src/aws_iot_mqtt_client_yield.c b/src/aws_iot_mqtt_client_yield.c
index b86a785..e80916d 100644
--- a/src/aws_iot_mqtt_client_yield.c
+++ b/src/aws_iot_mqtt_client_yield.c
@@ -201,12 +201,13 @@ static IoT_Error_t _aws_iot_mqtt_internal_yield(AWS_IoT_Client *pClient, uint32_
                }

                yieldRc = aws_iot_mqtt_internal_cycle_read(pClient, &timer, &packet_type);
-               if(SUCCESS != yieldRc) {
-                       break;
-               }

-               yieldRc = _aws_iot_mqtt_keep_alive(pClient);
+               if(SUCCESS == yieldRc)
+                       yieldRc = _aws_iot_mqtt_keep_alive(pClient);
+
                if(NETWORK_DISCONNECTED_ERROR == yieldRc) {
+                       if(clientState != CLIENT_STATE_DISCONNECTED_ERROR)
+                               _aws_iot_mqtt_handle_disconnect(pClient);
                        pClient->clientData.counterNetworkDisconnected++;
                        if(1 == pClient->clientStatus.isAutoReconnectEnabled) {
                                yieldRc = aws_iot_mqtt_set_client_state(pClient, CLIENT_STATE_DISCONNECTED_ERROR,

I'm still digging through the issues in the codebase around the reconnects to understand how the codebase is supposed to fit together, but thought I'd open this issue as it's quite evident with the shadow_sample and closing the connection from the server side.

The above doesn't work by itself as it exposes the following issue with the codebase, too, relating to reconnecting:

The function call path:

 -> _aws_iot_mqtt_handle_disconnect()
 -> aws_iot_mqtt_disconnect()
 -> _aws_iot_mqtt_internal_disconnect
 -> pClient->networkStack.destroy() (which points to iot_tls_destroy())

is problematic since the only corresponding iot_tls_init() call is within the aws_iot_mqtt_init() call. An example problem point is within _aws_iot_mqtt_keep_alive() which invokes this path, and returns expecting the caller to attempt to reconnect, but technically the caller will be operating on a destroy'd handle.)

@chaurah
Copy link
Contributor

chaurah commented Jul 28, 2016

Hi @tim-nordell-nimbelink,
You bring up a valid concern here. We were able to replicate the issue on our end. The return paths from aws_iot_mqtt_internal_cycle_read need to be analyzed properly to make sure this issue doesn't occur. The client should transition to disconnected with error state if tls read and write fail. We will go through the code again to ensure that behavior is handled correctly. We have included some Integration tests along with the SDK, one of which also tests for Auto-reconnect. Those might be helpful to you as you go through the code.

The second problem that you mention will not occur. Yes, iot_tls_destroy will destroy the handles. But they are reinitialized in iot_tls_connect as opposed to iot_tls_init. As I remember it, the reasoning here was that if I didn't destroy the handles, reconnect failed with mbedtls giving segfaults. This was the cleanest way to make reconnect work without causing issues with mbedtls.

We will update the issue as soon as we have figured out a solution. These are really helpful suggestions. If you discover anything else while you are going through the code please do let us know. Thank you for using AWS IoT.

Rahul

@tim-nordell-nimbelink
Copy link
Author

Rahul -

I discovered this while trying to port the library to an embedded environment running a STM32L4xx series microcontroller, and verified the behavior was the same in the shadow example I had ported.

The second problem is a problem according to the documented interfaces, and would be unanticipated behavior by someone porting the system. (I hit this when porting it.) The iot_tls_init(..) documentation states:

 * Perform any initialization required by the TLS layer.
 * Connects the interface to implementation by setting up
 * the network layer function pointers to platform implementations.

And iot_tls_destroy(...) states:

/**
 * @brief Perform any tear-down or cleanup of TLS layer
 *
 * Called to cleanup any resources required for the TLS layer.

The functions feel like they're paired up as:

  • iot_tls_init(...)/iot_tls_destroy(...)
  • iot_tls_connect(...)/iot_tls_disconnect(...)
  • iot_tls_read(...)/iot_tls_write(...)

Since the iot_tls_init(...) call is akin to a constructor, and the iot_tls_destroy(...) call is akin to a destructor, one shouldn't expect any of the other calls to the module to be called after iot_tls_destroy(...). I certainly didn't, and it caused a null dereference in my porting attempt. I see now that the sample platform port behaves differently from the anticipated behavior from purely looking at network_interface.h. If this is expected behavior from a given port, please add a note in the documentation for the iot_tls_destroy(...) function that one can expect iot_tls_connect(...) to be called after without another iot_tls_init(...) call.

Going back to the reconnect issue, the codebase is inconsistent with how it returns the error codes from iot_tls_read(...) and iot_tls_write(...). 3 of the invocations of iot_tls_read(...) will return back out the error code returned by that function call out to the callees of the parent functions - one always returns FAILURE whenever iot_tls_read(...) returns a failure of any sort. The only invocation of iot_tls_write(...) returns FAILURE from its function when an error had occurred. This inconsistency probably doesn't help with the disconnect being reported from within the iot_tls_write(...) or iot_tls_read(...) calls. As feedback, it'd be a good idea to document the expected return values from iot_tls_xxx() calls, and under what circumstances each error should be reported.

Side note: The sample platform implementation doesn't return a NETWORK_DISCONNECTED_ERROR from the iot_tls_write/iot_tls_read calls upon a disconnect, but it seems like some of the codebase will attempt to react to this if they do return this, but not all of the code paths.

-- Tim

@tim-nordell-nimbelink
Copy link
Author

Further note about the interfaces for iot_tls:

If I had been someone just looking at the interface by itself, I would have assumed that the following is a valid usage:

iot_tls_init(...);
iot_tls_connect(...);
...
iot_tls_read(...);
iot_tls_write(...);
...
iot_tls_disconnect(...);
iot_tls_connect(...);
...
iot_tls_read(...);
iot_tls_write(...);
...
iot_tls_disconnect(...);
iot_tls_destroy(...);

E.g. being able to disconnect and reconnect without invoking destroy(...), and without leaking. The sample implementation will probably leak with the above.

-- Tim

@tim-nordell-nimbelink
Copy link
Author

tim-nordell-nimbelink commented Jul 28, 2016

I've created a pull request for fixing this. I haven't thoroughly tested it, although I have verified the existing auto reconnect unit test still passes. I did not create a test case for what this fixes in the unit tests (although that'd be a good idea to do).

Each commit is changing a single chunk of logic in the code so that it's hopefully easier to follow - it shouldn't create any commit by itself that won't work.

@chaurah
Copy link
Contributor

chaurah commented Jul 29, 2016

Hi Tim,
This is great. We incorporate pull request by making the changes on our side and including them in the next release. We will definitely include this change. I did test the pull request and it is a great starting point for making the TLS layer interactions more robust. You are correct about the documentation issues with the tls implementation as well. We will add further documentation to the porting guide as well as the API docs to make the expected usage clearer.

Please feel free to point out any further issues you notice. This is extremely helpful information.

Rahul

@chaurah
Copy link
Contributor

chaurah commented Sep 19, 2016

Fixed in v2.1.1 of the SDK. Closing.
If there are any further issues please let me know.

Rahul

@chaurah chaurah closed this as completed Sep 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants