TLS support + peers protocol #59

wallin · 2018-02-26T05:00:00Z

Implement support for TLS + use new peers protocol

jhecking · 2018-02-26T07:38:18Z

Thank you! Haven't done a full review yet, but looks good at first glance.

Because of the dependency on the openssl gem, we would have to drop support for Ruby 2.2. But since official support for that version ends end of March anyway, that's probably ok.

wallin · 2018-02-26T20:30:12Z

Thanks @jhecking. Dropping Ruby 2.2 seems reasonable

I tried to align the tls parameters names with the python client
https://github.com/aerospike/aerospike-client-python/blob/master/doc/aerospike.rst

Let me know which ones would be prioritized to support in a first release.

jhecking

The TCP/TLS socket connection implementation looks good to me in principal. That said, these changes alone won't be enough to actually connect to a TLS enabled cluster.

The problem is that the Ruby client still uses the older "services" info protocol for node discovery. For TLS support, the newer "peers" protocol is required. The two protocols are similar and both serve the same purpose of allowing the client to discover all the nodes in the cluster. But only the "peers" protocol contains the necessary information the client needs to establish a secure connection, e.g. the port number on which the server accepts TLS connections.

The "peers" protocol also includes the server's "tlsname", which the client generally uses to validate the certificate presented by the server, rather than using the server's hostname/IP address.

To get an idea of what it would take to implement the "peers" protocol, you can take a look at the changes to the Java client, that were done to support TLS as well as the "peers" protocol: aerospike/aerospike-client-java@5b5b06b. The Ruby client was largely modelled after the Java client so the overall structure is quite similar.

jhecking · 2018-02-27T04:13:08Z

lib/aerospike/socket/ssl.rb

+
+      def set_cert(ctx, options)
+        if options[:cert_file]
+          ctx.cert = OpenSSL::X509::Certificate.new(File.open(options[:cert_file]))


Should we use SSLContext#add_certificate instead of the cert= and key= accessor methods?

Good catch. Will remove this deprecated setter in favor of add_certificate

jhecking · 2018-02-27T04:17:13Z

lib/aerospike/socket/ssl.rb

+
+      def set_cipher_suite(ctx, options)
+        if options[:cipher_suite]
+          # TODO(wallin)


I think we can just use SSLContext#ciphers=:

ctx.ciphers = options[:cipher_suite]

jhecking · 2018-02-27T04:23:40Z

lib/aerospike/socket/ssl.rb

+
+      def set_protocols(ctx, options)
+        if options[:protocols]
+          # TODO(wallin)


I don't think we can support a list of protocol versions. The best we can do with Ruby's OpenSSL library is to set a minimum and/or maximum protocol version using SSLContext#min_version and SSLContext#max_version.

Yes, makes sense

Also, we should default min_version to OpenSSL::SSL::TLS1_2_VERSION. (By default, the server only supports TLS v1.2.)

jhecking · 2018-02-27T04:29:52Z

lib/aerospike/socket/ssl.rb

+      attr_reader :context
+
+      def initialize(host, port, timeout, ssl_options)
+        @host, @port, @timeout = host, port, timeout


Should this initialization happen in the Connection class itself? Since it's the same between the TCP and SSL socket implementations and the accessor methods are defined in Connection class as well?

👍 I'll refactor this

jhecking · 2018-02-27T08:03:33Z

lib/aerospike/socket/ssl.rb

+      def verify_certificate!(socket)
+        return unless context.verify_mode == OpenSSL::SSL::VERIFY_PEER
+        return if OpenSSL::SSL.verify_certificate_identity(
+          socket.peer_cert, host


Should not use the server's host name here to verify the certificate presented by the server, but instead use a separate "tlsname". I'll comment more on this in the review summary.

wallin · 2018-02-27T16:45:16Z

Thanks @jhecking. Will have a look at the "peers" protocol, and make the appropriate changes.

jhecking · 2018-02-28T06:59:32Z

lib/aerospike/socket/ssl.rb

+      private
+
+      def create_context(options)
+        OpenSSL::SSL::SSLContext.new().tap do |ctx|


Another team member suggested, that it might be best if we just let the user pass in a pre-configured SSLContext instance, instead of creating one ourself and exposing a subset of it's config options through our own settings. I like that idea, as it gives the user full flexibility to configure the context as needed without us having to expose every single config option through our own settings. On the other hand it does require a bit more work on the user's part for a standard setup and means we can no longer set sensible defaults, like restricting the connection to TLS v1.2+.

So, how about we allow the user to pass in a pre-configured context optionally, and create one if not. E.g. just add another :context option to the ssl_options hash. If not provided, we continue to create a new context as you have already implemented. But in that case it would be ok to not expose more advanced options such as setting the desired cipher suites, etc.

What do you think?

@jhecking that's a great idea, to have both options. By allowing people to pass a custom context we'd potentially avoid a bunch of future requests asking for less common options. And as a lazy Ruby developer I like the simple option of just quickly enable SSL with sane defaults

In your experience, which options are the most commonly used by AS users? Anything besides the ones in this PR?

In your experience, which options are the most commonly used by AS users? Anything besides the ones in this PR?

I think we should keep it to the minimum needed to get a working connection to a standard server deployment, with sensible defaults for everything else. I would support:

:enable

:ca_file, :ca_path

:cert_file

:key_file, :key_file_pass_phrase

You can remove the :cipher_suite and :protocols options and just default SSLContext#min_version= to OpenSSL::SSL::TLS1_2_VERSION.

wallin · 2018-03-05T01:57:50Z

@jhecking I've come a bit on the way on the peer protocol. Could you just provide me with a couple of examples of, more or less complex, responses from the peer command? I tried looking for test cases for examples but couldn't find any. Some examples with combinations of multiple hosts, tls and no tls, IPv6, IPv4, would be great so I can create a complete parser

jhecking · 2018-03-05T03:22:58Z

That's good to hear!

Here are a couple of example responses from a 3 node cluster. For reference, this is what the server side network services config looks like:

service {
    address 192.168.33.10
    address 10.0.2.15
    port 3000
    access-address 192.168.33.10
    alternate-access-address 10.0.2.15
    tls-port 3044
    tls-name aerospike
    tls-authenticate-client any
}

Nodes 2 and 3 use ports 3100/3144 and 3200/3244 respectively.

Here is the full response for the 4 peers-* commands:

> pp client.request_info 'peers-clear-std', 'peers-clear-alt', 'peers-tls-std', 'peers-tls-alt'
{"peers-clear-std"=>
  "3,3100,[[C1D4DC08D270008,,[192.168.33.10]],[C814DC08D270008,,[192.168.33.10:3200]]]",
 "peers-clear-alt"=>
  "3,3100,[[C1D4DC08D270008,,[10.0.2.15]],[C814DC08D270008,,[10.0.2.15:3200]]]",
 "peers-tls-std"=>
  "3,3144,[[C1D4DC08D270008,aerospike,[192.168.33.10]],[C814DC08D270008,aerospike,[192.168.33.10:3244]]]",
 "peers-tls-alt"=>
  "3,3144,[[C1D4DC08D270008,aerospike,[10.0.2.15]],[C814DC08D270008,aerospike,[10.0.2.15:3244]]]"}

Note that the server will omit the default port 3000 in the responses.

Here is the same response with TLS disabled on the cluster:

> pp client.request_info 'peers-clear-std', 'peers-clear-alt', 'peers-tls-std', 'peers-tls-alt'
{"peers-clear-std"=>
  "5,3000,[[C814DC08D270008,,[192.168.33.10:3200]],[BB94DC08D270008,,[192.168.33.10]]]",
 "peers-clear-alt"=>
  "5,3000,[[C814DC08D270008,,[10.0.2.15:3200]],[BB94DC08D270008,,[10.0.2.15]]]",
 "peers-tls-std"=>"5,,[]",
 "peers-tls-alt"=>"5,,[]"}

And now with TLS re-enabled but without alternate-access-address:

> pp client.request_info 'peers-clear-std', 'peers-clear-alt', 'peers-tls-std', 'peers-tls-alt'
{"peers-clear-std"=>
  "5,3000,[[C1D4DC08D270008,,[192.168.33.10:3100]],[BB94DC08D270008,,[192.168.33.10]]]",
 "peers-clear-alt"=>"5,,[]",
 "peers-tls-std"=>
  "5,3044,[[C1D4DC08D270008,aerospike,[192.168.33.10:3144]],[BB94DC08D270008,aerospike,[192.168.33.10]]]",
 "peers-tls-alt"=>"5,,[]"}

Unfortunately, I don't have IPv6 setup for testing on my dev environment and I'm having a bit of a problem trying to set it up. So I can't provide an example for IPv6 at the moment. But I believe it should look the same, only with the IPv4 addresses replaced with IPv6 addresses. Only I am not sure if the IPv6 addresses are enclosed in an extra set of square brackets or not.

wallin · 2018-03-06T02:12:46Z

Thanks @jhecking. I'll try and have something ready by next weekend.

As for IPv6, it seems like you need to enclose it in brackets, judging by this comment:
https://github.com/aerospike/aerospike-client-java/blob/e82206496cb3d33222315fb978f7c21ef63251d1/client/src/com/aerospike/client/Host.java#L80-L91

codecov-io · 2018-03-10T23:10:43Z

Codecov Report

Merging #59 into master will increase coverage by 0.86%.
The diff coverage is 97.47%.

@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage   93.91%   94.77%   +0.86%     
==========================================
  Files          98      140      +42     
  Lines        7375     8600    +1225     
==========================================
+ Hits         6926     8151    +1225     
  Misses        449      449

Impacted Files	Coverage Δ
lib/aerospike/value/value.rb	`100% <ø> (ø)`	⬆️
spec/aerospike/index_spec.rb	`100% <ø> (ø)`	⬆️
spec/spec_helper.rb	`100% <100%> (ø)`	⬆️
spec/aerospike/cluster/find_node_spec.rb	`100% <100%> (ø)`
lib/aerospike/aerospike_exception.rb	`100% <100%> (ø)`	⬆️
lib/aerospike/node/refresh/info.rb	`100% <100%> (ø)`
lib/aerospike/node/verify/name.rb	`100% <100%> (ø)`
spec/aerospike/socket/base_spec.rb	`100% <100%> (ø)`
lib/aerospike/socket/tcp.rb	`100% <100%> (ø)`
spec/aerospike/node_spec.rb	`100% <100%> (ø)`
... and 96 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc72c30...bf73d4e. Read the comment docs.

wallin · 2018-03-11T01:43:25Z

@jhecking I've modified to code to handle the peers protocol as well as refactored some of the code into smaller service modules. I'm going to add more unit tests, but the PR is ready for another review on a conceptual level I believe

jhecking · 2018-03-11T01:54:32Z

Sebastian, that's great! I'm traveling and have limited internet connectivity over the next few days. I will provide feedback by Thursday evening SGT. Thanks again for the work you are putting into this and esp. for doing some (much needed) code cleanup and refactoring!

…

On 11 Mar 2018, at 8:43 AM, Sebastian Wallin ***@***.***> wrote: @jhecking I've modified to code to handle the peers protocol as well as refactored some of the code into smaller service modules. I'm going to add more unit tests, but the PR is ready for another review on a conceptual level I believe — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

wallin · 2018-03-11T06:17:26Z

Thanks @jhecking. That will hopefully give me time to finish up the last pieces. Safe travels!

jhecking · 2018-03-15T03:15:08Z

@wallin, are you still planning to do a lot of changes or does it make sense for me to start reviewing the code so far now?

jhecking · 2018-03-15T03:33:27Z

@wallin, I see there are still a few test failures and a large number of tests that seem to be skipped due to some issues with detecting whether certain features are supported by the server. Anything in particular I can help you with?

wallin · 2018-03-15T03:33:43Z

@jhecking The major changes, peers protocol + TLS, are ready for review. Some notes

The TLS implementation hasn't been tested on an actual real server setup.
A fully custom SSLContext object cannot be passed yet, but this is trivial to add
IPv6 host parsing hasn't been implemented yet, but this is also a trivial task
I need help verifying that adding/removing nodes is picked up by the client, and that eg. partition map is updates correctly.
I suspect some old code can be removed but haven't made a huge effort investigating this, mostly focusing on adding more and extracting existing.

Looking forward to your feedback!

wallin · 2018-03-15T03:38:27Z

@jhecking Will look into why they are failing, but unfortunately can't continue until tomorrow night (PST). In the mean time, if you have time, please feel free to point out any errors you may find

wallin · 2018-03-15T04:05:33Z

@jhecking quick update. I believe I fixed the issue related to feature detection. Now remaining is the batch and udf failures

jhecking · 2018-03-15T04:49:08Z

Cool, thanks! I will take a look at that and also test the TLS connection and cluster-change scenarios. Cheers, Jan

…

On 15 Mar 2018, at 12:05 PM, Sebastian Wallin ***@***.***> wrote: @jhecking quick update. I believe I fixed the issue related to feature detection. Now remaining is the batch and udf failures — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

jhecking

Phew, this took longer than expected to review and I haven't even looked at the specs (much) yet. :-)

Overall, I am very happy with the changes you implemented! Thanks again for taking the time to do some refactoring and code clean-up along the way!

One thing I am not totally convinced of yet, is what value there is in pulling out a bunch of business logic into small (tiny in some cases!) service modules. Especially if most of what the service does is to manipulate the state of some other object.

The Node::Verify::PartitionGeneration service is a good example. Other than "parsing" the partition-generation info value, the service mainly reads and updates a Node's partion_generation and partition_changed values. This couples the service very tightly to the Node implementation. E.g. the service needs to be aware that these values are Atomic, which could change at some point.

If the main goal is to move business logic out of the Node (and Cluster, etc.) classes, so that these classes become smaller, maybe a better approach would be to create separate classes for concepts like "partition generation" instead. A hypothetical PartitonGeneration class could encapsulate the actual partition generation value as well as the concept of whether the partition generation has changed (currently represented in Node#partition_changed). It could have two methods update, to set the new value and the changed flag as needed, and confirm (better name?), to reset the changed flag.

This might make testing easier in some cases as well. Some of your service module specs rely a lot on test doubles and stubbing, which can lead to brittle tests as the code base evolves.

But since you are the one doing most of the work, I'll leave it to you to consider this alternative approach.

I haven't had much time to do actual testing of the changes today. Will continue working on that tomorrow.

jhecking · 2018-03-15T05:30:12Z

lib/aerospike/cluster.rb

+    end
+
+    def tls_enabled?
+      (ssl_options || {}).key?(:enable)


This would return true if ssl_options includes a key :enable with a false value. But I would expect tls_enabled? to return false if I set ssl_options = { enable: false }.

Good catch!

jhecking · 2018-03-15T05:37:12Z

lib/aerospike/cluster.rb

    def tend
      nodes = self.nodes
-      cluster_config_changed = false


I think for clarity we should still initialize cluster_config_changed to false in case we don't go into any of the branches where it is being set to true.

jhecking · 2018-03-15T05:40:40Z

lib/aerospike/cluster.rb

-        node.reference_count.value = 0
-        node.responded.value = false
+        node.reset_reference_count!
+        node.partition_changed.value = false


Maybe add another Node#reset_partition_changed! method? Or combine it with Node#reset_reference_count! - though I'm struggling to come up with a good name for such a combined method...

jhecking · 2018-03-15T06:08:24Z

lib/aerospike/cluster.rb

-          end
-        end
+      # refresh all known nodes
+      nodes.each { |node| Node::Refresh::Info.(node, peers) }


Just a stylistic note: I would like to stick to the convention to use { ... } if the primary purpose of a block is to return a value and to use do ... end if the primary purpose of a block is its side-effects. I know it's just a preference and that the existing code does not always follow this convention, but I would like for new code to follow this "Weirich convention".

I was not aware of this "Weirich convention", but I definitely makes sense. Thanks for the link. I've mostly tried to apply the "line count" convention.

jhecking · 2018-03-15T07:32:16Z

lib/aerospike/connection/create.rb

+    module Create
+      class << self
+        def call(host, port, timeout: 30, tls_name: nil, ssl_options: {})
+          if !ssl_options.nil? && ssl_options[:enable] == true


I think we should default ssl_options[:enable] to true and only fall back to a non-encrypted TCP connection if it is set to false explicitly. The primary use-case for ssl_options[:enable] is if someone has configured a TLS connection but wants to temporarily disable TLS, e.g. for debugging, without changing any of the TLS setting itself.

I understand the "default on" TLS option, but could you clarify the use case with an example. I don't see the issue of just flipping the enable option while still keeping the rest of the configuration

I think I just didn't express myself clearly. Having TLS to "default on" and providing an easy switch to temp. disable TLS is exactly what I meant. I.e. it should be possible to set up TLS without explicitly setting enable to true:

ssl_options = { cert: ..., key: ..., ca_file: ... } policy = new Aerospike::ClientPolicy(ssl_options: ssl_options) client = new Aerospike::Client(policy: policy)

Then, if I need to temp. disable TLS, I can just set enable to false without affecting the rest of the code:

ssl_options[:enable] = false

got it, I like that. So if I understand correctly TLS should be enabled as long as ssl_options is not nil, and then explicitly disabled with enable: false option?

jhecking · 2018-03-15T09:56:28Z

lib/aerospike/node/refresh/partitions.rb

+            return unless should_refresh?(node, peers)
+
+            conn = node.tend_connection
+            node.cluster.update_partitions(conn, node)


If we are already pulling out most of the partitions refresh logic into it's own service module, then I think we can move some more logic from Cluster#update_partitions here. Maybe the partition tokenizer (old|new) should be initialized here and then passed into Cluster#update_partitions instead.

jhecking · 2018-03-15T09:58:10Z

lib/aerospike/node/refresh/peers.rb

+      module Peers
+        class << self
+          def call(node, peers)
+            return if node.failures.value > 0 || !node.active?


Should move this to a should_refresh? method and use node.failed?, analog to Refresh::Partitions.

jhecking · 2018-03-15T10:02:38Z

lib/aerospike/node/refresh/peers.rb

+                    ::Aerospike.logger.warn("Peer node #{peer.node_name} is different than actual node #{nv.name} for host #{host}");
+                    # Must look for new node name in the unlikely event that node names do not agree.
+                    # Node already exists. Do not even try to connect to hosts.
+                    break if Cluster::FindNode.(node.cluster, peers, nv.name)


Need to set node_validated to true if we find the node?

missed this. Thanks

jhecking · 2018-03-15T10:47:48Z

lib/aerospike/node/verify/partition_generation.rb

+
+            generation = gen_string.to_i
+
+            if node.partition_generation.value != generation


Maybe this whole block should be moved into a new Node#update_partition_generation(generation) method? Or create a separate PartitionGeneration class altogether - more on this later in the review summary...

Yes, I like this

jhecking · 2018-03-15T10:56:33Z

lib/aerospike/node_validator.rb

+                    Resolv.getaddresses(host.name)
+                  end
+
+      @aliases = [].tap do |aliases|


If you are already rewriting this for stylistic reasons, why not go all the way? :-)

@aliases = addresses.map { |addr| Host.new(addr, host.port, host.tls_name) }

of course! :)

khaf · 2018-03-15T13:26:00Z

This looks really good. Thanks for your great work.

In addition to @jhecking 's concerns, there are a few more issues to consider:

This library is used by others for DevOps; It should remain /mostly/ backwards compatible.
A lot of peculiarities in the library were due to performance considerations. As part of validation process, we should benchmark the new code to make sure there is no major performance degradation.
If code organization is changed too much, we will have trouble maintaining and developing the library in the future, since same changes in our other libraries will look completely different in this one. While this should not be a major concern regarding making progress here, it is still important, and will make adding new features much less time intensive.

jhecking · 2018-03-15T14:58:15Z

Hi @khaf, thanks for your feedback!

I agree that the first 2 points you raise are important. But I do not think there is a particular issue here for the set of changes done by @wallin.

There are no backward compatible API changes; in fact there are no API changes at all except for the addition of ssl_options to the client policy. The only consideration is whether we want to drop support for Ruby v2.2. I still think that's ok but I have also found a way to make the openssl gem dependency optional, since Ruby v2.2 still includes the OpenSSL library in its standard library.
Will definitely run the benchmarks. But most of the changes are not in the hot path and only affect the tend operation. (Enabling TLS will obviously have some performance impact, off course.)

I am less concerned about the 3 point regarding code organization. If we truly consider this client community supported, then adhering to the particular code organization of another Aerospike language SDK should be of less importance than making it easy for other Ruby developers to contribute to the client.

wallin · 2018-03-15T17:47:09Z

Thanks @jhecking! I know it's a (too) big PR with a lot of changes. I'll address your feedback in the upcoming days

One thing I am not totally convinced of yet, is what value there is in pulling out a bunch of business logic into small (tiny in some cases!) service modules. Especially if most of what the service does is to manipulate the state of some other object.

The idea with the service modules it to extract non-trivial operations on models (such as eg. a Node). Essentially moving towards separation of concerns and SRP which also makes operations easier to test. Personally, I've found this a good way of creating maintainable code base that is also easy to navigate. But I think we're on the same page on this philosophy generally speaking.

That said. I will admit that I'm not fully content with how it's currently organised. Being new to AS in general, I was basically figuring out how everything works while implementing this, so I don't feel that I don't have a full holistic understanding of all the concepts. The service modules are implemented very much like the corresponding Java method, which is not always the right abstraction like you say.

I'm not a huge fan of how the Java implementation is organised either, but I didn't want to deviate too much from it in this first implementation. Eg. I think your idea with the PartitionGeneration class makes a lot of sense, but I'd prefer to do that in a future PR.

My focus with this PR was first and foremost to get the functionality in place while also setting a direction for organising the code, without changing to much at the same time. You can expect more PRs from me after this one, however I promise they will be smaller and more isolated :)

E.g. the service needs to be aware that these values are Atomic, which could change at some point.

I don't like this either. In upcoming PRs I want to improve the ergonomics of this by creating a mixin for declaring atomic attributes. eg. attr_atomic :features which would created helpers and hide the complexity of working with this. What do you think?

This might make testing easier in some cases as well. Some of your service module specs rely a lot on test doubles and stubbing, which can lead to brittle tests as the code base evolves.

I see your point but IMHO the purpose of unit tests is to test functionality and behavior in isolation, which is why I think as much as possibly should be stubbed and mocked. The idea is that tests should break as the code base evolves. As the abstractions become more clear, I imaging the tests will also improve.

wallin · 2018-03-15T17:59:22Z

@khaf @jhecking

Echoing what @jhecking said. No changes to the external API, so we should be ok. Otherwise, I think it's a fair requirement to require a non-deprecated Ruby version. Generally, if backwards compatibility is a strict requirement, we should create a separate release branch for that imho.
I've used the benchmark in examples folder with no notable difference in performance. I've also run more in depth profiling with ruby-prof, stackprof and allocation-tracer, also without notable differences. However, I did notice a couple of places that can be optimized, mostly related to allocations, but also there is a lot of string manipulation that probably can be avoided. I will submit separate PRs for these findings

I agree with @jhecking on 3. I appreciate the importance of providing consistent SDKs, but mostly from an API perspective. Eg. IMHO the Java internal implementation is not exemplary, so I don't think that structure is necessarily worth preserving for the sake of consistency

jhecking · 2018-03-16T03:42:27Z

The idea with the service modules it to extract non-trivial operations on models (such as eg. a Node). Essentially moving towards separation of concerns and SRP which also makes operations easier to test. Personally, I've found this a good way of creating maintainable code base that is also easy to navigate. But I think we're on the same page on this philosophy generally speaking.

Yes! We agree on the goals. I'm just not sure whether the service modules -- in the current form anyway -- are necessarily the best way to achieve those goals.

In many cases, the current service modules are not much more than a single method of the Node or Cluster module moved into a separate file. They are often closely coupled to the model since they directly manipulate its internal state. You don't really achieve separation of concerns this way because the model and the service(s) share responsibility for a single concern. E.g. the Node model and the Node::Verify::PartitionGeneration service are both responsible for keeping track of the partition generation and whether it has changed: Node keeps the state but the PartitionGeneration service updates it. It would be better if both the state as well as the methods that manipulate it would be extracted to a separate domain model altogether. Node and this new domain model should communicate only via passing messages but should be unaware of each other's internal state.

But for now, let's focus on completing the implementation of the peers protocol and TLS support. I'm looking forward to further PRs from you to address these design issues. :-)

jhecking · 2018-03-16T03:59:05Z

In upcoming PRs I want to improve the ergonomics of this by creating a mixin for declaring atomic attributes. eg. attr_atomic :features which would created helpers and hide the complexity of working with this. What do you think?

That's not what I would focus on. I actually doubt that all of these values need to be Atomic in the first place. Most of them are only accessed from the single tend thread. Likewise, I think we can probably do away with the mutex lock used by Cluster. Its main purpose is to prevent get/put/... commands from accessing the partition table while it's being updated by the tend thread. But there are better strategies to do that using copy-on-write. You'll note that the Java client hardly uses any atomics nor locks for performance reasons.

jhecking · 2018-03-16T04:00:37Z

I've used the benchmark in examples folder with no notable difference in performance.

I believe you mean the benchmark in the tools folder? The examples do not include any benchmarks.

jhecking · 2018-03-16T07:51:06Z

Gemfile

 gem 'msgpack-jruby', :require => 'msgpack', :platforms => :jruby
 gem 'msgpack', '~> 1.0', :platforms => [:mri, :rbx]
 gem 'bcrypt'
+gem 'openssl', platforms: :mri


I'm ok to remove Ruby 2.2 from the CI support matrix, but I don't want to break compatibility with older Ruby releases unnecessarily. So can we change this to:

install_if -> { Gem::Version.new(RUBY_VERSION) >= Gem::Version.new('2.3.0') } do gem 'openssl', platforms: :mri end

This will allow the aerospike gem to still be installed on Ruby v2.2 and older. Even TLS encryption might still work since the standard library includes a version of the OpenSSL library anyway. That version probably does not support the latest TLS versions, new ciphers, etc. So upgrading to a later Ruby version is still highly recommended but at least the client is still usable on older Ruby versions.

On a similar note. What's your stance on supporting JRuby? I think it can be done without too much effort, but we can address that in a separate PR if you think it's a priority

jhecking · 2018-03-16T09:15:36Z

lib/aerospike/socket/ssl.rb

+        tcp_sock = TCP.connect(host, port, timeout)
+
+        ctx = OpenSSL::SSL::SSLContext.new
+        ctx.set_params(ssl_options) if ssl_options && !ssl_options.empty?


This will fail if ssl_options includes any keys for which there are no corresponding assignment methods on SSLContext. E.g. if ssl_options includes a key :enable, this will raise an error because set_params will attempt to call ctx.enable=.

I think we should revert back to what you had before and support only a few keys in ssl_options that we use to call ctx.add_certificate. For anything more complicated than that we allow the user to pass in a fully configured SSLContext instance.

Another option could be to only let ssl_options only represent options that can be passed to initialize the SSLContext skip the enable flag all together. If you temporarily want to disable SSL, you can just pass nil instead of the options. So:

if ssl_options is a Hash, we create a new context and call set_options

if ssl_options already is an SSLContext we just use that.

In a sense I think this is cleaner rather than maintaining a set of own options, because we can just refer to the official documentation for SSLContext. What do you think?

jhecking · 2018-03-16T09:26:14Z

lib/aerospike/socket/ssl.rb

+        ssl_sock = new(tcp_sock, ctx)
+        ssl_sock.hostname = tls_name
+        ssl_sock.connect
+        ssl_sock.post_connection_check(host)


This needs to check using tls_name instead of host.

indeed. good catch

jhecking · 2018-03-16T09:40:38Z

lib/aerospike/node/refresh/peers.rb

+
+              peer.hosts.each do |host|
+                begin
+                  nv = NodeValidator.new(node.cluster, node.host, node.cluster.connection_timeout, node.cluster.ssl_options)


You are missing a cluster_name param here.

wallin · 2018-03-19T20:29:29Z

Yes, that would be good. But if we keep growing the scope of this PR it'll never get done. :-)

My point exactly :) I just wanted to know if the current/old implementation was acceptable

If we just pass ssl_options straight through to SSLContext#set_params then I don't really see the point of having that option at all.

Point taken. I'll leave it as is, and we can extend and improve it in future PRs

I don't see a reason why we should try to keep the connection alive even longer

We can leave this as is for now. The reason I ask is that I see many other DB drivers offering configuration for this

you are also trying to use Cluster::FindNode in Node::Refresh::Friends and that's where the issue is.

Got it! Will fix this

jhecking · 2018-03-20T03:48:09Z

My point exactly :) I just wanted to know if the current/old implementation [of find_nodes_to_remove] was acceptable

I'm ok either way: We can leave this for a separate PR or update to the new algorithm right away. The new implementation is definitely better!

We can leave this as is for now. The reason I ask is that I see many other DB drivers offering configuration for [TCP keep-alive]

Noted. I'll check with our other client devs what their views are on enabling TCP keep-alive.

jhecking · 2018-03-20T05:25:17Z

Re. TCP keep-alive: I forgot that the server closes idle connections after 60 seconds, so TCP keep-alive is not so important. Most other Aerospike clients also close idle connections, e.g. the Java client has a maxSocketIdle limit of 55 seconds by default.

wallin · 2018-03-20T16:16:02Z

@jhecking sounds good. I'll leave everything as is so we can get this PR merged, and then go ahead and create issues for all of the updates/improvements we've talked about

I'll just spend some time debugging why the "friend" update isn't working properly, and hopefully, come up with a fix soon

wallin · 2018-03-20T18:47:15Z

@jhecking update on the issue I'm seeing with RefreshFriends. In my local setup, every node reports two IPs (eg. 192.68.50.4 and 10.0.2.15). When iterating through these the NodeValidator connects to the first IP non-blocking and shortly thereafter to the second IP, causing Errno::EALREADY to be raised and the prepare code to be skipped.

Any ideas on how to best handle this case?

jhecking · 2018-03-21T07:40:46Z

I have pushed a couple more fixes to the aerospike:tls-support branch. Specifically:

NodeValidator didn't resolve IP addresses correctly.
Node::Refresh::Friends didn't add new hosts to a node's aliases list correctly.
I'd occasionally get exceptions when trying to close a socket that was already closed.
Improve the error message when unable to connect to a new host and the connection times out.

The last issue, is what you were running into, I think. Based on your IP addresses, it sounds like you are running the server in one or more Vagrant boxes, correct? I have the same setup. From my (macOS) host I am not able to connect to the 10.0.2.15 address that the server in the vagrant guest publishes. That would manifest itself in the Errno::EALREADY error. (I'm not 100% sure, why...)

After some struggles to get a second, valid IP address added to my vagrant guest, I'm now able to verify that the cluster lookup using the old services protocol (aka "friends") works as expected, even if the server publishes two separate IP addresses per node.

In a production setup, you would probably want to avoid publishing two separate IP addresses for clients, and specify the address to publish for clients using the access-address configuration. Unless you have two or more different sets of clients that need to connect via different IP addresses (e.g. internal/external IP), in which case you need to use alternate-access-address. See General Network Configuration docs for more details.

jhecking · 2018-03-21T10:01:57Z

lib/aerospike/node/refresh/peers.rb

+            peers.peers.each do |peer|
+              next if ::Aerospike::Cluster::FindNode.(cluster, peers, peer.node_name)
+
+              node_validated = false


Should this be peers_validated? (And on lines 34 and 41 as well.)

Turned out I missed this part of transcribing the Java peer refresh code
https://github.com/aerospike/aerospike-client-java/blob/7352c15c04015f4bd4f6985e138fa931179fb5d7/client/src/com/aerospike/client/cluster/Node.java#L363-L365

jhecking · 2018-03-21T10:26:33Z

spec/lib/aerospike/peers/parse_spec.rb

@@ -0,0 +1,51 @@
+# frozen_string_literal: true
+
+RSpec.describe Aerospike::Peers::Parse do


Can you please move the specs under spec/lib/aerospike to just spec/aerospike instead?

jhecking · 2018-03-21T10:34:45Z

I think we are getting close to the point where this PR is ready to be merged! Some outstanding items:

We'll unfortunately have to add the standard license boilerplate to all new files - I've already done this and committed it at ddbfccf. Please pull from my branch again.
There are still a number of pending specs. Are you planning to complete some more of them?
We should probably do a rebase against master and split up the PR into a smaller set of commits. Maybe we should even split the PR into two separate PRs to add the peers protocol and TLS support separately? What do you think?

Anything else I'm missing?

wallin · 2018-03-21T15:33:36Z

Great! Thanks for adding the license texts and making the "friend fixes"

I'll go ahead and add some more specs for the less entangled modules. Some, like eg. Refresh::Peers, are still a bit too complex to write good unit tests for imho, so I'll wait with that
I'd be glad to merge this in a cleaner way, however, I don't have a clear picture of how that would happen, so feel free to take the lead on this one, and I'd be happy to oblige. One drawback of splitting up the PR in separate new ones, if that we'd have to re-do a lot of the manual testing that has gone into this PR.
I'll think about if there is something else, but nothing that comes to mind right now

jhecking · 2018-03-23T01:56:32Z

I'll go ahead and add some more specs for the less entangled modules. Some, like eg. Refresh::Peers, are still a bit too complex to write good unit tests for imho, so I'll wait with that

Hmm, these four files still have very low test coverage: node/refresh/friends.rb, node/refresh/peers.rb, socket/ssl.rb, and socket/base.rb. I'm a bit worried especially about Node::Refresh::Friends and Socket::SSL because these two features do not get used in the tests at all; at least Node::Refresh::Peers and Socket::Base get covered somewhat in the course of the execution of the rest of the test suite.

But I'm not going to treat this as a blocker. Other than these four files, the test coverage for your changes is very good! Thanks for that! I'd just ask that you remove the specs that are currently empty except for a pending statement, if you don't have the intention to complete them at the moment.

I'd be glad to merge this in a cleaner way, however, I don't have a clear picture of how that would happen, so feel free to take the lead on this one, and I'd be happy to oblige. One drawback of splitting up the PR in separate new ones, if that we'd have to re-do a lot of the manual testing that has gone into this PR.

Sure, I can help with that. However, were you still going to add wrappers for the Node::Refresh::* service methods to Node? That's the last pending change that I am aware of at the moment.

jhecking · 2018-03-24T08:55:21Z

I added a few specs for Socket::SSL class here: f086c45. Not great, but better than nothing. ;-)

Thinking about how we can setup the CI env. so that we can actually test TLS connections end-to-end.

wallin · 2018-03-24T11:09:04Z

were you still going to add wrappers for the Node::Refresh::* service methods to Node? That's the last pending change that I am aware of at the moment.

Done!

I added a few specs for Socket::SSL class here: f086c45

Thanks for taking that off my chest! however, I took the liberty to re-write them in a more Rspec way. Sorry, being very anal about this ;)

There are still a number of pending specs. Are you planning to complete some more of them?

I added some specs for socket/base.rb and socket/tcp.rb and deleted the rest of the pending files

'm a bit worried especially about Node::Refresh::Friends

I share your concern. At some point, I'd very much like to refactor that module into smaller methods so we can do meaningful unit tests, but for now, it's best to just keep it as similar to the Java implementation as possible. Would there be any way of testing this on an integration level? Like, disable peers support on the AS cluster?

jhecking · 2018-03-24T14:12:59Z

spec/aerospike/socket/ssl_spec.rb

+    end
+  end
+
+  describe '::create_context' do


Why are there two separate describe blocks for ::create_context?

jhecking · 2018-03-24T14:14:56Z

spec/aerospike/socket/ssl_spec.rb

+      let(:pkey) { resource('ssl', 'test.key.pem') }
+      let(:ssl_options) { { cert_file: cert, pkey_file: pkey } }
+
+      before { allow_any_instance_of(OpenSSL::SSL::SSLContext).to receive(:add_certificate) }


Change before { ... } to before do ... end? 'Cause I'm anal about this... ;-)

jhecking · 2018-03-24T14:30:15Z

wrappers for the Node::Refresh::* service methods ... Done!

Thanks!

I took the liberty to re-write [the specs for Socket::SSL] in a more Rspec way. Sorry, being very anal about this ;)

No problem. We all got our pet peeves. ;-)

I added some specs for socket/base.rb and socket/tcp.rb and deleted the rest of the pending files

Yay!

Node::Refresh::Friends: I share your concern. At some point, I'd very much like to refactor that module into smaller methods so we can do meaningful unit tests, but for now, it's best to just keep it as similar to the Java implementation as possible. Would there be any way of testing this on an integration level? Like, disable peers support on the AS cluster?

I don't think we can disable peers support on the server. But we could make it a client side setting - for testing only. Just pull out this one line in Cluster#refresh_nodes into a new method and override the value based on some new setting in ClientPolicy:

peers.use_peers = nodes.all? { |node| node.supports_feature?('peers') }

jhecking · 2018-03-24T14:42:26Z

Pull out this one line in Cluster#refresh_nodes into a new method and override the value based on some new setting in ClientPolicy.

Or maybe better yet, stub that method to always return false for the tests and forget about adding a setting to ClientPolicy which should never be used anyway.

wallin · 2018-03-24T18:07:05Z

Or maybe better yet, stub that method to always return false for the tests and forget about adding a setting to ClientPolicy which should never be used anyway.

Great idea. I did that and it seems to work, so at least now the code path is exercised

Let me know how you think we should proceed with merging

jhecking · 2018-03-26T09:01:49Z

Great idea. I did that and it seems to work, so at least now the code path is exercised

Yay! Finally codecov is happy as well. :-)

Let me know how you think we should proceed with merging

I've taken all your changes, rebased them on master and then grouped them into just 3 commits for cleanup, peers protocol support and TLS support: https://github.com/aerospike/aerospike-client-ruby/compare/tls-support. You can run a diff against your own branch and the changes should be minimal. (I decided to remove support for Ruby 2.2.)

Let me know what you think. If it looks good to you, then you can hard reset your branch to the HEAD of my own branch and force push back to GitHub. And unless you think there is anything else left to do, I would then merge your PR back to master.

* Move cluster/node/node_validator * Add helper methods to Node to access reference_count, responded, active state * Fix URL in license boilerplace * Remove space at end of line * NodeValidator: Improve resolution of aliases * Specs: Always require spec_helper * Codecov: Ignore specs * Improve debug logs * Misc. minor stylistic changes

* Requires Aerospike Server v3.10 or later. Replaces use of 'services' info command used on older servers for cluster discovery. * Refactor tend logic: Remove from Cluster/Node models and break up into several smaller service modules. * Add specs covering both old and new protocol.

* Rename and split up Cluster::Connection into Socket::TCP and Socket::SSL * Add new ssl-options to ClientPolicy to setup TLS connection * Refactor create-connection logic into separate service module * Extend peers protocol support to use peers-tls-std when using TLS * Refactor parsing logic for hosts lists to support optional tls-name * Add new InvalidCredentials error class * Remove Ruby 2.2 from CI test matrix * Add external OpenSSL gem dependency

wallin · 2018-03-27T00:01:44Z

Let me know what you think. If it looks good to you, then you can hard reset your branch to the HEAD of my own branch and force push back to GitHub. And unless you think there is anything else left to do, I would then merge your PR back to master.

Great work. Just rebased and everything looks nice and tidy. I'm not missing anything afaik. Thanks for being so responsive and thorough in your comments, it's been a pleasure implementing this together with you.

jhecking · 2018-03-27T01:42:22Z

Merged!

Thanks for being so responsive and thorough in your comments, it's been a pleasure implementing this together with you.

Likewise. Looking forward to your future contributions!

jhecking · 2018-03-27T03:05:33Z

Tagged v2.6.0 and released it to RubyGems.org.

jhecking reviewed Feb 27, 2018

View reviewed changes

jhecking reviewed Feb 28, 2018

View reviewed changes

jhecking reviewed Mar 15, 2018

View reviewed changes

jhecking reviewed Mar 16, 2018

View reviewed changes

jhecking reviewed Mar 21, 2018

View reviewed changes

jhecking reviewed Mar 24, 2018

View reviewed changes

jhecking added the enhancement label Mar 25, 2018

wallin and others added 4 commits March 26, 2018 17:18

Update changelog and min. required Ruby version

bf73d4e

wallin force-pushed the tls-support branch from c49d4eb to bf73d4e Compare March 26, 2018 23:55

jhecking merged commit 41718d9 into aerospike:master Mar 27, 2018

wallin deleted the tls-support branch April 3, 2018 02:03


		generation = gen_string.to_i

		if node.partition_generation.value != generation

		@@ -0,0 +1,51 @@
		# frozen_string_literal: true

		RSpec.describe Aerospike::Peers::Parse do

TLS support + peers protocol #59

TLS support + peers protocol #59

Conversation

wallin commented Feb 26, 2018 • edited Loading

jhecking commented Feb 26, 2018

wallin commented Feb 26, 2018 • edited Loading

jhecking left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wallin commented Feb 27, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wallin commented Mar 5, 2018

jhecking commented Mar 5, 2018

wallin commented Mar 6, 2018

codecov-io commented Mar 10, 2018 • edited Loading

Codecov Report

wallin commented Mar 11, 2018

jhecking commented Mar 11, 2018 via email

wallin commented Mar 11, 2018

jhecking commented Mar 15, 2018

jhecking commented Mar 15, 2018

wallin commented Mar 15, 2018

wallin commented Mar 15, 2018

wallin commented Mar 15, 2018

jhecking commented Mar 15, 2018 via email

jhecking left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

khaf commented Mar 15, 2018

jhecking commented Mar 15, 2018

wallin commented Mar 15, 2018 • edited Loading

wallin commented Mar 15, 2018

jhecking commented Mar 16, 2018

jhecking commented Mar 16, 2018

jhecking commented Mar 16, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jhecking Mar 16, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wallin commented Mar 19, 2018

jhecking commented Mar 20, 2018

jhecking commented Mar 20, 2018

wallin commented Mar 20, 2018

wallin commented Mar 20, 2018

jhecking commented Mar 21, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jhecking commented Mar 21, 2018

wallin commented Feb 26, 2018 •

edited

Loading

wallin commented Feb 26, 2018 •

edited

Loading

jhecking left a comment •

edited

Loading

codecov-io commented Mar 10, 2018 •

edited

Loading

wallin commented Mar 15, 2018 •

edited

Loading

jhecking Mar 16, 2018 •

edited

Loading

wallin commented Mar 21, 2018 •

edited

Loading