[IDEA] - Add a health metrics if the node-socket is reachable at all #154

Closed
gitmachtl opened this issue Dec 25, 2021 · 16 comments
Labels
help welcomed Good for newcomers server Issues which regard the server


gitmachtl commented Dec 25, 2021

Describe your idea, in simple words.

Running for example node 1.33.0 in P2P mode with

"DiffusionMode": "InitiatorOnly",

in the config will not create a local listening port anymore. So we can't use cardanoPing/cncli to check if the node is alive.

If such a node stops working or is shut down, there is currently no flag for that in the Ogmios health check:

curl -s http://127.0.0.1:1337/health | jq
{
  "startTime": "2021-12-25T10:16:39.579348019Z",
  "lastKnownTip": {
    "slot": 48861271,
    "hash": "333b265fc2a34f230f0f7a579e76fb0c841be11549832f290e77822cbbe0fec2",
    "blockNo": 6672097
  },
  "lastTipUpdate": "2021-12-25T10:19:22.281947735Z",
  "networkSynchronization": 1,
  "currentEra": "Alonzo",
  "metrics": {
    "activeConnections": 0,
    "totalConnections": 0,
    "totalUnrouted": 0,
    "sessionDurations": {
      "mean": 0,
      "min": 0,
      "max": 0
    },
    "runtimeStats": {
      "currentHeapSize": 209,
      "gcCpuTime": 1240003707,
      "cpuTime": 1722094732,
      "maxHeapSize": 325
    },
    "totalMessages": 0
  }
}

That's a sample output after the node was shut down.

So, with the health metrics, the only way currently to tell whether the node is really OK is to compare lastKnownTip with the theoretical tip calculated from the genesis files, and to apply a threshold for when it falls too far behind.
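The tip comparison described above can be sketched in shell as follows. It assumes mainnet, where the Shelley era began at slot 4492800 on 2020-07-29T21:44:51Z (Unix time 1596059091) with one-second slots; other networks need their own values from the genesis files. The function names and the 300-slot default threshold are illustrative and not part of Ogmios.

```shell
# Assumed mainnet constants (take them from the genesis files for other networks).
SHELLEY_START_SLOT=4492800        # first Shelley slot
SHELLEY_START_TIME=1596059091     # 2020-07-29T21:44:51Z as Unix time
SLOT_LENGTH=1                     # seconds per slot since Shelley

# expected_slot NOW: slot the chain should have reached at Unix time NOW.
expected_slot() {
  echo $(( ($1 - SHELLEY_START_TIME) / SLOT_LENGTH + SHELLEY_START_SLOT ))
}

# tip_lag SLOT NOW: how many slots lastKnownTip.slot is behind the expected slot.
tip_lag() {
  echo $(( $(expected_slot "$2") - $1 ))
}

# tip_is_fresh SLOT NOW [THRESHOLD]: succeeds if the lag is within THRESHOLD slots.
tip_is_fresh() {
  [ "$(tip_lag "$1" "$2")" -le "${3:-300}" ]
}
```

With the sample output above, `tip_lag 48861271 "$(date +%s)"` keeps growing by one per second once the node is down, so `tip_is_fresh` starts failing a few minutes after the last block.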

The error log shows a warning like:

{"severity":"Warning","timestamp":"2021-12-25T10:37:23.904043804Z","thread":"7","message":{"Health":{"tag":"HealthFailedToConnect","socket":"/home/.../db/node.socket","retryingIn":5}},"version":"v5.0.0"}

"networkSynchronization" also stays at 1 (= 100%).

Why is it a good idea?

It would be nice to have a flag that can show if the current connection to the node via the node socket is ok or not. We get error outputs in the logs, but not on the health check here.
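Until the health endpoint exposes the connection state, the HealthFailedToConnect warning quoted above is the only direct signal. A rough sketch of watching for it, assuming Ogmios writes its structured logs as one JSON object per line to a file or pipe (the function name and the log file name in the usage note are made up for illustration):

```shell
# count_connect_failures: read Ogmios JSON log lines on stdin and print how
# many of them carry the HealthFailedToConnect tag shown in the warning above.
count_connect_failures() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the exit code at 0.
  grep -c '"tag":"HealthFailedToConnect"' || true
}
```

For example, `tail -n 20 ogmios.log | count_connect_failures` would print a non-zero count while the socket is unreachable.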

@KtorZ KtorZ added enhancement help welcomed Good for newcomers server Issues which regard the server labels Dec 25, 2021

KtorZ commented Dec 25, 2021

Good point.
Note that the last known tip also comes with a UTC timestamp so, in principle, this is "enough" to know if it's starting to drift, albeit not practical.

It's also unfortunate that the network synchronization is only updated on every new tip. While simple, this means the value is only refreshed when the connection is up. Perhaps having a background thread create artificial ticks would be better here.
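The drift check mentioned above can be sketched without jq or GNU date, which matters on minimal images. It assumes the timestamp in question is the lastTipUpdate field from the sample output and that it is always UTC in the form shown there; the function names are illustrative.

```shell
# iso_to_epoch TIMESTAMP: convert an ISO 8601 UTC timestamp such as
# 2021-12-25T10:19:22.281947735Z to whole seconds since 1970-01-01T00:00:00Z.
iso_to_epoch() {
  ts=${1%Z}; ts=${ts%%.*}                  # drop the trailing Z and the fraction
  date_part=${ts%%T*}; time_part=${ts#*T}
  y=${date_part%%-*}; rest=${date_part#*-}; m=${rest%%-*}; d=${rest#*-}
  hh=${time_part%%:*}; rest=${time_part#*:}; mm=${rest%%:*}; ss=${rest#*:}
  # Strip one leading zero so that 08 and 09 are not read as octal.
  m=${m#0}; d=${d#0}; hh=${hh#0}; mm=${mm#0}; ss=${ss#0}
  # Days since the epoch (Gregorian calendar, years from 1970 onwards).
  if [ "$m" -le 2 ]; then y=$((y - 1)); fi
  era=$((y / 400)); yoe=$((y - era * 400))
  if [ "$m" -gt 2 ]; then mp=$((m - 3)); else mp=$((m + 9)); fi
  doy=$(( (153 * mp + 2) / 5 + d - 1 ))
  doe=$((yoe * 365 + yoe / 4 - yoe / 100 + doy))
  days=$((era * 146097 + doe - 719468))
  echo $((days * 86400 + hh * 3600 + mm * 60 + ss))
}

# tip_age LAST_TIP_UPDATE NOW: seconds elapsed since lastTipUpdate at Unix time NOW.
tip_age() {
  echo $(( $2 - $(iso_to_epoch "$1") ))
}
```

Calling `tip_age "2021-12-25T10:19:22.281947735Z" "$(date +%s)"` gives the number of seconds since the last tip update, which a monitor can compare against whatever threshold it considers stale.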


gitmachtl commented Dec 25, 2021

Would it be possible to set "networkSynchronization": null if there is no socket connection to the node? This would also handle the start-up condition when Ogmios is started before the node: reporting a networkSynchronization of 0% in that case is not really correct. Reporting null would cover it, because "we don't know" the value in that state.
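To illustrate how a client could consume the proposal above, here is a minimal sketch that treats a null (or missing) networkSynchronization as "unknown" instead of a percentage. It only uses tr and sed so it also runs where jq is unavailable; the function names and the three state labels are assumptions made for this example.

```shell
# network_sync: read the /health JSON on stdin and print the raw value of
# networkSynchronization (a number, or null under the proposal above).
network_sync() {
  tr -d '\n' | sed -n 's/.*"networkSynchronization":[[:space:]]*\([^,}[:space:]]*\).*/\1/p'
}

# sync_state: classify the value as unknown, syncing or synced.
sync_state() {
  value=$(network_sync)
  case "$value" in
    ''|null) echo unknown ;;   # field missing or explicitly null
    1|1.0*)  echo synced ;;
    *)       echo syncing ;;
  esac
}
```

A caller would pipe the endpoint into it, for example `curl -s http://127.0.0.1:1337/health | sync_state`.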

@KtorZ KtorZ self-assigned this Dec 27, 2021
@KtorZ KtorZ closed this as completed in 11028f9 Dec 29, 2021

redoracle commented Dec 30, 2021

What about implementing it in the Docker images of Ogmios as a healthcheck.sh script?

Currently neither curl nor jq are installed on the docker image.


KtorZ commented Dec 30, 2021

@redoracle -> implementing what exactly in the docker image 🤔 ?


redoracle commented Dec 30, 2021

I meant implementing a healthcheck.sh script, as Docker images usually do, to verify that the container is running properly; otherwise the healthcheck script will trigger a container restart.

Using this command: curl -s http://127.0.0.1:1337/health | jq
I guess it is possible to check some of the metrics to tell whether the Ogmios container is running properly.

Alternatively, I can create one and map it into the container, but I need at least curl and jq preinstalled to make it work.

Attached is an example of a container with a health check and one without.

Screen Shot 2021-12-30 at 4 42 01 PM


KtorZ commented Dec 30, 2021

Seems like this can work nicely with just wget as follows:

HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]


KtorZ commented Dec 30, 2021

Note: I've started re-working the docker images recently to avoid having to maintain two build systems. The new images are based on the Nix build and make heavy use of the caching:

#  This Source Code Form is subject to the terms of the Mozilla Public
#  License, v. 2.0. If a copy of the MPL was not distributed with this
#  file, You can obtain one at http://mozilla.org/MPL/2.0/.

#                                                                              #
# ------------------------------- SETUP  ------------------------------------- #
#                                                                              #

FROM nixos/nix:2.3.11 as build

RUN echo "substituters = https://cache.nixos.org https://hydra.iohk.io" >> /etc/nix/nix.conf &&\
    echo "trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY= hydra.iohk.io:f/Ea+s+dFdN+3Y/G+FDgSq+a5NEWhJGzdjvKNGv0/EQ=" >> /etc/nix/nix.conf

WORKDIR /app
RUN nix-shell -p git --command "git clone --depth 1 https://github.com/input-output-hk/cardano-configurations.git"

WORKDIR /app/ogmios
RUN nix-env -iA cachix -f https://cachix.org/api/v1/install && cachix use cardano-ogmios
COPY . .
RUN nix-build -A ogmios.components.exes.ogmios -o dist
RUN cp -r dist/* . && chmod +w dist/bin && chmod +x dist/bin/ogmios

#                                                                              #
# --------------------------- BUILD (ogmios) --------------------------------- #
#                                                                              #

FROM busybox as ogmios

ARG NETWORK=mainnet

LABEL name=ogmios
LABEL description="A JSON WebSocket bridge for cardano-node."

COPY --from=build /app/ogmios/bin/ogmios /bin/ogmios
COPY --from=build /app/cardano-configurations/network/${NETWORK} /config

EXPOSE 1337/tcp
STOPSIGNAL SIGINT
HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]
ENTRYPOINT ["/bin/ogmios"]

#                                                                              #
# --------------------- RUN (cardano-node & ogmios) -------------------------- #
#                                                                              #

FROM inputoutput/cardano-node:1.31.0 as cardano-node-ogmios

ARG NETWORK=mainnet

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

LABEL name=cardano-node-ogmios
LABEL description="A JSON WebSocket bridge for cardano-node w/ a cardano-node."

COPY --from=build /app/ogmios/bin/ogmios /bin/ogmios
COPY --from=build /app/cardano-configurations/network/${NETWORK} /config

RUN mkdir -p /ipc

WORKDIR /root
COPY scripts/cardano-node-ogmios.sh cardano-node-ogmios.sh
# Ogmios, cardano-node, ekg, prometheus
EXPOSE 1337/tcp 3000/tcp 12788/tcp 12798/tcp
STOPSIGNAL SIGINT
HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]
CMD ["bash", "cardano-node-ogmios.sh" ]

Still a work in progress, however, as the cardano-node-ogmios image isn't working properly (I need to override the entrypoint of the image with the script doing the basic process monitoring).

@redoracle

Seems like this can work nicely with just wget as follows:

HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD \
  [ connected == $(wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/') ]

That's nice too, but wget is still missing as a preinstalled package, while sed is there.

@redoracle

# Ogmios, cardano-node, ekg, prometheus
EXPOSE 1337/tcp 3000/tcp 12788/tcp 12798/tcp

Do you really need to expose all those ports if only used internally?
Normally the internal process will open those ports internally anyway and, if needed, they can be mapped with "-p" to the public host interface.

BTW very good point migrating to nix, I like it very much.


redoracle commented Jan 2, 2022

wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/'

root@973ea926352e:/# wget http://localhost:1337 | sed 's/.*"connectionStatus":"\([a-z]\+\)".*/\1/'
--2022-01-02 12:55:28-- http://localhost:1337/
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:1337... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'index.html.20'

index.html.20 [ <=> ] 7.63K --.-KB/s in 0s

2022-01-02 12:55:28 (1.01 GB/s) - 'index.html.20' saved [7811]

Not sure wget does the same as curl... or am I missing some other option?

The following returns the value that tells us that Ogmios is in sync, right?
root@973ea926352e:/# curl -s http://127.0.0.1:1337/health | jq .networkSynchronization
1
which I presume implies that it is connected.
root@973ea926352e:/# curl -s http://127.0.0.1:1337/health | jq .connectionStatus
"connected"


KtorZ commented Jan 2, 2022

That's nice too, but wget is still missing as a preinstalled package, while sed is there.

Even on the new images with Nix, that is, on top of BusyBox? I thought wget was available in BusyBox ... 🤔

Do you really need to expose all those ports if only used internally?

Those aren't internal though, except maybe 3000/tcp. EKG and Prometheus are used for metrics, and Ogmios is used for local clients.

Not sure wget does the same as curl... or am I missing some other option?

Ah! My mistake... We need to hit the health endpoint here! So http://localhost:1337/health !!


redoracle commented Jan 2, 2022

So http://localhost:1337/health !!

OK, but wget keeps saving the file instead of printing it, so I need an additional step to retrieve, from the saved file, the particular metric that says the node is connected and in sync. Right?

@redoracle

what about this?
wget -qO- http://localhost:1337/health | sed 's/.*"connectionStatus":"//g' | sed 's/connected"}/1/g'
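The pipeline above relies on connectionStatus being the last field of the JSON object. A slightly more defensive sketch, still using only tools found in BusyBox, extracts the field wherever it sits; the function names are made up for this example, and "connected" is the only status value taken from this thread.

```shell
# connection_status: read the /health JSON on stdin and print the value of
# connectionStatus, wherever the field sits in the object.
connection_status() {
  tr -d '\n' | sed -n 's/.*"connectionStatus":[[:space:]]*"\([^"]*\)".*/\1/p'
}

# is_connected: succeed only when the status is exactly "connected".
is_connected() {
  [ "$(connection_status)" = "connected" ]
}
```

Used as `wget -qO- http://localhost:1337/health | is_connected`, the exit code is 0 only when the server answers and reports a connection, which is the shape a health check script needs.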

@redoracle

For now I got it working with a healthcheck.sh mapped inside the container as follows:

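# Install wget on first run if the image does not ship it.
if ! command -v wget >/dev/null 2>&1;
then
apt update && apt -y install wget;
fi

# Reduce the JSON to "0" when it ends with "connectionStatus":"connected"}.
result=$(wget -qO- http://localhost:1337/health | sed 's/.*"connectionStatus":"//g' | sed 's/connected"}/0/g')

# Quote $result so that an empty value (server unreachable) fails the check.
if [ "$result" != 0 ]; then exit 1; fi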

I guess with the NIX version it wouldn't work though :)

Screen Shot 2022-01-02 at 2 59 37 PM


KtorZ commented Jan 3, 2022

I figured that a nicer way to do all this would be to have a proper health-check command in Ogmios to begin with, so I implemented:

$ ogmios health-check --help
Handy command to check whether an Ogmios server is up-and-running, and correctly connected to a Network / cardano-node.

This can, for example, be wired to Docker's HEALTHCHECK feature easily.

Usage: ogmios health-check [--port TCP/PORT]
  Performs a health check against a running server.

Available options:
  -h,--help                Show this help text
  --port TCP/PORT          Port to listen on. (default: 1337)

(see 62691fb)

It exits with 0 or 1, depending on whether it could perform a health check on a running server. Dead-simple to configure the HEALTHCHECK in the Dockerfile with that:

HEALTHCHECK --interval=10s --timeout=5s --retries=1 CMD /bin/ogmios health-check
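Outside Docker, the same command can back a cron job or a supervisor probe. A small sketch, assuming the binary lives at /bin/ogmios as in the Dockerfile above; the OGMIOS_BIN variable and the function name are hypothetical and exist only for this example.

```shell
# probe_ogmios [PORT]: run `ogmios health-check` and report the result.
# OGMIOS_BIN is a hypothetical override, handy for testing the wrapper.
probe_ogmios() {
  if "${OGMIOS_BIN:-/bin/ogmios}" health-check --port "${1:-1337}" >/dev/null 2>&1; then
    echo healthy
  else
    echo unhealthy
    return 1
  fi
}
```

The wrapper only relies on the documented exit code (0 or 1) and the --port option, so it does not need to parse any JSON.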


redoracle commented Jan 3, 2022

That's very thoughtful and very nice!!

Well done! Tnx


3 participants