One of the things we like about the Elrond community is that node operators and Elrond developers openly share their experiences. The most important platform for this is the Elrond Validators Telegram group, a safe place to ask any question you have about running Elrond nodes. This transparency has benefited the whole group, including us at Viastake!
Since we ran our first Elrond node on the Elrond testnet in August 2019, we have constantly improved our methods. Monitoring our nodes and our infrastructure 24/7 has been a big step forward. In this blog post, we would like to share our monitoring methods with you.
On our GitHub page, we are sharing much of our monitoring setup in two repositories:
- Setup guide for an Icinga 2 master node with Grafana integration employing InfluxDB (Ubuntu 18.04)
- Setup guide for monitoring Elrond nodes remotely with Icinga2 and Grafana (Ubuntu 18.04)
If you want to start monitoring Elrond nodes with Icinga 2 after reading this blog post, make sure that you start with the first repo and then continue with the second one. Together, they should get you on your way with monitoring! We hope you and the entire Elrond community will benefit from them.
Choosing a monitoring solution
There are many options out there for monitoring your infrastructure and nodes. Icinga 2 is the one we use, but other Elrond node operators use Prometheus, Zabbix, Nagios, Netdata, and LibreNMS, to name a few examples. After watching a few demos of what Icinga 2 can do, we found that it has good docs, a vibrant and helpful community, and an interesting exchange for plugins. We don’t want to make a case for Icinga 2 over all the other options here, but we can say that Icinga 2 gives us the customizability we like, including some nice display options and support for all kinds of notifications. Also, Icinga Web 2 has a user-friendly GUI, including the ability to integrate Grafana graphs.
What we monitor
From our master monitoring server, we monitor both system-wide performance and the performance of the individual nodes on our remote agent servers. These are the metrics we monitor, with the thresholds for each state:
|System metric||OK||Warning||Critical|
|Ping||host connected||host disconnected||host disconnected|
|Agent health||zone connected||zone disconnected||zone disconnected|
|Load average over 1 minute||≤ 5 x number of CPUs||> 5 x number of CPUs||> 6 x number of CPUs|
|Load average over 5 minutes||≤ 3 x number of CPUs||> 3 x number of CPUs||> 5 x number of CPUs|
|Load average over 15 minutes||≤ 2 x number of CPUs||> 2 x number of CPUs||> 3 x number of CPUs|
|RAM usage||≤ 70%||> 70%||> 90%|
|Swap usage||≤ 60%||> 60%||> 80%|
|Disk storage||≤ 70%||> 70%||> 90%|
|Open files||≤ 50% of limit||> 50% of limit||> 80% of limit|
|Running processes||≤ 250||> 250||> 400|
|Active users||≤ 3||> 3||> 6|
|Network incoming traffic||≤ 50 Mbps||> 50 Mbps||> 100 Mbps|
|Network outgoing traffic||≤ 50 Mbps||> 50 Mbps||> 100 Mbps|
|Elrond node metric||OK||Warning||Critical|
|App version||string starting with v||not OK||not OK|
|Public validator key||192-char string||not OK||not OK|
|Node type||validator||observer / undefined||observer / undefined|
|Number of connected peers||≥ 15||< 15||< 5|
|CPU usage node||≤ 70% of avbl/node||> 70% of avbl/node||> 90% of avbl/node|
|RAM usage node||≤ 70% of avbl/node||> 70% of avbl/node||> 90% of avbl/node|
|Network incoming traffic node||≤ 20 Mbps||> 20 Mbps||> 40 Mbps|
|Network outgoing traffic node||≤ 20 Mbps||> 20 Mbps||> 40 Mbps|
|Rejected/proposed blocks||≤ 20%||> 20%||> 50%|
These triggers are still being tweaked, and some new metrics will likely be added in the future.
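To illustrate how thresholds like these translate into check results, here is a minimal Python sketch. It is not the code from our repositories. The function names are hypothetical, and only two rows from the tables are shown: the 1-minute load average (higher is worse) and the number of connected peers (lower is worse). The exit codes 0 to 3 and the performance data after the pipe character follow the usual Nagios/Icinga plugin convention.

```python
# Exit codes used by the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_high(value, warn, crit):
    """State for metrics where a higher value is worse (load, RAM, disk)."""
    if value > crit:
        return CRITICAL
    if value > warn:
        return WARNING
    return OK


def classify_low(value, warn, crit):
    """State for metrics where a lower value is worse (connected peers)."""
    if value < crit:
        return CRITICAL
    if value < warn:
        return WARNING
    return OK


def check_load1(load1, cpu_count):
    """1-minute load: warning above 5 x CPUs, critical above 6 x CPUs."""
    return classify_high(load1, 5 * cpu_count, 6 * cpu_count)


def check_peers(peers):
    """Connected peers: warning below 15, critical below 5."""
    return classify_low(peers, 15, 5)


def plugin_output(name, state, label, value, warn, crit):
    """Build one status line, with performance data after the pipe."""
    return f"{name} {STATE_NAMES[state]} | {label}={value};{warn};{crit}"
```

A real plugin would print the line returned by `plugin_output` and then exit with the state as its exit code, which is what Icinga 2 uses to decide between OK, WARNING, and CRITICAL.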
Email and Telegram notifications
Notifications for service and host problems are an integral part of any monitoring setup. We have configured both email and Telegram notifications for Icinga 2. At first we had some issues with the email notifications, because the emails were flagged as spam. After applying a few tips and tricks, we solved this and even reached a 10/10 anti-spam score.
Currently we have one email account and one Telegram bot configured for our notifications. In the future we’ll probably add accounts for just the CRITICAL messages, the ones we want to wake us up in the middle of the night.
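As a rough idea of what a Telegram notification involves, the sketch below builds a request for the `sendMessage` method of the Telegram Bot API using only the Python standard library. It is an illustration, not our actual notification script: the function names, the message layout, and the optional separate chat for CRITICAL messages are all assumptions.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def format_message(host, service, state, output):
    """Hypothetical message layout: state first, then host and service."""
    return f"[{state}] {host} / {service}\n{output}"


def pick_chat(state, default_chat_id, critical_chat_id=None):
    """Route CRITICAL problems to a dedicated chat when one is configured."""
    if state == "CRITICAL" and critical_chat_id is not None:
        return critical_chat_id
    return default_chat_id


def build_request(bot_token, chat_id, text):
    """Prepare a POST to the Bot API sendMessage method with a JSON body."""
    url = f"{TELEGRAM_API}/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the prepared request and return the decoded JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping `build_request` separate from `send` makes it possible to inspect or test the message without contacting Telegram. In an Icinga 2 setup, the host, service, state, and output values would come from the notification command rather than being hard-coded.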
Further analysis with Grafana
We have integrated Grafana into Icinga Web 2 to visualize the performance metrics for each check service. Still, it can be useful to get a graphical overview of an agent server as a whole, or to compare the behavior of different agent servers. Visualizing trends in Grafana dashboards is especially helpful when troubleshooting, so we use those as well.
We hope you enjoyed this blog post. If you are an Elrond node operator and want to start using Icinga 2 for monitoring your systems and your nodes, do check out Viastake on GitHub, particularly the repos that we mentioned in the introduction.