Challenges in Zabbix
Your challenge starts when you convince your boss that you are responsible for implementing an open source tool to monitor the IT environment for your company. Time progresses and the Zabbix platform starts gaining more and more responsibilities and visibility. However, suppose some important steps were forgotten, or you didn't have all of the information needed to carry out more detailed planning or sizing. The fact is that Zabbix has earned the reputation of an all-seeing eye, and now your company uses Zabbix to support business growth and ensure proper delivery of services.
Since 2007, I have been working with the Zabbix community, and since 2012, I have been a Zabbix certified trainer. In that time, I have heard and seen a lot of people talking about performance issues with Zabbix. I have no doubt that most of these problems are related to a misconfiguration or a misunderstanding of Zabbix's parameters and concepts. Some basic information about Zabbix that people usually don't know or don't care about is as follows:
- The number of hosts isn't the most important thing for performance: Usually, people ask, "How many hosts can I manage with Zabbix?" The right question should be, "How many new values per second (nvps) can I manage with Zabbix?" In terms of nvps, one host gathering 100 items is the same as 100 hosts gathering one item each at the same interval.
- The default templates shouldn't be used in a production environment: It happens that default templates are the only examples that show you how to use item keys, triggers, graphs, LLD, macros, and other Zabbix features and functions. Such templates usually have gathering intervals shorter than what you really need.
- How many users will use the Zabbix interface: This is a point that is almost always forgotten. Usually, people start using Zabbix alone or together with a few colleagues, and they have only a few maps and screens. But what if you need to create a lot of users to use and explore the Zabbix interface? What if your boss asks you to create some dashboards, putting a lot of data together? At this point, you'll start to think about web server performance.
- Using a default database deployment: The MySQL database comes with almost all Linux distributions, but the default my.cnf isn't fit to work with Zabbix. In other words, the default MySQL deployment isn't the best configuration to work with, and you will need to adjust some basic (and maybe advanced) parameters to attain the best performance with Zabbix. People often pay no attention to read or write parameters. It's very important to know how Zabbix works and then prepare your database accordingly.
- Item types and value types will directly affect performance: Did you know that active items are better than passive items? With active items, the Zabbix server has less work to do, and each Zabbix agent handles its own queue. Did you know that numerical data is better than text data? Zabbix uses different tables for each data type (float, integer, text, log, and so on), and each database table has a different row configuration.
- Time retention needs to be shorter than the template's default configuration: By default, Zabbix retains historical data for 90 days and trends data for 365 days (a year). Of course, you don't need to retain 90 days of historical data from an icmpping item key. Nor do you need to retain 365 days of trends data from this key. So, you need to choose the right period to retain your data (historical and trends). You will need to retain some data for a long time, and you can get rid of the other data earlier.
- The number of triggers and the functions within them will affect performance: Some people don't realize that a trigger with a very simple function, such as last(), has better performance than a trigger with a more complex function, such as min(), max(), or avg().
- Items that are not supported can affect Zabbix's performance: The Zabbix server will keep trying to gather these items, and as long as they return an error, the server does that work without getting any results.
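To make the nvps and retention points from the list above more concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration: the ItemGroup class, the helper functions, and the 60-second interval are hypothetical names and values chosen for this example, not anything that Zabbix itself provides. It counts values only and does not estimate disk usage, since row size depends on the value type and the database.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class ItemGroup:
    """Hypothetical helper: hosts that each gather the same number of
    items at the same update interval."""
    hosts: int
    items_per_host: int
    interval_seconds: int

    def nvps(self) -> float:
        # New values per second produced by this group of hosts.
        return self.hosts * self.items_per_host / self.interval_seconds


def total_nvps(groups) -> float:
    """Sum the nvps contributed by several item groups."""
    return sum(group.nvps() for group in groups)


def history_rows(nvps: float, retention_days: int) -> int:
    """Approximate number of history values kept for a retention period."""
    return round(nvps * SECONDS_PER_DAY * retention_days)


# One host with 100 items and 100 hosts with one item each produce the
# same load when the interval (assumed here to be 60 seconds) is equal.
one_big_host = ItemGroup(hosts=1, items_per_host=100, interval_seconds=60)
many_small_hosts = ItemGroup(hosts=100, items_per_host=1, interval_seconds=60)

print(one_big_host.nvps() == many_small_hosts.nvps())
print(total_nvps([one_big_host, many_small_hosts]))

# Compare the default 90 days of history with a shorter, assumed 7 days.
print(history_rows(one_big_host.nvps(), retention_days=90))
print(history_rows(one_big_host.nvps(), retention_days=7))
```

The comparison at the end shows why the retention period matters: for the same nvps, the number of stored history values grows linearly with the number of days retained.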