Zabbix can be defined as a distributed monitoring system with a centralized web interface (on which we can manage almost everything). Among its main features, we will highlight the following ones:
- Zabbix has a centralized web interface
- The server can be run on most Unix-like operating systems
- This monitoring system has native agents for most Unix, Unix-like, and Microsoft Windows operating systems
- The system is easy to integrate with other systems, thanks to an API that can be used from many different programming languages and to the options that Zabbix itself provides
- Zabbix can monitor via SNMP (v1, v2, and v3), IPMI, JMX, ODBC, SSH, HTTP(s), TCP/UDP, and Telnet
- This monitoring system gives us the possibility of creating custom items and graphs and interpolating data
- The system is easy to customize
The following diagram shows the three-tier system of a Zabbix architecture:
The Zabbix architecture for a large environment is composed of three different servers/components (each of which should also be configured for high availability). These three components are as follows:
- A web server
- A database server
- A Zabbix server
The whole Zabbix infrastructure in large environments allows us to have two other actors that play a fundamental role. These actors are the Zabbix agents and the Zabbix proxies. An example is represented in the following figure:
On this infrastructure, we have a centralized Zabbix server that is connected to different proxies, usually one for each server farm or a subnetwork.
The Zabbix server will acquire data from Zabbix proxies, the proxies will acquire data from all the Zabbix agents connected to them, all the data will be stored on a dedicated RDBMS, and the frontend will be exposed with a web interface to the users. Looking at the technologies used, we see that the web interface is written in PHP and that the server, proxies, and agents are written in C.
Note
The server, proxies, and agents are written in C to give the best performance and least resource usage possible. All the components are deeply optimized to achieve the best performance.
We can implement different kinds of architecture using proxies. There are several types of architectures and, in the order of complexity, we find the following ones:
- The single-server installation
- One server and many proxies
- Distributed installation (available only until 2.3.0)
The single-server installation is not suggested in a large environment. It is the basic installation, where a single server does all the monitoring, and it can be considered a good starting point.
Most likely, in our infrastructure, we might already have a Zabbix installation. Zabbix is quite flexible, and this permits us to upgrade this installation to the next step: proxy-based monitoring.
Proxy-based monitoring is implemented with one Zabbix server and several proxies, that is, one proxy per branch or data center. This configuration is easy to maintain and offers the advantage of a centralized monitoring solution. It strikes the right balance between large environment monitoring and complexity. From this point, we can (with a lot of effort) expand our installation to a complete and distributed monitoring architecture. The installation consisting of one server and many proxies is the one shown in the previous diagram.
Starting from the 2.4.0 version of Zabbix, the distributed scenarios that include nodes are no longer a possible setup. Indeed, if you download the source code of the Zabbix distribution discussed in this book, Zabbix 2.4.3, you'll see that the branch of code that managed the nodes has been removed.
All the possible Zabbix architectures will be discussed in detail in Chapter 2, Distributed Monitoring.
The installation that will be covered in this chapter is the one consisting of a server for each of the following base components:
- A web frontend
- A Zabbix server
- A Zabbix database
We will start describing this installation because:
- It is a basic installation that is ready to be expanded with proxies and nodes
- Each component is on a dedicated server
- This kind of configuration is the starting point to monitor large environments
- It is widely used
- Most probably, it will be the starting point of your upgrade and expansion of the monitoring infrastructure.
Actually, this first setup for a large environment, as explained here, can be useful if you are looking to improve an existing monitoring infrastructure. If your current monitoring solution is not implemented in this way, the first thing to do is plan the migration to three different dedicated servers.
If the environment is set up on three tiers but still performs poorly, you can then plan which kind of large environment setup will best fit your infrastructure.
When you monitor your large environment, there are some points to consider:
- Use dedicated servers to keep things easy to extend
- Implement a high-availability setup
- Implement a fault-tolerant architecture
On this three-layer installation, the CPU usage of the server component will not be really critical at least for the Zabbix server. The CPU consumption is directly related to the number of items to store and the refresh rate (number of samples per minute) rather than the memory.
Indeed, the Zabbix server will not consume excessive CPU but is a bit greedier for memory. We can consider that four CPU cores with 8 GB of RAM can be used for more than 1,000 hosts without any issues.
Basically, there are two ways to install Zabbix:
- Downloading the latest source code and compiling it
- Installing it from packages
There is also another way to have a Zabbix server up and running, that is, by downloading the virtual appliance, but we don't consider this case as it is better to have full control of our installation and be aware of all the steps. Also, the major concern about the virtual appliance is that Zabbix itself states, directly on the download page http://www.zabbix.com/download.php, that it is not production ready.
The installation from packages gives us the following benefits:
- It makes the process of upgrading and updating easier
- Dependencies are automatically sorted
The source code compilation also gives us benefits:
- We can compile only the required features
- We can statically build the agent and deploy it on different Linux flavors
- We can have complete control over the update
It is quite usual to have different versions of Linux, Unix, and Microsoft Windows in a large environment. These scenarios are quite common on a heterogeneous infrastructure, and if we use the agent distribution package of Zabbix on each Linux server, we will certainly have different versions of the agent and different locations for the configuration files.
The more standardized we are across the servers, the easier it will be to maintain and upgrade the infrastructure. The --enable-static option gives us a way to standardize the agent across different Linux versions and releases, and this is a strong benefit. The agent, if statically compiled, can be easily deployed everywhere, and we will have the same location for the agent and its configuration file (and we can use the same configuration file apart from the node name). The deployment will be standardized; the only thing that may vary is the start/stop script and how to register it on the right init runlevel.
The same concept can be applied to commercial Unix, bearing in mind that the agent must be compiled per vendor, so the same agent can be deployed on different versions of Unix released by the same vendor.
Before compiling Zabbix, we need to take a look at the prerequisites. The web frontend will need at least the following versions:
- Apache (1.3.12 or later)
- PHP (5.3.0 or later)
Instead, the Zabbix server will need:
- An RDBMS: The open source alternatives are PostgreSQL and MySQL
- zlib-devel
- mysql-devel: This is used to support MySQL (not needed on our setup)
- postgresql-devel: This is used to support PostgreSQL
- glibc-devel
- curl-devel: This is used in web monitoring
- libidn-devel: curl-devel depends on it
- openssl-devel: curl-devel depends on it
- net-snmp-devel: This is used on SNMP support
- popt-devel: net-snmp-devel might depend on it
- rpm-devel: net-snmp-devel might depend on it
- OpenIPMI-devel: This is used to support IPMI
- iksemel-devel: This is used for the Jabber protocol
- libssh2-devel
- sqlite3: This is required if SQLite is used as the Zabbix backend database (usually on proxies)
To install all the dependencies on a Red Hat Enterprise Linux distribution, we can use yum (from root), but first of all, we need to include the EPEL repository with the following command:
Using yum install, install the following packages:
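As a hedged sketch of these two steps, the script below prints (rather than runs) the yum commands. The package names come from the prerequisites listed earlier; using the epel-release package to enable EPEL is an assumption (on RHEL, the EPEL release RPM may have to be installed from the Fedora project instead), and the run helper with its DRY_RUN variable is purely illustrative.

```shell
#!/bin/sh
# Development packages named in the prerequisites list.
# mysql-devel and sqlite3 are left out because this setup uses PostgreSQL.
PACKAGES="zlib-devel postgresql-devel glibc-devel curl-devel libidn-devel \
openssl-devel net-snmp-devel popt-devel rpm-devel OpenIPMI-devel \
iksemel-devel libssh2-devel"

# Hypothetical helper: print the command unless DRY_RUN=0 is exported.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption: the EPEL repository is enabled through the epel-release package.
run yum install -y epel-release
# $PACKAGES is intentionally unquoted so that it splits into separate arguments.
run yum install -y $PACKAGES
```

Once the printed commands look right, exporting DRY_RUN=0 and running the script as root would execute them.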
Tip
The iksemel-devel package is used to send Jabber messages. This is a really useful feature as it enables Zabbix to send chat messages. Furthermore, Jabber is managed as a media type on Zabbix, and you can also set your working time, which avoids sending messages when you are not in the office.
Zabbix needs an unprivileged user account to run. If the daemon is started from root, it will automatically switch to the zabbix account if this one is present:
Note
The server should never run as root because this will expose the server to a security risk.
The preceding lines permit you to enforce the security of your installation. The server and agent should run with two different accounts; otherwise, the agent can access the Zabbix server's configuration. Now, using the Zabbix user account, we can download and extract the sources from the tar.gz file:
Now, we will configure the sources; help on the available options is shown as follows:
To configure the source for our server, we can use the following options:
Note
The zabbix_get and zabbix_sender commands are generated only if --enable-agent is specified during server compilation.
If the configuration is complete without errors, we should see something similar to this:
We will not run make install but only the compilation with # make. To specify a different location for the Zabbix server, we need to use the --prefix option of configure, for example, --prefix=/opt/zabbix. Now, follow the instructions as explained in the Installing and creating the package section.
To configure the sources to create the agent, we need to run the following command:
Tip
With the --enable-static option passed to configure, you can statically link the libraries, and the compiled binary will not require any external library; this is very useful to distribute the agent across different flavors of Linux.
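The two configure invocations discussed in this section can be sketched as follows. The flags are standard Zabbix 2.x configure options, but the exact selection is an assumption based on the prerequisites listed earlier; the run helper is illustrative and only prints the commands by default.

```shell
#!/bin/sh
# Server build: PostgreSQL backend plus the optional features whose -devel
# packages were listed as prerequisites; --enable-agent also builds zabbix_get.
SERVER_FLAGS="--enable-server --enable-agent --with-postgresql --with-libcurl \
--with-net-snmp --with-openipmi --with-jabber --with-ssh2"

# Agent-only build, statically linked so one binary suits many distributions.
AGENT_FLAGS="--enable-agent --enable-static"

# Hypothetical helper: print the command unless DRY_RUN=0 is exported.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Server: run from the top of the extracted source tree; build, do not install.
run ./configure $SERVER_FLAGS
run make

# Agent: configure a separate, clean copy of the source tree.
run ./configure $AGENT_FLAGS
run make
```

Adding --prefix=/opt/zabbix to either flag list would change the installation location, as described above.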
Installing and creating the package
In both the previous sections, the command line ends right before the installation; indeed, we didn't run the following command:
I advise you not to run the make install command but use the checkinstall software instead. This software will create the package and install the Zabbix software.
You can download the software from ftp://ftp.pbone.net/mirror/ftp5.gwdg.de/pub/opensuse/repositories/home:/ikoinoba/CentOS_CentOS-6/x86_64/checkinstall-1.6.2-3.el6.1.x86_64.rpm.
Note that checkinstall is only one of the possible alternatives that you have to create a distributable system package.
Note
We can also use a prebuilt checkinstall. The current release is checkinstall-1.6.2-20.4.i686.rpm (on Red Hat/CentOS); the package will also need rpm-build; then, from root, we need to execute the following command:
We also need to create the necessary directories:
Packaging makes things easier: the software is easy to distribute and upgrade, and we can create packages for different package managers: RPM, deb, and tgz.
Tip
checkinstall can produce a package for Debian (option -D), Slackware (option -S), and Red Hat (option -R). This is particularly useful to produce the Zabbix agent package (statically linked) and to distribute it across our servers.
Now, we need to switch to root or use the sudo checkinstall command followed by its options:
If you don't face any issue, you should get the following message:
Now, to install the package from root, you need to run the following command:
Finally, Zabbix is installed. The server binaries will be installed in <prefix>/sbin, utilities will be in <prefix>/bin, and the man pages will be under the <prefix>/share location.
To provide a complete picture of all the possible install methods, you need to be aware of the steps required to install Zabbix using the prebuilt rpm packages.
The first thing to do is install the repository:
This will create the yum repo file, /etc/yum.repos.d/zabbix.repo, and will enable the repository.
Now, it is easy to install our Zabbix server and web interface; you can simply run this command on the server:
On the web server, remember to add the yum repository first:
To install the agent, you only need to run the following command:
Note
If you have decided to use the RPM packages, please bear in mind that the configuration files are located under /etc/zabbix/. This book will nevertheless continue to refer to the standard location: /usr/local/etc/.
Also, if you have a local firewall active where you're deploying your Zabbix agent, you need to properly configure iptables to allow traffic to the Zabbix agent port with the following command, which you need to run as root:
For the server configuration, we only have one file to check and edit:
The configuration files are located inside the following directory:
We need to change the /usr/local/etc/zabbix_server.conf file and write the username, the corresponding password, and the database name there; note that the database configuration will be done later on in this chapter and that, for now, you can write the planned username/password/database name. Then, using the zabbix account, you need to edit:
Change the following parameters:
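As an illustration, the database-related parameters (DBHost, DBName, DBUser, and DBPassword) can be set with a small helper. This is a minimal sketch: the set_conf function is hypothetical, the host address and password are placeholders, and the demonstration works on a scratch file rather than on the real zabbix_server.conf.

```shell
#!/bin/sh
# Set KEY=VALUE in a Zabbix-style configuration file: replace an existing
# uncommented KEY= line, otherwise append a new one.
# Caveat: values containing '|', '&' or '\' would need escaping for sed.
set_conf() {
    key=$1 value=$2 file=$3
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demonstration on a scratch file; point CONF at
# /usr/local/etc/zabbix_server.conf to edit the real one.
CONF=$(mktemp)
printf '# DBHost=localhost\nDBName=zabbix\n' > "$CONF"

set_conf DBHost 10.6.0.10 "$CONF"        # hypothetical database server address
set_conf DBName zabbix_db "$CONF"
set_conf DBUser zabbix "$CONF"
set_conf DBPassword 'change-me' "$CONF"  # placeholder, not a real password
cat "$CONF"
```

Commented defaults are left untouched, so the original file remains readable as documentation.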
Note
Now, our Zabbix server is configured and almost ready to go. The location of zabbix_server.conf depends on the sysconfdir compile-time installation variable. Don't forget to take appropriate measures to protect access to the configuration file with the following command:
The location of the default external scripts will be as follows:
This depends on the datadir compile-time installation variable. The alertscripts directory will be in the following location:
Tip
This can be changed during compilation, and it depends on the datadir installation variable.
Now, we need to configure the agent. The configuration file is where we need to write the IP address of our Zabbix server. Once done, it is important to add two new services to the right runlevel to be sure that they will start when the server enters the right runlevel.
To complete this task, we need to install the start/stop scripts at the following locations:
- /etc/init.d/zabbix-agent
- /etc/init.d/zabbix-proxy
- /etc/init.d/zabbix-server
There are several prebuilt scripts inside the misc folder located at the following location:
This folder contains different startup scripts for different Linux variants, but this tree is not actively maintained and tested, and may not be up to date with the most recent versions of Linux distributions, so it is better to take care and test it before going live.
Once the start/stop scripts are added inside the /etc/init.d folder, we need to add them to the service list:
Now, all that is left is to tell the system which runlevel it should start them on; we are going to use runlevels 3 and 5:
Also, in case you have a local firewall active on your Zabbix server, you need to properly configure iptables to allow traffic to the Zabbix server port with the following command, which you need to run as root:
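A possible form of such a rule is sketched below. The ports are the Zabbix defaults (10050 for the agent, 10051 for the server); the accept_rule helper is hypothetical and only prints the command, so that it can be reviewed before being run as root.

```shell
#!/bin/sh
# Default Zabbix TCP ports: 10050 for the agent, 10051 for the server.
ZABBIX_AGENT_PORT=10050
ZABBIX_SERVER_PORT=10051

# Compose (but do not execute) an iptables rule that accepts inbound TCP
# traffic on the given port by inserting it at the top of the INPUT chain.
accept_rule() {
    echo "iptables -I INPUT 1 -p tcp --dport $1 -j ACCEPT"
}

accept_rule "$ZABBIX_SERVER_PORT"   # on the Zabbix server
accept_rule "$ZABBIX_AGENT_PORT"    # on each monitored host
```

A rule added this way lives only in the running kernel tables, so it still has to be saved through the distribution's own mechanism to survive a reboot.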
Currently, we can't start the server; before starting up our server, we need to configure the database.
Once we complete the previous step, we can walk through the database server installation. All those steps will be done on the dedicated database server. The first thing to do is install the PostgreSQL server. This can be easily done with the package offered from the distribution, but it is recommended that you use the latest 9.x stable version.
Red Hat is still distributing the 8.x on RHEL6.4. Also, its clones, such as CentOS and ScientificLinux, are doing the same. PostgreSQL 9.x has many useful features; at the moment, the latest stable, ready-for-production version is 9.4.
To install PostgreSQL 9.4, there are some easy steps to follow:
- Locate the .repo files:
  - Red Hat: This is present at /etc/yum/pluginconf.d/rhnplugin.conf [main]
  - CentOS: This is present at /etc/yum.repos.d/CentOS-Base.repo, [base] and [updates]
- Append the following line on the section(s) identified in the preceding step:
- Browse to http://yum.postgresql.org and find your correct RPM. For example, to install PostgreSQL 9.4 on RHEL 6, go to http://yum.postgresql.org/9.4/redhat/rhel-6-x86_64/pgdg-redhat94-9.4-1.noarch.rpm.
- Install the repo with yum localinstall http://yum.postgresql.org/9.4/redhat/rhel-6-x86_64/pgdg-centos94-9.4-1.noarch.rpm.
- Now, to list all the postgresql packages, use the following command:
- Once you find the package in the list, install it using the following command:
- Once the packages are installed, we need to initialize the database:
Alternatively, we can also initialize this database:
- Now, we need to change a few things in the configuration file /var/lib/pgsql/9.4/data/postgresql.conf. We need to change the listen address and the relative port:
We also need to add a couple of entries for zabbix_db in /var/lib/pgsql/9.4/data/pg_hba.conf, right after the following lines:
The local keyword matches all the connections made through Unix-domain sockets. It is followed by the database name (zabbix_db), the username (zabbix), and the authentication method (in our case, md5).
The host keyword matches all the connections that are coming from TCP/IP (this includes SSL and non-SSL connections). It is followed by the database name (zabbix_db), the username (zabbix), the network and mask of all the hosts that are allowed, and the authentication method (in our case, md5).
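To make the field order concrete, the helper below prints one local and one host entry in pg_hba.conf syntax. It is only a sketch: the pg_hba_entries function is hypothetical, and the subnet passed to it is an example value that must be replaced with the networks hosting the web server and the Zabbix server.

```shell
#!/bin/sh
# Print pg_hba.conf entries for one database/user pair.
# Field order for "local": type, database, user, method.
# Field order for "host":  type, database, user, address (CIDR), method.
pg_hba_entries() {
    db=$1 user=$2 network=$3
    printf 'local   %s   %s   md5\n' "$db" "$user"
    printf 'host    %s   %s   %s   md5\n' "$db" "$user" "$network"
}

# Example subnet only; call the helper once per allowed network.
pg_hba_entries zabbix_db zabbix 10.6.0.0/24
```

The output can be reviewed and then appended to pg_hba.conf; PostgreSQL must reload its configuration before the new entries take effect.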
- The address of the allowed hosts in our case should be a network with its mask because we need to allow both the web interface (hosted on our web server) and the Zabbix server, which is on a different dedicated server, for example, 10.6.0.0/24 (a small subnet) or even a larger network. Most likely, the web interface as well as the Zabbix server will be in a different network, so make sure that you list all the networks and relative masks here.
- Finally, we can start our PostgreSQL server using the following command:
Alternatively, we can use this command:
To create a database, we need to be the postgres user (or the user that runs PostgreSQL in your distribution). Create a user for the database (our Zabbix user) and log in as that user to import the schema with the relative data.
The code to import the schema is as follows:
Once we become the postgres user, we can create the database (in our example, it is zabbix_db):
The database creation scripts are located in the /database/postgresql folder of the extracted source files. They need to be installed exactly in this order:
Tip
The -h <DB-HOST-IP-ADDRESS> option used on the psql command will avoid the use of the local entry contained in the standard configuration file /var/lib/pgsql/9.4/data/pg_hba.conf.
Now, finally, it is the time to start our Zabbix server and test the whole setup for our Zabbix server/database:
A quick check of the log file can give us more information about what is currently happening in our server. We should be able to get the following lines from the log file (the default location is /tmp/zabbix_server.log):
Actually, the default log location is not ideal, as /tmp will be cleaned up in the event of a reboot and the logs there are not rotated or managed properly.
We can change the default location by simply changing an entry in /usr/local/etc/zabbix_server.conf. You can change the file as follows:
Create the directory structure with the following command from root:
Another important thing to set up is logrotate, as it is better to have an automated rotation of our log file. This can be quickly implemented by adding the relative configuration in the logrotate directory /etc/logrotate.d/.
To do that, create the following file by running the command from the root account:
Use the following content:
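A policy along the following lines would do the job. This is a hedged sketch, not a definitive configuration: the log path /var/log/zabbix, the weekly schedule, the number of rotations, and the owner and group on the create line are all assumptions to adapt to your setup, and the logrotate_policy helper is illustrative.

```shell
#!/bin/sh
# Print a logrotate policy for the Zabbix logs.
# Assumptions: logs live in /var/log/zabbix and the server runs as zabbixsvr;
# adjust the path, schedule, and ownership to match the real installation.
logrotate_policy() {
    cat <<'EOF'
/var/log/zabbix/zabbix_*.log {
    weekly
    rotate 12
    missingok
    notifempty
    compress
    create 0640 zabbixsvr zabbixsvr
}
EOF
}

logrotate_policy
```

As root, the output can be redirected into a file under /etc/logrotate.d/ once it has been reviewed.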
Once those changes have been done, you need to restart your Zabbix server with the following command (run it using root):
Another thing to check is whether our server is running with our user:
The preceding lines show that zabbix_server is running with the user 502. We will go ahead and verify that 502 is the user we previously created:
The preceding lines show that all is fine. The most common issue normally is the following error:
Several factors can cause this issue:
- Firewall (local on our servers or an infrastructure firewall)
- The postgres configuration
- Wrong data in zabbix_server.conf
Tip
We can try to isolate the problem by running the following command on the database server:
If we have a connection, we can try the same command from the Zabbix server; if it fails, it is better to check the firewall configuration. If we get the fatal identification-authentication failed error, it is better to check the pg_hba.conf file.
Now, the second thing to check is the local firewall and then iptables. You need to verify that the PostgreSQL port is open on the database server. If the port is not open, you need to add a firewall rule using the root account:
Now, it is time to check how to start and stop your Zabbix installation. The scripts that follow are a bit customized to manage the different users for the server and the agent.
Note
The following startup script works fine with the standard compilation, that is, without using --prefix and with the zabbixsvr user. If you are running on a different setup, make sure that you customize the executable location and the user:
For zabbix-server, create the zabbix-server file at /etc/init.d with the following content:
The next parameter, zabbixsvr, is specified inside the start() function, and it determines which user will be used to run our Zabbix server:
In the preceding code, the user (who will own our Zabbix server process) is specified inside the start function:
Remember to change the ownership of the server log file and configuration file of Zabbix. This is to prevent a normal user from accessing sensitive data that can be acquired with Zabbix. Logfile is specified as follows:
Here, inside the stop function, we don't need to specify the user as the start/stop script runs from root, so we can simply use killproc $prog as follows:
Note
The following startup script works fine with the standard compilation, that is, without using --prefix and with the zabbix_usr user. If you are running on a different setup, make sure that you customize the executable location and the user:
For zabbix_agent, create the following zabbix-agent file at /etc/init.d/zabbix-agent:
The following zabbix_usr parameter specifies the account that will be used to run Zabbix's agent:
The next command uses the value of the zabbix_usr variable and permits us to have two different users, one for the server and one for the agent, preventing the Zabbix agent from accessing the zabbix_server.conf file that contains our database password:
With that setup, the agent runs under the zabbix_usr account and the server under the zabbixsvr Unix account:
Some considerations about the database
Zabbix uses an interesting way to keep the database the same size at all times. The database size indeed depends upon:
- The number of processed values per second
- The housekeeper settings
Zabbix uses two ways to store the collected data: history and trends.
While history holds all the collected data (regardless of its type), trends store only numerical data. The minimum, maximum, and average values are consolidated by hour (to keep the trend a lightweight process).
Tip
String items, such as character, log, and text, have no corresponding trends since trends store only numeric values.
There is a process called the housekeeper that is responsible for handling the retention against our database. It is strongly advised that you keep the data in history as small as possible so that you do not overload the database with a huge amount of data, and store the trends for as long as you want.
Now, since Zabbix will also be used for capacity planning purposes, we need to consider using a baseline and keeping at least a whole business period. Normally, the minimum period is one year, but it is strongly advised that you keep the trend history on for at least 2 years. These historical trends will be used during the business opening and closure to have a baseline and quantify the overhead for a specified period.
Note
If we indicate 0 as the value for trends, the server will not calculate or store trends at all. If history is set to 0, Zabbix will be able to calculate only triggers based on the last value of the item itself as it does not store historical values at all.
The most common issue that we face when aggregating data is the presence of values influenced by positive spikes or fast drops in our hourly trends, which means that huge spikes can produce a mean value per hour that is not right.
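To see how a single spike skews an hourly mean, consider the following sketch. The hourly_summary helper is hypothetical and the sample values are invented for illustration; it simply consolidates a handful of samples into a minimum, a maximum, and a mean.

```shell
#!/bin/sh
# Consolidate one hour of samples into min, max, and mean, the way an
# hourly aggregate summarizes them.
hourly_summary() {
    printf '%s\n' "$@" | awk '
        NR == 1 { min = $1; max = $1 }
        { if ($1 < min) min = $1; if ($1 > max) max = $1; sum += $1 }
        END { printf "min=%d max=%d avg=%.1f\n", min, max, sum / NR }'
}

# Five quiet samples and one spike: the mean no longer resembles the
# typical load.
hourly_summary 10 10 10 10 10 310
```

Here the mean is 60.0 although five of the six samples are 10; only the minimum and maximum reveal that one outlier is responsible.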
Trends in Zabbix are implemented in a smart way. The creation script for the trend tables is as follows:
As you can see, there are two tables showing trends inside the Zabbix database:
The first table, trends, is used to store float values. The second table, trends_uint, is used to store unsigned integers. Both tables keep the following for each hour:
- Minimum value (value_min)
- Maximum value (value_max)
- Average value (value_avg)
This feature permits us to display the trends graphically, showing the influence of spikes and fast drops on the average value and understanding how, and by how much, this value has been influenced. The other tables used for historical purposes are as follows:
- history: This is used to store numeric data (float)
- history_log: This is used to store logs (a text field, which on PostgreSQL has unlimited length)
- history_str: This is used to store strings (up to 255 characters)
- history_text: This is used to store the text value (again, this is a text field, so it has unlimited length)
- history_uint: This is used to store numeric values (unsigned integers)
Calculating the definitive database size is not an easy task because it is hard to predict how many items and the relative rate per second we will have on our infrastructure and how many events will be generated. To simplify this, we will consider the worst-case scenario, where we have an event generated every second.
In summary, the database size is influenced by:
- Items: The number of items in particular
- Refresh rate: The average refresh rate of our items
- Space to store values: This value depends on RDBMS
The space used to store the data may vary from database to database, but we can simplify our work by considering mean values that quantify the maximum space consumed by the database. We can also consider the space used to store values on history to be around 50 bytes per value, the space used by a value on the trend table to be around 128 bytes, and the space used for a single event to be normally around 130 bytes.
The total amount of used space can be calculated with the following formula:
Configuration + History + Trends + Events
Now, let's look into each of the components:
- Configuration: This refers to Zabbix's configuration for the server, the web interface, and all the configuration parameters that are stored in the database; this is normally around 10 MB
- History: The history component is calculated using the following formula:
- Trends: The trends component is calculated using the following formula:
- Events: The event component is calculated using the following formula:
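Written out, the three estimates take the form below, following the sizing formulas given in the Zabbix documentation. The symbols are introduced here for convenience: $d_h$, $d_t$, and $d_e$ are the retention periods in days for history, trends, and events, $n$ is the number of items, $r$ is the refresh interval in seconds, $e$ is the number of events per second, and $b_h \approx 50$, $b_t \approx 128$, and $b_e \approx 130$ are the average bytes per stored value quoted above.

```latex
\begin{align*}
\text{History} &= d_h \times \frac{n}{r} \times 24 \times 3600 \times b_h \\
\text{Trends}  &= d_t \times \frac{n}{3600} \times 24 \times 3600 \times b_t \\
\text{Events}  &= d_e \times e \times 24 \times 3600 \times b_e
\end{align*}
```

The trends term divides by 3600 because each item produces one consolidated row per hour, regardless of its refresh interval.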
Now, coming back to our practical example, we can consider 5,000 items to be refreshed every minute, and we want to have 30 days of retention; the used space will be calculated as follows:
Note
50 bytes is the mean value of the space consumed by a value stored on history.
Considering a history of 30 days, the result is the following:
- History will be calculated as:
- As we said earlier, to simplify, we will consider the worst-case scenario (one event per second) and will also consider keeping 5 years of events
- Events will be calculated using the following formula:
When we calculate an event, we have:
Note
130 bytes is the mean value for the space consumed by a value stored on events.
- Trends will be calculated using the following formula:
When we calculate trends, we have:
Note
128 bytes is the mean value of the space consumed by a value stored on trends.
The following table shows the retention in days and the space required for the measure:
The calculated size is not the initial size of our database, but we need to keep in mind that this one will be our target size after 5 years. We are also considering a history of 30 days, so keep in mind that this retention can be reduced if there are issues since the trends will keep and store our baseline and hourly trends.
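The arithmetic behind this estimate can be reproduced with a few lines of shell. This is a minimal sketch, assuming 5 years of retention for trends as well as for events (the text fixes that horizon explicitly only for events) and about 10 MB of configuration data; the variable names are illustrative.

```shell
#!/bin/sh
# Inputs from the worked example: 5,000 items refreshed every 60 seconds,
# 30 days of history, 5 years of trends and events, one event per second.
ITEMS=5000
REFRESH=60
HISTORY_DAYS=30
LONG_DAYS=$((5 * 365))
EVENTS_PER_SECOND=1

# Average bytes per stored value, as quoted in the text.
HISTORY_BYTES=50
TRENDS_BYTES=128
EVENT_BYTES=130

# Integer arithmetic: multiply first, divide last, to avoid truncation.
history=$((HISTORY_DAYS * ITEMS * 24 * 3600 * HISTORY_BYTES / REFRESH))
# Trends hold one row per item per hour, so the 3600 factors cancel out.
trends=$((LONG_DAYS * ITEMS * 24 * TRENDS_BYTES))
events=$((LONG_DAYS * EVENTS_PER_SECOND * 24 * 3600 * EVENT_BYTES))
config=$((10 * 1000 * 1000))
total=$((config + history + trends + events))

echo "history=$history trends=$trends events=$events total=$total bytes"
```

Under these assumptions, the script reports about 10.8 GB of history, 28.0 GB of trends, and 20.5 GB of events (decimal units), for a total of roughly 59 GB.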
The history and trend retention policy can be changed easily for every item. This means that we can create a template with items that have a different history retention by default. Normally, the history is set to 7 days, but for some kinds of measures, such as a web scenario, we may need to keep all the values for more than a week; in that case, we can change the value for each item.
In our example, we considered a worst-case scenario with 30 days of retention, but it is good advice to keep the history only for 7 days or even less in large environments. If we perform a basic calculation for an item that is updated every 60 seconds and has its history preserved for 7 days, it will generate (values per hour) * (hours in a day) * (number of days in history) = 60*24*7 = 10,080 values.
This means that, for each item, we will have 10,080 rows in a week, and that gives us an idea of the number of rows that we will produce in our database.
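The same calculation generalizes to any update interval and retention period. The values_per_item helper below is a hypothetical illustration of it.

```shell
#!/bin/sh
# Rows written to history for one item: values per day times days retained.
# Integer division truncates when the interval does not divide a day evenly.
values_per_item() {
    interval=$1 days=$2          # update interval in seconds, retention in days
    echo $(( (24 * 3600 / interval) * days ))
}

values_per_item 60 7
```

With a 60-second interval and 7 days of history, it prints 10080, matching the figure above; multiplying by the number of items gives the order of magnitude of the history tables.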
The following screenshot represents the details of a single item:
Some considerations about housekeeping
Housekeeping can be quite a heavy process. As the database grows, housekeeping will require more and more time to complete its work. This issue can be sorted using the delete_history() database function.
There is a way to deeply improve the housekeeping performance and fix this performance drop. The heaviest tables are: history, history_uint, trends, and trends_uint.
A solution is PostgreSQL table partitioning and the partitioning of the entire table on a monthly basis. The following figure displays the standard and nonpartitioned history table on the database:
The following figure shows how a partitioned history table will be stored in the database:
Partitioning is basically the splitting of a large logical table into smaller physical pieces. This feature can provide several benefits:
- The performance of queries can be improved dramatically in situations where there is heavy access of the table's rows in a single partition.
- The partitioning will reduce the index size, making it more likely to fit in the memory of the parts that are being used heavily.
- Massive deletes can be accomplished by removing partitions, instantly reducing the space allocated for the database without introducing fragmentation and a heavy load on index rebuilding. The
delete partition
command also entirely avoids the vacuum overhead caused by a bulk delete. - When a query updates or requires access to a large percentage of the partition, using a sequential scan is often more efficient than using the index with random access or scattered reads against that index.
All these benefits are only worthwhile when a table is very large. The strong point of this kind of architecture is that the RDBMS will directly access the needed partition, and a delete will simply be the removal of a partition. Partition deletion is a fast process and requires few resources.
Unfortunately, Zabbix is not able to manage the partitions, so we need to disable the housekeeping and use an external process to accomplish the housekeeping.
The partitioning approach described here has certain benefits compared to the other partitioning solutions:
- This does not require you to prepare the database to partition it with Zabbix
- This does not require you to create/schedule a cron job to create the tables in advance
- This is simpler to implement than other solutions
This method will prepare partitions under the desired partitioning schema with the following convention:
- Daily partitions are in the form of partitions.tablename_pYYYYMMDD
- Monthly partitions are in the form of partitions.tablename_pYYYYMM
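The naming convention above can be illustrated with a small Python helper. This is only a sketch of the convention; the function partition_name() and its parameters are hypothetical and are not part of the scripts referenced in this section.

```python
from datetime import date


def partition_name(table: str, day: date, period: str) -> str:
    """Build a partition name following the convention described above.

    period is "day" for daily partitions or "month" for monthly ones.
    """
    if period == "day":
        suffix = day.strftime("%Y%m%d")
    elif period == "month":
        suffix = day.strftime("%Y%m")
    else:
        raise ValueError("period must be 'day' or 'month'")
    return f"partitions.{table}_p{suffix}"


print(partition_name("history", date(2015, 6, 25), "day"))
print(partition_name("trends", date(2015, 6, 25), "month"))
```

Because the date suffix is zero-padded and ordered from year to day, partition names of the same table sort chronologically, which makes it easy to spot the oldest ones.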
All the scripts described here are available at https://github.com/smartmarmot/Mastering_Zabbix.
To set up this feature, we need to create a schema where we can place all the partitioned tables; then, within a psql session, we need to run the following command:
Now, we need a function that will create the partition. So, to connect to Zabbix, you need to run the following code:
Note
Please ensure that your database has been set up with the user Zabbix. If you're using a different role/account, please change the last line of the script accordingly:
Now, we need a trigger connected to each table that we want to separate. This trigger will fire on each INSERT statement, and if the partition does not exist yet, the function will create it right before the INSERT statement is executed:
At this point, the only thing missing is the housekeeping function that will replace the one built into Zabbix, after which we can disable the native one. The function that will handle housekeeping for us is as follows:
Now you have the housekeeping ready to run. To enable housekeeping, we can use crontab by adding the following entries:
Those two tasks should be scheduled on the database server's crontab. In this example, we will keep 7 days of history and 24 months of trends.
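To get a feel for what such a retention policy means in terms of partitions, the following Python sketch works out which partitions fall outside the retention window. It is not the housekeeping function used in this section; the helper names, the sample partition list, and the exact boundary rule (a partition is expired when its date suffix is older than the cutoff) are assumptions made for illustration.

```python
from datetime import date, timedelta


def cutoff_suffix(today: date, keep: int, period: str) -> str:
    """Date suffix of the oldest partition that is still kept."""
    if period == "day":
        return (today - timedelta(days=keep)).strftime("%Y%m%d")
    if period == "month":
        # Count months since year 0, step back, and convert back to year/month.
        months = today.year * 12 + (today.month - 1) - keep
        return f"{months // 12:04d}{months % 12 + 1:02d}"
    raise ValueError("period must be 'day' or 'month'")


def expired_partitions(names, today, keep, period):
    """Partitions whose suffix is strictly older than the cutoff."""
    cutoff = cutoff_suffix(today, keep, period)
    return [n for n in names if n.rsplit("_p", 1)[1] < cutoff]


# Hypothetical daily partitions of the history table.
history = [
    "partitions.history_p20150617",
    "partitions.history_p20150618",
    "partitions.history_p20150625",
]
print(expired_partitions(history, date(2015, 6, 25), 7, "day"))
```

The comparison is done on the suffix strings, which works because the zero-padded date format sorts chronologically.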
Now, we can finally disable the Zabbix housekeeping. To disable the housekeeping on Zabbix 2.4, the best way is to use the web interface by selecting Administration | General | Housekeeper, and there, you can disable the housekeeping for the Trends and History tables, as shown in the following screenshot:
Now the built-in housekeeping is disabled, and you should see a lot of improvement in the performance. To keep your database as lightweight as possible, you can clean up the following tables:
- acknowledges
- alerts
- auditlog
- events
- service_alarms
Once you have chosen your own retention, you need to add a retention policy; for example, in our case, it will be 2 years of retention. With the following crontab entries, you can delete all the records older than 63072000 (2 years expressed in seconds):
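The figure of 63072000 seconds can be checked with a short Python sketch, which also shows how a cutoff timestamp is derived from it. The function names are hypothetical, and the sketch assumes 365-day years and that record timestamps are stored as Unix epoch seconds; it is not a replacement for the cleanup entries themselves.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def retention_seconds(years: int) -> int:
    """Retention period in seconds, assuming 365-day years."""
    return years * 365 * SECONDS_PER_DAY


def cutoff_epoch(now_epoch: int, retention_s: int) -> int:
    """Records with a timestamp lower than this value are older than the retention."""
    return now_epoch - retention_s


two_years = retention_seconds(2)
print(two_years)

# Hypothetical "now": 1435190400 is 2015-06-25 00:00:00 UTC.
print(cutoff_epoch(1435190400, two_years))
```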
To disable partitioning, we need to drop the triggers created:
All these changes need to be tested and adapted to fit your setup. Also, be careful and back up your database before applying them.
The web interface installation is quite easy; there are certain basic steps to execute. The web interface is completely written in PHP, so we need a web server that supports PHP; in our case, we will use Apache with the PHP support enabled.
The entire web interface is contained inside the php folder at frontends/php/, which we need to copy to our htdocs folder.
Use the following commands to copy the folders:
Note
Be careful: you might need proper rights and permissions, as all those files are owned by Apache, and they also depend on your httpd configuration.
The web wizard – frontend configuration
Now, from your web browser, you need to open the following URL:
The first screen that you will see is a welcome page; there is nothing to do there other than to click on Next. When on the first page, you may get a warning on your browser that informs you that the date/time zone is not set. This is the date.timezone parameter inside the php.ini file. All the possible time zones are described on the official PHP website at http://www.php.net/manual/en/timezones.php.
The parameter to change is date.timezone inside the php.ini file. If you don't know the current PHP configuration or where your php.ini file is located, and you need detailed information about which modules are running or the current settings, then you can write a file, for example, php-info.php, inside the Zabbix directory with the following content:
Now point your browser to http://your-zabbix-web-frontend/zabbix/php-info.php.
You will have your full configuration printed out on a web page. The following screenshot is more important; it displays a prerequisite check, and, as you can see, there is at least one prerequisite that is not met:
On a standard Red Hat/CentOS 6.6 installation, you only need to set the time zone; otherwise, if you're using an older version, you might have to change the following prerequisites that most likely are not fulfilled:
Most of these parameters are contained inside the php.ini file. To fix them, simply change the following options inside the /etc/php.ini file:
To solve the issue of the missing libraries, we need to install the following packages:
- php-xml
- php-bcmath
- php-mbstring
- php-gd
We will use the following command to install these packages:
The complete list of prerequisites is given in the following table:
Every time you change a php.ini file or install a PHP extension, the httpd service needs a restart to pick up the change. Once all the prerequisites are met, we can click on Next and go ahead. On the next screen, we need to configure the database connection. We simply need to fill out the form with the username, password, and IP address or hostname, and specify the kind of database server we are using, as shown in the following screenshot:
If the connection is fine (this can be checked with a test connection), we can proceed to the next step. Here, you only need to set the proper database parameters to enable the web GUI to create a valid connection, as shown in the following screenshot:
There is no check for the connection available on this page, so it is better to verify that it is possible to reach the Zabbix server from the network. In this form, it is necessary to fill in the Host (or IP address) of our Zabbix server. Since we are installing the infrastructure on three different servers, we need to specify all the parameters and verify that the Zabbix server port is reachable from outside the server.
Once we fill in this form, we can click on Next. After this, the installation wizard prompts us to view Pre-Installation summary, which is a complete summary of all the configuration parameters. If all is fine, just click on Next; otherwise, we can go back and change our parameters. When we go ahead, we see that the configuration file has been generated (for example, in this installation the file has been generated in /usr/share/zabbix/conf/zabbix.conf.php).
You may get an error instead of a success notification, and most probably, it is about the directory permissions on our conf directory at /usr/share/zabbix/conf. Remember to make the directory writable by the httpd user (normally, apache), at least for the time needed to create this file. Once this step is completed, the frontend is ready and we can perform our first login.
Capacity planning with Zabbix
Quite often, people confuse capacity planning with performance tuning. The scope of performance tuning is to optimize the system you already have in place for better performance. Capacity planning, using your current performance as a baseline, determines what your system needs and when it is needed. Here, we will see how to organize our monitoring infrastructure to achieve this goal and provide us with a useful baseline. Unfortunately, this chapter cannot cover all the aspects of this topic; capacity planning would deserve a whole book, but after this section, we will look at Zabbix with a different vision and will be aware of what to do with it.
Zabbix is a good monitoring system because it is really lightweight. Unfortunately, every observed system will spend a bit of its resources to run the agent that acquires and measures data and metrics against the operating system, so it is normal if the agent introduces a small (normally very small) overhead on the guest system. This is known as the observer effect. We can only accept this burden on our server and be aware that this will introduce a slight distortion in data collection, bearing in mind that we should keep it lightweight to a feasible extent while monitoring the process and our custom checks.
The Zabbix agent's job is to collect data periodically from the monitored machine and send metrics to the Zabbix server (that will be our aggregation and elaboration server). Now, in this scenario, there are certain important things to consider:
- What are we going to acquire?
- How are we going to acquire these metrics (the way or method used)?
- What is the frequency with which this measurement is performed?
Considering the first point, it is important to think about what should be monitored on our host and the kind of work that our host will do; or, in other words, what function it will serve.
There are some basic operating system metrics that are, nowadays, more or less standardized: the CPU workload, percentage of free memory, memory usage details, usage of swap, the CPU time for a process, and so on. All of these measures are built into the Zabbix agent.
Having a set of items with built-in measurement means that they are optimized to produce as little workload as possible on the monitored host; the whole of Zabbix's agent code is written in this way.
All the other metrics can be divided by the service that our server should provide.
Note
Here, templates are really useful! (Also, it is an efficient way to aggregate our metrics by type.)
As a practical example, when monitoring an RDBMS, it is fundamental to acquire:
- All the operating system metrics
- Different custom RDBMS metrics
Our different custom RDBMS metrics can be: the number of users connected, the use of cache systems, the number of full table scans, and so on.
All those kinds of metrics will be really useful and can be easily interpolated and compared against the same time period in a graph. Graphs have some strong points:
- They are easy to understand (also from the business side)
- They are easy to present and integrate into slides to reinforce our message
Coming back to our practical example, we are currently acquiring data from our RDBMS and our operating system. We can compare the workload of our RDBMS and see how it is reflected in the workload of our OS. What now?
Most probably, our core business is the revenue of a website, merchant site, or a web application. We assume that we need to keep a website in a three-tier environment under control because it is quite a common case. Our infrastructure will be composed of the following actors:
- A web server
- An application server
- The RDBMS
In real life, this is most probably the kind of environment in which Zabbix will be configured. We need to be aware that every component that can influence our service should be measured and stored inside our Zabbix monitoring system. It is quite normal for people with a strong system administration background to focus more on operating system-related items. Likewise, people who write Java code tend to concentrate on other, more obscure measures, such as the number of threads. The same reasoning applies when the capacity planner talks with a database administrator or a specialist from any other sector.
This is quite an important point because the Zabbix implementer should have a global vision and should remember that, when buying new hardware, the counterpart will most likely be a business unit.
This business unit very often doesn't know anything about the number of threads that our system can support but will only understand customer satisfaction, customer-related issues, and how many concurrent users we can successfully serve.
Having said that, it is really important to be ready to talk in their language, and we can do that only if we have certain efficient items to graph.
Now, if we look at the whole infrastructure from a client's point of view, we can think that if all our pages are served in a reasonable time, the browsing experience will be pleasant.
Our goal in this case is to make our clients happy and the whole infrastructure reliable. Now, we need to have two kinds of measures:
- The one felt from the user's side (the response time of our web pages)
- Infrastructure items related to it
We need to quantify the response time related to the user's navigation, and we need to know how long a user can wait in front of a web page to get a response, keeping in mind that the whole browsing experience needs to be pleasant. We can measure and categorize our metrics with these three levels of response time:
- 0.2 seconds: This gives the feel of an instantaneous response. The user feels that the browser's reaction was caused by his/her own action and not by a server running business logic.
- 1-2 seconds: The user feels that the browsing is continuous, without any interruption. The user can move freely rather than waiting for the pages to load.
- 10 seconds: The user's interest in our website will drop. The user will want better performance and can easily be distracted by other things.
Now, we have our thresholds and we can measure the response of a web page during normal browsing, and in the meantime, we can set a trigger level to warn us when the response time is more than two seconds for a page.
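The thresholds above can be expressed as a small classification helper. This Python sketch is only an illustration: the category labels, the treatment of the range between 2 and 10 seconds, and the function names are assumptions and do not correspond to any Zabbix feature or trigger syntax.

```python
def classify_response_time(seconds: float) -> str:
    """Map a page response time to the levels discussed above."""
    if seconds <= 0.2:
        return "instantaneous"
    if seconds <= 2.0:
        return "continuous"
    if seconds < 10.0:
        # Not described in the text; assumed here to be a degraded experience.
        return "degraded"
    return "unacceptable"


def should_warn(seconds: float, threshold: float = 2.0) -> bool:
    """True when the response time exceeds the warning threshold."""
    return seconds > threshold


for sample in (0.15, 1.4, 2.5, 12.0):
    print(sample, classify_response_time(sample), should_warn(sample))
```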
Now we need to relate that to all our other measures: the number of users connected, the number of sessions in our application server, and the number of connections to our database. We also need to relate all our measures to the response time and the number of users connected. Now, we need to measure how our system is serving pages to users during normal browsing.
This can be defined as a baseline. It is where we currently are and is a measure of how our system is performing under a normal load.
Now that we know where we are, and we have defined the threshold for our goal, along with the pleasant browsing experience, let's move forward.
We need to know where our limit is and, more importantly, how the system should reply to our requests. Since we can't hire a room full of people that can click on our website like crazy, we need to use software to simulate this kind of behavior. There is interesting open source software that does exactly this. There are different alternatives to choose from; one of them is Siege (https://www.joedog.org/2013/07/siege-3-0-3-url-encoding/).
Siege permits us to simulate a stored browser history and replay it against our server. We need to keep in mind that real users will never be synchronized with each other, so it is important to introduce a delay between requests. Remember that if we have a login, then we need to use a database of users because application servers cache their objects, and we don't want to measure how good the server is at caching them. The basic rule is to create a real browsing scenario against our website, so users who log in can log out with just a click and without any random delay.
The stored scenarios should be repeated x times with a growing number of users, meaning Zabbix will store our metrics, and, at a certain point, we will pass our first threshold (1-2 seconds per web page). We can go ahead until the response time reaches the value of our second threshold. There is no way to see how much load our server can take, but it is well known that appetite comes with eating, so I will not be surprised if you go ahead and load your server until it crashes one of the components of your infrastructure.
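Once the measurements of such a ramp-up are available, finding the load at which each threshold is crossed is straightforward. The following Python sketch assumes the results have been exported as pairs of concurrent users and average response time; the sample values and the function name are invented for illustration.

```python
from typing import Optional, Sequence, Tuple


def first_load_over(samples: Sequence[Tuple[int, float]], threshold: float) -> Optional[int]:
    """Smallest number of users whose response time exceeds the threshold.

    samples: (concurrent users, average response time in seconds) pairs.
    Returns None when the threshold is never exceeded.
    """
    for users, response_time in sorted(samples):
        if response_time > threshold:
            return users
    return None


# Hypothetical results of a load test with a growing number of users.
results = [(10, 0.15), (50, 0.4), (100, 0.9), (200, 2.3), (400, 11.0)]

print(first_load_over(results, 2.0))   # first threshold
print(first_load_over(results, 10.0))  # second threshold
```

The interval between the two values returned gives a rough idea of how much headroom is left before the browsing experience becomes unacceptable.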
Drawing graphs that relate the response time to the number of users on a server will help us to see whether our three-tier web architecture is linear or not. Most probably, it will grow in a linear pattern until a certain point. This segment is the one on which our system is performing fine. We can also see the components inside Zabbix, and from this point, we can introduce a kind of delay and draw some conclusions.
Now, we know exactly what to expect from our system and how the system can serve our users. We can see which component is the first that suffers the load and where we need to plan a tuning.
Capacity planning can be done without digging and going deep into what to optimize. As we said earlier, there are two different tasks—performance tuning and capacity planning—that are related, of course, but different. We can simply review our performance and plan our infrastructure expansion.
Tip
A planned hardware expansion is always cheaper than an unexpected, emergency hardware improvement.
We can also perform performance tuning, but be aware that there is a relation between the time spent and the performance obtained, so we need to understand when it is time to stop our performance tuning, as shown in the following graph:
One of the most important features of Zabbix is the capacity to store historical data. This feature is of vital importance during the task of predicting trends. Predicting our trends is not an easy task and is important considering the business that we are serving, and when looking at historical data, we should see whether there are repetitive periods or whether there is a sort of formula that can express our trend.
For instance, it is possible that the online web store we are monitoring needs more and more resources during a particular period of the year, for example, close to public holidays if we sell travel packages. As a practical example, consider the used space on a specific server disk. Zabbix gives us the export functionality to get our historical data, so it is quite easy to import it into a spreadsheet. Excel has a curve fitting option that will help us a lot. It is quite easy to find a trend line using Excel that will tell us when we are going to exhaust all our disk space. To add a trend line in Excel, we first need to create a "scatter graph" with our data; here, it is also important to graph the disk size. After this, we can try to find the mathematical equation that is closest to our trend. There are different kinds of formulae that we can choose from; in this example, I used a linear equation because the graphs are growing with a linear relation.
Note
The trend line process is also known as the curve fitting process.
The graph that comes out from this process permits us to know, with a considerable degree of precision, when we will run out of space.
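The same linear trend line can be computed outside a spreadsheet. The following Python sketch fits a least-squares line to a few data points, reports the R-squared value, and estimates when the disk will be full. The data points, the disk size of 200 GB, and the function names are invented for illustration; real values would come from the history exported from Zabbix.

```python
from datetime import date, timedelta


def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


def days_until_full(slope, intercept, capacity):
    """Day offset (from day 0) at which the fitted line reaches the capacity."""
    return (capacity - intercept) / slope


# Hypothetical samples: day offset and used space in GB.
days = [0, 1, 2, 3, 4]
used_gb = [100, 102, 104, 106, 108]

slope, intercept, r_squared = linear_fit(days, used_gb)
offset = days_until_full(slope, intercept, capacity=200)
first_day = date(2015, 1, 1)  # assumed date of the first sample

print(f"y = {slope:.2f} * x + {intercept:.2f}, R^2 = {r_squared:.3f}")
print("disk full on", first_day + timedelta(days=round(offset)))
```

An R-squared value close to 1 indicates that the linear model describes the data well; a low value suggests trying a different kind of curve.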
Now, it is clear how important it is to have a considerable amount of historical data, bearing in mind the business period and how it influences data.
Tip
It is important to keep track of the trend/regression line used and its formula, together with the R-squared value, so that it is possible to calculate with precision when the space will be exhausted, provided that the trend does not change.
The graph obtained is shown in the following screenshot, and from this graph, it is simple to see that if the trends don't change, we are going to run out of space on June 25, 2015: