How to generate value from data collections

Péter Ertner

Feb 10, 2023

Elasticsearch and its ecosystem are primarily optimized for analyzing large volumes of time series data and for free-text search. They do not replace traditional relational database managers or data warehouses. Instead, by loading data from those sources, they enable a different kind of real-time data analysis and make it possible to build an alerting system.

The two common use cases in this type of data analysis are the collection and management of unified technical and operational data, and the logging and analysis of business data. This two-part article will discuss these two areas from a business and project management perspective.

Collection and management of technical and operational data

By collecting operational and technical data, we can gather real-time information about the infrastructure and the applications running on it. Using these time series data, we can generate alerts, analyze problematic situations after the fact with deep-dive analysis, and make forecasts. By using the built-in dashboard component, we can create an interactive monitoring interface that can also be augmented with intuitive graphical components.

With the Elasticsearch ecosystem, many infrastructure components and application development and runtime frameworks can be connected to the data collection with simple configuration. These ready-made components load the data into the Elasticsearch database in JSON format, using a unified schema (Elastic Common Schema). In this way, data from different sources can be viewed on a common interface in Kibana and compared with each other over time.
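To make the idea of a unified schema more concrete, the following minimal sketch maps a raw log event onto a handful of Elastic Common Schema field names (`@timestamp`, `message`, `log.level`, `host.name`, `service.name`). The function name, host name, and service name are hypothetical and only illustrate the shape of such a document; the ready-made collectors produce far richer documents.

```python
import json
from datetime import datetime, timezone


def to_ecs_document(host_name: str, service_name: str, level: str,
                    message: str, timestamp: datetime) -> dict:
    """Map a raw log event onto a few Elastic Common Schema field names."""
    return {
        "@timestamp": timestamp.astimezone(timezone.utc).isoformat(),
        "message": message,
        "log": {"level": level.lower()},
        "host": {"name": host_name},
        "service": {"name": service_name},
    }


doc = to_ecs_document(
    host_name="app-server-01",     # hypothetical host
    service_name="order-service",  # hypothetical service
    level="ERROR",
    message="Payment gateway timeout",
    timestamp=datetime(2023, 2, 10, 8, 30, tzinfo=timezone.utc),
)
print(json.dumps(doc, indent=2))
```

Because every source fills the same field names, a single query or dashboard filter such as "all events where log.level is error" works across servers, applications, and network devices alike.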

Collection planning and implementation – Infrastructure elements

In the first step, the infrastructure elements of the existing systems and the frameworks in use (Java, Node.js, Oracle RDBMS, etc.) must be assessed. The available logging options must be identified and put in priority order.

The basic philosophy of Elasticsearch is to store information from different systems in one central database so that it can be compared across systems. Accordingly, a central Elasticsearch cluster must be set up. The database manager works in a distributed manner, so it can be scaled out easily as the amount of data increases.

The Elasticsearch ecosystem includes built-in log collection components for more than a hundred tools and frameworks. After installation and minimal configuration, these can collect information and load it into the central Elasticsearch. Java, Node.js, Python, and PHP applications, various application servers (Tomcat, WebLogic), and network devices (Cisco, Palo Alto) all have ready-made data collection components, and standard protocols such as OpenTelemetry are also supported.

The loaded information is stored in a uniform schema (Elastic Common Schema), so it can be viewed and managed immediately on the standard interfaces built into Kibana.

After analyzing the collected data, it is worth setting alarms in this phase and creating independent overview interfaces (dashboards) for each operational area. This can be done by using a graphical user interface in Kibana.

It makes sense to proceed iteratively, gradually increasing the number of components connected to the system.

Introduction of application performance measurement

After collecting the infrastructure log files, the next step is to monitor the internal operation of the applications by introducing APM (Application Performance Monitoring).

Elastic APM is a tool that helps you monitor the internal operation of your applications, from the user interface to the database or even between systems. This is possible because the monitoring data from the different components is stored in a comparable form in a common database.

With the help of Kibana's built-in interfaces, the loaded data can be used to analyze the course of transactions, the time distribution and the speed of requests and responses, and any problems that occur. In this case, you can also drill down into the data, so the tool is also suitable for root-cause analysis, which can help developers solve problems.
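The drill-down described above relies on every measured step of a request carrying the same transaction identifier. The following simplified sketch groups such steps by a shared identifier and reports which component consumed the most time in each transaction. The field names (`trace_id`, `component`, `duration_ms`) and the sample values are hypothetical and are not the actual Elastic APM data model.

```python
from collections import defaultdict
from typing import Dict, List


def slowest_component_per_trace(spans: List[dict]) -> Dict[str, str]:
    """Return, for each trace, the component with the largest total duration."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for span in spans:
        totals[span["trace_id"]][span["component"]] += span["duration_ms"]
    return {
        trace_id: max(components, key=components.get)
        for trace_id, components in totals.items()
    }


# Hypothetical measurements for two user requests.
spans = [
    {"trace_id": "t1", "component": "browser", "duration_ms": 40.0},
    {"trace_id": "t1", "component": "order-service", "duration_ms": 120.0},
    {"trace_id": "t1", "component": "database", "duration_ms": 310.0},
    {"trace_id": "t2", "component": "browser", "duration_ms": 35.0},
    {"trace_id": "t2", "component": "order-service", "duration_ms": 500.0},
    {"trace_id": "t2", "component": "database", "duration_ms": 20.0},
]

result = slowest_component_per_trace(spans)
print(result)
```

In this made-up data the first request is slowed down by the database and the second by the application service, which is the kind of distinction a root-cause analysis needs.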

The APM toolbox also includes a client-side component that runs in the browser (RUM, Real User Monitoring), which can be used to analyze the progress of individual requests and responses all the way from the browser to the database.

The tool can also receive data from widely used monitoring and service mesh tools, such as Prometheus or Istio, so network traffic can be included in the analysis as well.

As with the infrastructure logs, it is useful to proceed iteratively and connect the systems one by one.

Implementation of rules, alerts, and reports

Once data collection is in place and the systems have been monitored and tuned, we have enough information to define rules.

Rules can be created in the Elasticsearch database manager, and an alarm is generated whenever a rule is violated. Alarms can be delivered as email messages or as messages in common messaging applications such as Microsoft Teams, and they can also create a Jira issue. The graphical components on the Kibana interface can also indicate problematic situations with color and other highlights.
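As a rough illustration of what such a rule expresses, the following sketch evaluates a simple threshold rule over a window of measurements. The `ThresholdRule` class, the `evaluate` function, and the sample values are hypothetical; in practice the rules are configured in Kibana rather than written by hand like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThresholdRule:
    name: str
    field: str           # measurement the rule watches
    threshold: float     # value that must not be exceeded
    min_violations: int  # samples in the window that must exceed it to alert


def evaluate(rule: ThresholdRule, samples: List[dict]) -> List[str]:
    """Return alert messages for one evaluation window (empty if the rule holds)."""
    violations = [s for s in samples if s.get(rule.field, 0.0) > rule.threshold]
    if len(violations) >= rule.min_violations:
        return [
            f"ALERT {rule.name}: {len(violations)} of {len(samples)} "
            f"samples exceeded {rule.threshold}"
        ]
    return []


rule = ThresholdRule(name="high-cpu", field="cpu", threshold=0.9, min_violations=3)

busy_window = [{"cpu": 0.95}, {"cpu": 0.50}, {"cpu": 0.97}, {"cpu": 0.99}]
calm_window = [{"cpu": 0.30}, {"cpu": 0.95}, {"cpu": 0.40}, {"cpu": 0.20}]

alerts = evaluate(rule, busy_window)
print(alerts)
print(evaluate(rule, calm_window))
```

Requiring several violations within a window, rather than a single one, is one common way to avoid alarms for short, harmless spikes.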

The growing amount of data makes it necessary to introduce data lifecycle management at this point. Based on the configured rules, Elasticsearch can move data to cheaper hardware according to its age and delete outdated data as needed. Automatic aggregation can also be set up for old data, so trends can still be analyzed later, even after the individual records have been deleted. Summary interfaces (dashboards) created in Kibana can be sent regularly and automatically in PDF format to the configured email addresses, so they can also serve as executive summaries.
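The age-based part of the data lifecycle can be pictured as a small lookup from data age to storage phase. The sketch below is only an illustration of that logic: the phase names mirror those used by Elasticsearch index lifecycle management, while the age limits and the `phase_for_age` function are assumptions chosen for the example.

```python
from datetime import timedelta

# (phase, minimum age) pairs in ascending order; the ages are illustrative.
PHASES = [
    ("hot", timedelta(days=0)),      # newest data on the fastest hardware
    ("warm", timedelta(days=7)),     # older data on cheaper hardware
    ("cold", timedelta(days=30)),    # rarely queried data
    ("delete", timedelta(days=90)),  # outdated data is removed
]


def phase_for_age(age: timedelta) -> str:
    """Return the last phase whose minimum age the data has reached."""
    current = PHASES[0][0]
    for phase, min_age in PHASES:
        if age >= min_age:
            current = phase
    return current


for days in (3, 7, 45, 120):
    print(days, phase_for_age(timedelta(days=days)))
```

The actual limits are a business decision: they balance how long detailed data must remain quickly searchable against the cost of the hardware that holds it.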

Introducing machine learning

The Elasticsearch database manager has a built-in machine learning module. It can be used to analyze large amounts of time series data and to search for anomalies automatically. With it, in addition to the manually set rules and limits, we can automatically receive alarms when operating parameters become abnormal. Using the collected information, the machine learning module independently determines what parameters a system typically has in a given period and compares the current state with them. Regular fluctuations within a week or a day are recognized as normal, so they do not trigger false alarms.
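The idea of learning what is typical for a given period can be illustrated with a deliberately simple stand-in: a per-weekday, per-hour baseline with a z-score test. This is not the algorithm used by the Elasticsearch machine learning module; the function names, the synthetic load values, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float]
Baseline = Dict[Tuple[int, int], Tuple[float, float]]


def build_baseline(history: List[Sample]) -> Baseline:
    """Learn the mean and standard deviation for each (weekday, hour) slot."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline: Baseline, sample: Sample, z_threshold: float = 3.0) -> bool:
    """Flag a value that deviates strongly from what is typical for its slot."""
    ts, value = sample
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # nothing learned for this slot yet
    avg, std = baseline[slot]
    if std == 0:
        return value != avg
    return abs(value - avg) / std > z_threshold


# Four weeks of synthetic hourly load: high in weekday office hours, low otherwise.
start = datetime(2023, 1, 2)  # a Monday
history = []
for h in range(4 * 7 * 24):
    ts = start + timedelta(hours=h)
    busy = ts.weekday() < 5 and 9 <= ts.hour < 17
    wobble = (h // (7 * 24)) % 2  # small week-to-week variation
    history.append((ts, (100.0 if busy else 10.0) + wobble))

baseline = build_baseline(history)

daytime_peak = (datetime(2023, 1, 30, 10), 101.0)   # Monday 10:00, usual load
night_time_peak = (datetime(2023, 1, 30, 3), 100.0)  # Monday 03:00, same load
print(is_anomalous(baseline, daytime_peak))
print(is_anomalous(baseline, night_time_peak))
```

The same load is treated as normal at ten in the morning and as an anomaly at three at night, which a single fixed threshold could not express.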

Based on the collected information, the machine learning module can make predictions, so the critical points of the systems can be predicted.
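A very simple way to picture such a prediction is a straight-line trend fitted to past measurements and extended until it reaches a critical limit. The sketch below uses ordinary least squares for this purpose. It is not the forecasting method of the machine learning module, and the function names, the disk usage values, and the 90 percent limit are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(limit: float, days: List[float], usage: List[float]) -> Optional[float]:
    """Estimate days from the last measurement until the trend reaches the limit."""
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None  # usage is flat or shrinking; the limit is never reached
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical disk usage in percent, measured once a day.
days = [0.0, 1.0, 2.0, 3.0, 4.0]
usage = [50.0, 52.0, 54.0, 56.0, 58.0]

remaining = days_until_limit(90.0, days, usage)
print(remaining)
```

With usage growing by two percentage points a day, the sketch estimates 16 days until the 90 percent limit, which gives operations time to act before the critical point is reached.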

To be continued!