Fault Tolerance Features
Fault tolerance is built into every level of ELM Enterprise Manager, from the monitored system through to the database, making it one of the most robust log management and server monitoring solutions available. We take the collection, protection and validation of data seriously, and you will see that in the variety of approaches we’ve built into our products.
Agent Level
At the agent or monitored system level, ELM is designed with two forms of fault tolerance protection: caching and point-to-point verification.
Caching: When Service Agents are unable to connect to an ELM Server, they cache data until the connection is re-established, so that collection of all events configured for monitoring continues without interruption. The cache size can be configured as needed.
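As a rough illustration of the caching behavior described above, the following Python sketch buffers events while the server is unreachable and delivers them in order once the connection returns. It is not ELM's implementation: the `EventCache` class, the `send` callback, and the policy of discarding the oldest event when the cache is full are all assumptions made for the example.

```python
from collections import deque


class EventCache:
    """Hypothetical store-and-forward buffer; not ELM's actual implementation."""

    def __init__(self, max_events):
        # A deque with maxlen silently drops the oldest entry when full.
        # That overflow policy is an assumption made for this sketch.
        self._buffer = deque(maxlen=max_events)

    def submit(self, event, send):
        """Queue the event, then try to deliver everything oldest-first."""
        self._buffer.append(event)
        return self.flush(send)

    def flush(self, send):
        """Send buffered events in order; stop at the first connection failure."""
        delivered = 0
        while self._buffer:
            try:
                send(self._buffer[0])
            except ConnectionError:
                break  # server still unreachable, keep the rest cached
            self._buffer.popleft()  # remove only after a successful send
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._buffer)
```

An event is removed from the buffer only after `send` returns without raising, so a connection that drops part-way through a flush leaves the undelivered events cached for the next attempt.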
Point to Point Verification
ELM includes monitoring features that go beyond a simple PING status indicator. An Event Writer used in conjunction with Correlation Views can verify the complete cycle of Agents collecting events and sending them to the ELM Server as expected. If a predetermined stop or start event is not detected within a specified interval, actions such as a notification, a dashboard alert, or a restart script can be triggered.
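The interval check described above can be sketched as a simple watchdog. This is a minimal Python illustration under stated assumptions, not how Event Writers or Correlation Views are configured in ELM: the `HeartbeatMonitor` class, its `on_missed` callback, and the use of plain numeric timestamps in seconds are hypothetical.

```python
class HeartbeatMonitor:
    """Hypothetical watchdog for an expected, periodically written event."""

    def __init__(self, interval_seconds, start, on_missed):
        self.interval = interval_seconds
        self.last_seen = start      # treat the start time as the first sighting
        self.on_missed = on_missed  # e.g. notify, raise an alert, run a script

    def record(self, timestamp):
        """Call when the expected event arrives at the server."""
        self.last_seen = timestamp

    def check(self, now):
        """Return True if the event was seen within the interval.

        Otherwise invoke the configured action and return False.
        """
        if now - self.last_seen <= self.interval:
            return True
        self.on_missed(now)
        return False
```

Because the expected event has to travel the same path as real events, a missed interval points to a break somewhere in the collection cycle, even when the monitored system still answers a ping.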
Failover Database
ELM is deployed with both a Primary and a Failover database. The Primary database stores the most recent event, performance, SNMP and Syslog data.
The Failover database prevents loss of monitoring and alerting while the Primary is unavailable, for example during maintenance. Once a connection to the Primary database is re-established, data from the Failover automatically populates the Primary, merging seamlessly so that all views and reports perform as expected without gaps.
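The write-then-merge pattern described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `FailoverStore` class, the use of Python lists in place of real databases, and ordering the merged records by timestamp are all hypothetical and do not describe ELM's internal merge logic.

```python
class FailoverStore:
    """Hypothetical model of Primary/Failover routing; lists stand in for databases."""

    def __init__(self):
        self.primary = []
        self.failover = []
        self.primary_available = True

    def write(self, timestamp, record):
        """Route the record to whichever store is currently reachable."""
        target = self.primary if self.primary_available else self.failover
        target.append((timestamp, record))

    def restore_primary(self):
        """Mark the Primary reachable again and merge the Failover data into it."""
        self.primary_available = True
        merged = len(self.failover)
        self.primary.extend(self.failover)
        # Keep records in time order so later reads show no gaps (assumed policy).
        self.primary.sort(key=lambda row: row[0])
        self.failover.clear()
        return merged
```

In this model, a read after `restore_primary()` sees a single, time-ordered set of records, which is the property that lets views and reports run without gaps.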