# The Mystery of the Daily Server Crashes
## Introduction
In the world of systems administration, few things are more frustrating than recurring server crashes with no obvious cause. This is the story of a production mail server that began experiencing mysterious daily crashes, each requiring manual intervention and causing significant service disruption. What appeared to be a complex infrastructure failure turned out to be a classic case of misconfiguration - a critical service dependency that was referenced but never properly installed. This essay explores the technical investigation, root cause analysis, and ultimate resolution of a problem that highlights the importance of proper service configuration and comprehensive monitoring.
## The Problem: A Pattern of Instability
The server in question was experiencing a troubling pattern of crashes that suggested systemic issues rather than random failures. The reboot history painted a clear picture of deteriorating stability:
- Day 5: Manual reboot required after complete system freeze
- Day 3: Unexpected reboot
- Day 2: Unexpected reboot
- Day 1 (afternoon): System ran for only 20 hours before failure
- Day 1 (morning): System ran for only 11 hours before failure
The frequency and consistency of these crashes indicated that something fundamental was wrong with the server’s configuration or a critical service was failing repeatedly. The most recent incident was particularly severe: the server became completely unresponsive at approximately 6:13 PM, remaining in this frozen state for over 16 hours until it was manually rebooted the following morning.
What made this situation especially concerning was the nature of the freeze. The server didn’t crash with a kernel panic or run out of memory in a way that would trigger the Out-of-Memory (OOM) killer. Instead, it simply stopped responding to all requests, including local monitoring checks. The monitoring agent, which was running on the same physical machine as the monitoring server (both on localhost), reported that it could not communicate with itself - a clear indication that the system was so overloaded that even local inter-process communication had failed.
## The Investigation: Following the Evidence
The investigation began with a systematic examination of the server’s logs and configuration. The first step was to check for obvious causes: kernel panics, OOM kills, hardware errors, or disk failures. However, none of these common culprits were present. The kernel logs showed a normal boot sequence with no panics. There were no OOM killer messages in the system logs, no hardware errors reported by dmesg, and the disk subsystem appeared healthy.
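The kind of commands used for this first pass would look roughly like the following (a sketch; exact log paths and journal options vary by distribution):

```
# Kernel messages from the previous boot (systemd journal)
journalctl -k -b -1 | tail -n 100

# Any OOM-killer activity in the syslog-style logs
grep -iE "out of memory|oom-killer" /var/log/syslog /var/log/kern.log

# Errors and warnings in the current kernel ring buffer
dmesg --level=err,warn

# SMART health of a disk (device name is an example)
smartctl -H /dev/sda
```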
The breakthrough came when examining the system logs from the hours leading up to the most recent crash. Starting at 6:00 PM, a pattern emerged that repeated every few seconds:
```
Nov 18 18:00:09 mail-server postfix/smtpd[632203]: warning: connect to Milter service inet:localhost:8891: Connection refused
Nov 18 18:00:16 mail-server postfix/smtpd[630076]: warning: connect to Milter service inet:localhost:8891: Connection refused
```

These messages continued relentlessly throughout the next 30 minutes and likely beyond, though the logs stopped being written as the system became increasingly unresponsive. The errors indicated that Postfix, the mail transfer agent, was attempting to connect to a “milter” service on port 8891 but was consistently being refused.
At 6:13 PM - just 13 minutes after these errors began appearing - the monitoring system reported a critical alert: “Agent is not available (for 3m).” This timing was not coincidental. The monitoring agent had become unreachable because the server was so overloaded that it could no longer respond to monitoring queries, even from localhost.
## Understanding the Root Cause: The Milter Problem
To understand what was happening, it’s essential to understand what a milter is and how Postfix uses it. A “milter” (mail filter) is a protocol and service that allows external programs to inspect and modify email messages as they pass through a mail server. Common uses include virus scanning, spam filtering, and content policy enforcement. ClamAV, a popular open-source antivirus solution, provides a milter service that integrates with Postfix to scan all incoming and outgoing email for malware.
The Postfix configuration on the affected server included these critical settings:
```
smtpd_milters = inet:localhost:8891
non_smtpd_milters = inet:localhost:8891
milter_connect_timeout = 30s
```

These settings told Postfix to route all mail through a milter service listening on TCP port 8891 on localhost, and to wait up to 30 seconds for the milter to respond before timing out. The problem was that no service was actually listening on port 8891. When the command `netstat -tulpn | grep 8891` was executed, it returned nothing - the port was completely unused.
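That check is easy to reproduce; a minimal sketch using standard Postfix and iproute2 tooling (`ss` shown as a modern alternative to `netstat`):

```
# Show the milter-related settings Postfix is actually using
postconf smtpd_milters non_smtpd_milters milter_connect_timeout

# Confirm whether anything is listening on the configured port
ss -ltnp | grep 8891 || echo "nothing listening on port 8891"
```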
This created a cascading failure scenario (a few commands for observing this kind of build-up are sketched after the list):

1. Initial Connection Attempt: When an email arrived, Postfix would attempt to connect to the milter on port 8891.
2. Connection Refused: Because no service was listening, the connection was immediately refused by the operating system.
3. Retry and Timeout Logic: Despite the immediate refusal, Postfix’s milter handling code would wait for the configured timeout period (30 seconds) before giving up on each connection.
4. Connection Accumulation: With continuous incoming mail traffic - including legitimate email, authentication attempts from customers, and unfortunately, a sustained brute-force attack from multiple IP addresses - new connections were arriving faster than old ones were timing out.
5. Resource Exhaustion: Each hanging connection consumed system resources including file descriptors, memory, and process slots. As these accumulated over minutes and hours, the server’s available resources dwindled.
6. System Overload: Eventually, the server reached a state where it had so many hung processes and exhausted so many resources that it could no longer handle even basic operations. This included the monitoring agent’s attempts to communicate with the monitoring server on localhost, leading to the monitoring alert.
7. Complete Freeze: In the final stage, the server became completely unresponsive, unable to accept new SSH connections, unable to write logs, and unable to perform any meaningful work.
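As noted above, this kind of build-up is visible with ordinary process and socket tools while the system is still responsive. A minimal sketch (the process name reflects Postfix's smtpd workers):

```
# Count smtpd worker processes; a steadily climbing number is a red flag
ps -C smtpd --no-headers | wc -l

# Socket summary: watch for growing TCP connection totals
ss -s

# System-wide file-descriptor pressure (allocated vs. maximum)
cat /proc/sys/fs/file-nr

# Load average trend
uptime
```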
The reason this happened daily and with somewhat predictable timing relates to the server’s activity patterns. The server ran various scheduled tasks (cron jobs) and experienced natural traffic patterns throughout the day. The crashes tended to occur during periods of higher activity - around 6:00 PM in several cases - when the combination of scheduled tasks, regular mail traffic, and the accumulation of hung milter connections reached a critical threshold.
## The Deeper Configuration Problem
Further investigation revealed why the milter service wasn’t running despite being referenced in the Postfix configuration. The ClamAV daemon (clamd) was installed and had been running, but the ClamAV milter service itself was not installed. This is a common misconfiguration that can occur during server setup, particularly when:
- Configuration management tools or setup scripts copy configurations from other servers without verifying all dependencies
- Services are upgraded or migrated without ensuring all components are properly installed
- Documentation is followed incompletely, missing critical installation steps
When the ClamAV milter package was finally installed, it revealed another layer of misconfiguration. The default configuration had the milter listening on a Unix socket (/var/run/clamav/clamav-milter.ctl) rather than a TCP port. This meant that even after installation, the milter still wouldn’t be accessible on port 8891 as Postfix expected.
The milter logs confirmed this misconfiguration:
```
WARNING: No clamd server appears to be available
```

## The Solution: Proper Service Configuration
The resolution involved several carefully sequenced steps, each addressing a specific aspect of the problem:
### Step 1: Immediate Stabilization
The first priority was to prevent further crashes while the root cause was being addressed. This was accomplished by temporarily disabling the milter integration in Postfix:
```
postconf -e "smtpd_milters ="
postconf -e "non_smtpd_milters ="
systemctl reload postfix
```

These commands removed the milter configuration from Postfix, allowing mail to flow without attempting to connect to the non-existent service. This immediately stabilized the server, though it left mail unscanned for viruses temporarily.
### Step 2: Installing and Starting ClamAV Services
With the server stabilized, the next step was to properly install all required components:
```
apt install clamav-milter clamav-daemon
systemctl enable clamav-daemon
systemctl start clamav-daemon
```

The ClamAV daemon started successfully and began loading its virus definition database. This process can take 30-60 seconds as the daemon initializes and prepares to scan mail.
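A brief sanity check at this point is worthwhile (a sketch; the socket path follows Debian/Ubuntu packaging defaults and may differ elsewhere):

```
# Is the daemon running and enabled?
systemctl is-active clamav-daemon
systemctl is-enabled clamav-daemon

# The local clamd socket appears once the signature database has loaded
ls -l /var/run/clamav/clamd.ctl

# Recent daemon log lines, including database load messages
journalctl -u clamav-daemon --no-pager -n 20
```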
### Step 3: Configuring the Milter for TCP
The ClamAV milter configuration file (/etc/clamav/clamav-milter.conf) needed to be modified to listen on TCP port 8891 instead of a Unix socket. The critical change was:
Before:

```
MilterSocket /var/run/clamav/clamav-milter.ctl
```

After:

```
MilterSocket inet:8891@localhost
```

This configuration change told the milter to bind to TCP port 8891 on the localhost interface, exactly as Postfix expected. After making this change, the milter service was restarted:

```
systemctl restart clamav-milter
```

Verification confirmed the service was now properly listening:
```
netstat -tulpn | grep 8891
tcp    0    0 127.0.0.1:8891    0.0.0.0:*    LISTEN    95325/clamav-milter
```

### Step 4: Re-enabling Milter Integration
With the milter service now properly configured and running, Postfix could be reconfigured to use it:
```
postconf -e "smtpd_milters = inet:localhost:8891"
postconf -e "non_smtpd_milters = inet:localhost:8891"
systemctl reload postfix
```
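With the integration re-enabled, a brief end-to-end check is worthwhile (a sketch; the mail log path follows Debian/Ubuntu conventions):

```
# Confirm Postfix picked up the milter settings
postconf smtpd_milters non_smtpd_milters

# Watch for milter-related warnings as new mail arrives; silence is a good sign
tail -f /var/log/mail.log | grep -i milter
```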
### Step 5: Ensuring Persistence

To prevent the problem from recurring after a server reboot, both ClamAV services were enabled to start automatically:
```
systemctl enable clamav-daemon
systemctl enable clamav-milter
```

This ensured that future reboots would bring up the services in the correct order with the correct configuration.
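A quick confirmation that both units will start at boot (a minimal check):

```
# Both should report "enabled"
systemctl is-enabled clamav-daemon clamav-milter
```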
## Implementing Preventive Measures
Fixing the immediate problem was only part of the solution. To prevent similar issues in the future and to detect problems before they cause crashes, several monitoring and alerting improvements were implemented:
### Enhanced System Monitoring
Three complementary monitoring systems were put in place:
1. sysstat: This system activity reporter was enabled to collect historical data about CPU usage, memory consumption, disk I/O, and network activity at regular intervals. This data is invaluable for post-incident analysis and for identifying trends that might predict future problems.

2. atop: This advanced performance monitor captures detailed system and process-level information, including which processes are consuming resources, what state they’re in, and how they’re interacting with the system. Unlike basic monitoring tools, atop can show historical data and help reconstruct what happened during a period of system stress.

3. Custom Freeze Detection Script: A simple but effective monitoring script was created to run every 5 minutes via cron:

   ```
   #!/bin/bash
   LOGFILE="/var/log/freeze-check.log"
   date >> "$LOGFILE"
   echo "Load: $(cat /proc/loadavg)" >> "$LOGFILE"
   echo "Memory:" >> "$LOGFILE"
   free -m >> "$LOGFILE"
   echo "Top CPU:" >> "$LOGFILE"
   ps aux --sort=-%cpu | head -5 >> "$LOGFILE"
   echo "Top MEM:" >> "$LOGFILE"
   ps aux --sort=-%mem | head -5 >> "$LOGFILE"
   echo "---" >> "$LOGFILE"
   ```

   This script logs key system metrics that can help identify what was happening before a freeze. Because it runs frequently, it can capture the system state even if the freeze prevents manual investigation.
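The cron entry driving the script might look like the following (the script path is an assumed location used here for illustration):

```
# /etc/cron.d/freeze-check - run the freeze-check script every 5 minutes
# (the path /usr/local/bin/freeze-check.sh is an assumption in this sketch)
*/5 * * * * root /usr/local/bin/freeze-check.sh
```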
### Service Health Monitoring
The monitoring configuration was enhanced to include:
- Alerts for service failures (when clamav-daemon or clamav-milter stop)
- Alerts for high connection counts (indicating potential issues)
- Alerts for sustained high load averages
- Alerts for memory usage exceeding 90%
### Documentation
Complete documentation was created covering:
- The service dependencies between Postfix and ClamAV
- The correct configuration for both services
- Troubleshooting procedures if mail flow issues occur
- The location and interpretation of relevant log files
## Lessons Learned
This incident provides several valuable lessons for systems administrators and DevOps engineers:
### 1. Service Dependencies Must Be Explicit and Verified
Configuration management should include explicit checks that all service dependencies are installed and running. In this case, Postfix was configured to depend on a milter service that wasn’t actually present. Modern infrastructure-as-code tools should include validation steps that verify not just that configuration files are correct, but that the services they reference actually exist and are functioning.
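As a concrete illustration, a post-deployment validation step along these lines (a sketch, assuming a Debian-family system and the package involved in this incident) would have caught the missing component:

```
#!/bin/bash
# Validate that the milter Postfix references is actually installed.
set -e

# The package expected to provide the milter (Debian/Ubuntu naming)
dpkg -s clamav-milter > /dev/null 2>&1 || { echo "clamav-milter is not installed"; exit 1; }

# The milter endpoints Postfix is configured to use, printed for comparison
# against the services that are actually running
postconf -h smtpd_milters non_smtpd_milters
```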
### 2. Monitoring Must Cover Service-to-Service Communication
It’s not enough to monitor that individual services are running. Monitoring should also verify that services can communicate with their dependencies. In this case, a simple check attempting to connect to port 8891 would have immediately identified the problem.
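For example, a minimal dependency probe suitable for cron or a monitoring hook (a sketch; the host and port reflect this incident's milter):

```
#!/bin/bash
# Exit 0 if the milter port accepts TCP connections, non-zero otherwise,
# so a monitoring system can alert on the exit status.
if timeout 3 bash -c 'exec 3<>/dev/tcp/127.0.0.1/8891' 2>/dev/null; then
    echo "milter port 8891 reachable"
else
    echo "milter port 8891 unreachable"
    exit 1
fi
```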
### 3. Timeout Settings Can Hide or Exacerbate Problems
The 30-second timeout for milter connections meant that the problem manifested as a slow accumulation of hung processes rather than immediate, obvious failures. Shorter timeouts might have prevented the complete system freeze, though they would have still resulted in mail delivery issues. The lesson is that timeout values should be chosen carefully based on expected service behavior and system capacity.
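For reference, Postfix exposes several milter timeout knobs; tightening them is one way to limit how long a broken milter can tie up an smtpd process (the value below is illustrative, not a universal recommendation):

```
# Inspect the current milter timeouts
postconf milter_connect_timeout milter_command_timeout milter_content_timeout

# Example: shorten the connect timeout
postconf -e "milter_connect_timeout = 10s"
systemctl reload postfix
```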
### 4. Graceful Degradation Is Important
Postfix’s milter_default_action setting was set to accept, meaning that if the milter was unavailable, mail would be accepted anyway. While this prevented mail from being rejected, it also meant that the system tried to contact the non-existent milter for every message rather than failing fast and moving on. A more sophisticated approach might detect repeated failures and temporarily disable the failing dependency.
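The relevant setting can be inspected and changed as follows (shown for illustration; accept, tempfail, reject, and quarantine are the values Postfix documents for this parameter):

```
# Show the current failure-handling policy for milters
postconf milter_default_action

# Example: defer mail instead of accepting it unscanned when the milter is down
postconf -e "milter_default_action = tempfail"
systemctl reload postfix
```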
### 5. Log Analysis Is Essential
The recurring nature of the crashes provided valuable data. By examining logs from multiple crash events and looking for patterns, the investigation could focus on events that consistently preceded failures. This pattern-matching approach is often more valuable than looking at a single incident in isolation.
### 6. Test in Stages
The resolution was implemented in stages, with verification at each step:
- First, stabilize the server (disable milter)
- Then, install the missing service
- Next, configure it correctly
- Finally, re-enable the integration
This staged approach meant that if something went wrong, it would be clear which change caused the problem. It also allowed for verification that each component was working before moving to the next step.
## Conclusion
This case illustrates how a seemingly simple misconfiguration - a missing mail filter service - can cause complete system failures through resource exhaustion. The problem went undetected for an extended period because it manifested as accumulated resource consumption rather than an immediate, obvious error.
The resolution required not just fixing the immediate technical problem, but implementing comprehensive monitoring to prevent recurrence and detect similar issues in the future. The incident highlights the importance of:
- Proper service configuration management
- Comprehensive dependency checking
- Proactive monitoring and alerting
- Thorough documentation
- Systematic troubleshooting methodologies
Most importantly, this case demonstrates that even complex, recurring system failures often have straightforward root causes. Patient, methodical investigation following the evidence from logs and monitoring systems can uncover these causes and lead to permanent solutions.
The server has been stable since the fix was implemented, with no crashes observed. Mail is being properly scanned for viruses, monitoring systems are capturing detailed metrics, and the documentation will help prevent similar issues in the future. What began as a frustrating pattern of daily crashes was ultimately resolved through careful analysis, proper configuration, and attention to the fundamentals of system administration.
This incident serves as a reminder that in systems administration, the most sophisticated tools and architectures can be undermined by simple configuration oversights. Success lies not just in deploying complex systems, but in ensuring that every component, every dependency, and every integration point is properly configured, monitored, and documented.