# The Intermittent SSL Certificate Error: A 3-Hour Debugging Journey
## The Problem That Shouldn’t Exist
It started with a ticket from our payment gateway provider. Users were experiencing intermittent SSL certificate errors when trying to access one of our reservation systems. The frustrating part? When I tested it myself, sometimes it worked perfectly, and sometimes I got a “certificate has expired” error. Refresh the page - it works. Refresh again - error. This was going to be one of those days.
## First Attempt: Check the Obvious
My first instinct was to check the actual certificate on the HAProxy server. I ran:
```bash
openssl x509 -in /etc/ssl/private/xxxxxxxx.nl.pem -noout -dates
```

Result: Certificate valid until February 2026. Everything looked fine.
“Maybe it’s cached somewhere?” I thought. So I reloaded HAProxy:
```bash
systemctl reload haproxy
```

Tested again. Still intermittent. Half the requests worked, half failed with an expired certificate error showing it expired on November 16, 2025.
## Second Attempt: The Backend Servers
Since HAProxy proxies requests to backend servers, maybe one of them had an expired certificate? I checked both web servers (web1 at 10.0.0.2 and web2 at 10.0.0.6):
```bash
openssl s_client -connect 10.0.0.2:443 -servername xxxxxxxx.nl
openssl s_client -connect 10.0.0.6:443 -servername xxxxxxxx.nl
```

Result: Both showed self-signed certificates valid until 2035. Definitely not the problem.
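One practical note that isn't from the original session: the interactive `openssl s_client` form above sits waiting for input on stdin. For scripting these checks, a non-interactive variant along these lines is more convenient:

```bash
# Close stdin immediately and print only the validity dates of the presented certificate.
echo | openssl s_client -connect 10.0.0.2:443 -servername xxxxxxxx.nl 2>/dev/null \
  | openssl x509 -noout -dates
```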
At this point, I was confused. The certificates were all valid, but users were definitely seeing expired certificates.
## Third Attempt: Maybe It’s Cached in HAProxy’s Memory?
HAProxy has an admin socket that lets you query its internal state. I checked what certificates it had loaded:
echo "show ssl cert /etc/ssl/private/xxxxxxxx.nl.pem" | socat stdio /run/haproxy/admin.sockResult: HAProxy showed it had loaded the valid certificate (expires February 2026). No expired certificate in sight.
“Maybe HAProxy just needs a hard restart?” I thought. Reloading wasn’t enough, so I did a full restart:
```bash
systemctl restart haproxy
```

Waited a minute. Tested again. Still intermittent! This made no sense. A restart should have cleared any cached state.
## Fourth Attempt: Nuclear Option - Clear Everything
At this point, I was getting desperate. I stopped HAProxy completely and cleared all its cache directories:
```bash
systemctl stop haproxy
rm -rf /var/lib/haproxy/*
rm -rf /run/haproxy/*
systemctl start haproxy
```

Surely this would fix it? Fresh start, clean slate, no cached anything.
Tested 20 times in a row. Still got the expired certificate on about 10 of those attempts.
I was losing my mind. How could a completely fresh HAProxy instance, with only valid certificates on disk, be serving an expired certificate that I couldn’t find anywhere?
## Fifth Attempt: Deep Filesystem Search
Maybe there was an old certificate file somewhere that HAProxy was reading? I searched the entire filesystem:
```bash
find /etc -name "*.pem" -type f 2>/dev/null | while read cert; do
  if openssl x509 -in "$cert" -noout -enddate 2>/dev/null | grep -q "Nov 16"; then
    echo "FOUND: $cert"
  fi
done
```

Result: Found several expired certificates in /etc/letsencrypt/archive/ directories for OTHER domains, but nothing for the domain in question. And besides, those archive files shouldn’t even be read by HAProxy - it only reads from /etc/ssl/private/.
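In hindsight, there's a check I skipped at this point that would have shown the expired certificate didn't correspond to anything on disk at all: compare the SHA-256 fingerprint of the file HAProxy is configured to use with the fingerprint of whatever is actually presented on the wire. A sketch, using the same paths as above:

```bash
# Fingerprint of the certificate file HAProxy is configured to serve...
openssl x509 -in /etc/ssl/private/xxxxxxxx.nl.pem -noout -fingerprint -sha256
# ...versus the fingerprint of the certificate actually presented on the wire.
echo | openssl s_client -connect xxxxxxxx.nl:443 -servername xxxxxxxx.nl 2>/dev/null \
  | openssl x509 -noout -fingerprint -sha256
```

During this incident the two would have matched on some runs and mismatched on others, which is itself a strong hint that more than one thing is answering on port 443.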
## The Pattern That Changed Everything
After nearly 3 hours of debugging, I decided to test more systematically. I ran 20 consecutive tests and logged the results:
```bash
for i in {1..20}; do
  echo -n "Test $i: "
  openssl s_client -connect xxxxxxxx.nl:443 < /dev/null 2>&1 | openssl x509 -noout -enddate
done
```
```
Test 1: notAfter=Nov 16 12:53:48 2025 GMT
Test 2: notAfter=Feb 23 11:04:34 2026 GMT
Test 3: notAfter=Feb 23 11:04:34 2026 GMT
Test 4: notAfter=Nov 16 12:53:48 2025 GMT
Test 5: notAfter=Feb 23 11:04:34 2026 GMT
Test 6: notAfter=Nov 16 12:53:48 2025 GMT
Test 7: notAfter=Feb 23 11:04:34 2026 GMT
Test 8: notAfter=Nov 16 12:53:48 2025 GMT
Test 9: notAfter=Nov 16 12:53:48 2025 GMT
Test 10: notAfter=Feb 23 11:04:34 2026 GMT
Test 11: notAfter=Nov 16 12:53:48 2025 GMT
Test 12: notAfter=Feb 23 11:04:34 2026 GMT
Test 13: notAfter=Nov 16 12:53:48 2025 GMT
Test 14: notAfter=Nov 16 12:53:48 2025 GMT
Test 15: notAfter=Feb 23 11:04:34 2026 GMT
Test 16: notAfter=Nov 16 12:53:48 2025 GMT
Test 17: notAfter=Feb 23 11:04:34 2026 GMT
Test 18: notAfter=Nov 16 12:53:48 2025 GMT
Test 19: notAfter=Feb 23 11:04:34 2026 GMT
Test 20: notAfter=Feb 23 11:04:34 2026 GMT
```

The results showed a clear pattern:
- 10 requests: Valid certificate (expires Feb 23 2026)
- 10 requests: Expired certificate (expired Nov 16 2025)
It was exactly 50/50. That’s too consistent to be random. This felt like… load balancing?
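As a side note, a one-liner like this (not something I ran at the time) produces the same tally in one shot:

```bash
# Count how often each distinct notAfter value shows up across 20 connections.
for i in {1..20}; do
  echo | openssl s_client -connect xxxxxxxx.nl:443 2>/dev/null | openssl x509 -noout -enddate
done | sort | uniq -c
```

Two distinct notAfter values at roughly equal counts is exactly the fingerprint of two different certificate states sitting behind one endpoint.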
Then I tested the backend servers directly, 20 times each:
```bash
# Test web1
for i in {1..20}; do
  openssl s_client -connect 10.0.0.2:443 -servername xxxxxxxx.nl < /dev/null 2>&1 \
    | openssl x509 -noout -enddate
done
```

Both backend servers returned valid certificates 100% of the time.
So the backends were fine, but HAProxy was serving expired certificates 50% of the time. But HAProxy’s configuration was fine. But restarting HAProxy didn’t help. But clearing its cache didn’t help.
What distributes requests 50/50? Load balancers do… but wait, HAProxy is the load balancer. Unless…
## The Breakthrough: What If There Are Two?
A crazy thought occurred to me. What if there were two HAProxy processes running? Maybe when I “restarted” it earlier, the old process never actually stopped?
```bash
ps aux | grep haproxy
```

THERE IT WAS:
```
haproxy  3996455  0.0  2.9  172700  57328 ?  Ssl  Nov17  8:15  /usr/sbin/haproxy ...
haproxy  4130413  0.1  0.7  121796  15184 ?  Sl   13:15  0:00  /usr/sbin/haproxy ...
```

Two HAProxy processes! One started on November 17 (8 days ago), and one started today at 13:15 when I did my “restart.”
Let me check the port bindings:
```bash
ss -tlnp | grep :443
```

Both processes were listening on port 443!
```
LISTEN 0 5000 0.0.0.0:443 0.0.0.0:* users:(("haproxy",pid=4130413,fd=11))
LISTEN 0 5000 0.0.0.0:443 0.0.0.0:* users:(("haproxy",pid=3996455,fd=7))
```

## Why This Happened
This explained everything:
- The old process (PID 3996455) started on November 17 and had been running for over a week. At some point, it had loaded and cached expired certificates.
- The new process (PID 4130413) started today when I “restarted” HAProxy, and it had loaded fresh, valid certificates.
- Both were accepting connections on port 443 thanks to the `SO_REUSEPORT` socket option, which allows multiple processes to bind to the same port.
- The kernel was load-balancing between them - roughly 50% of connections went to the old process (expired cert), and 50% went to the new process (valid cert). A quick way to spot this situation is sketched right after this list.
- Why `systemctl restart` didn’t help: When systemd “restarted” HAProxy, it started a new process but failed to stop the old one. So instead of replacing the old process, it just added a new one alongside it.
- Why I couldn’t find the expired certificate: All my checks looked at the filesystem and the NEW process. The expired certificate existed only in the OLD process’s memory, loaded weeks ago from certificates that had since been renewed on disk.
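To make that concrete, here is a small check (my addition, not something I ran during the incident) that lists every PID holding a listening socket on :443 together with its start time. With two SO_REUSEPORT listeners like the ones above, the two processes and their very different start dates jump out immediately:

```bash
# List every PID that has a listening socket on :443, with its start time and command line.
for pid in $(ss -tlnp | awk '$4 ~ /:443$/' | grep -oP 'pid=\K[0-9]+' | sort -u); do
  ps -o pid,lstart,cmd -p "$pid" | tail -n +2
done
```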
## The Fix
One simple command:
```bash
kill 3996455
```

I killed the old process. Tested 20 more times. 100% success. Every single request now got the valid certificate.
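If you'd rather not eyeball PIDs, here is a rough cleanup sketch, under the assumption that the systemd-managed instance is the one you want to keep: flag any haproxy process that is neither systemd's MainPID nor one of its direct children (worker processes are children of the master).

```bash
# Flag haproxy processes that don't belong to the systemd-managed instance.
MAIN=$(systemctl show -p MainPID --value haproxy)
for pid in $(pgrep -x haproxy); do
  ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
  if [ "$pid" != "$MAIN" ] && [ "$ppid" != "$MAIN" ]; then
    echo "Stray haproxy process: $pid (started $(ps -o lstart= -p "$pid"))"
  fi
done
```

In this incident it would have flagged PID 3996455 and left the freshly started, systemd-managed process alone.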
## Why My Debugging Methods Failed
Let me break down why each debugging attempt missed the issue:
- Checking certificates on disk - The files were fine. The problem was an old process with stale certificates in memory.
- Reloading HAProxy - Only reloaded the new process, didn’t touch the old one.
- Checking backend servers - The backends were never the problem.
- Querying HAProxy’s admin socket - I was querying the NEW process’s socket, which had valid certificates. The old process had its own socket.
- Restarting HAProxy - Started a new process but failed to stop the old one.
- Clearing cache directories - Only affected the new process, not the old one.
- Searching the filesystem - The expired certificate existed only in the old process’s memory, not on disk.
The one thing I didn’t do early enough was check the process list. I assumed `systemctl status haproxy` would show me everything, but it only showed the systemd-managed process (the new one). The old process was still running independently.
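A concrete way to make that comparison (again, something I'd add now rather than something I ran at the time): put what systemd reports next to what the process table reports, and look for haproxy PIDs that appear only in the latter.

```bash
# What systemd believes is running for this unit...
systemctl status haproxy --no-pager
# ...versus every haproxy process actually on the box, with start times.
ps -eo pid,lstart,cmd | grep '[h]aproxy'
```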
## Root Cause
This situation likely occurred weeks ago when someone (possibly me) tried to restart HAProxy, but the restart command failed to properly stop the old process. Instead of replacing it, systemd started a new one. For weeks, both processes were running, but the expired certificates weren’t yet expired, so nobody noticed.
Then on November 16, some certificates expired. The old process continued serving those expired certificates from memory, while the new process (started later with automatic renewals) served valid certificates. The 50/50 split made it appear random.
## Lessons Learned
- When behavior is probabilistic, look for multiple instances - Random or probabilistic failures (like our 50/50 split) often indicate multiple processes or servers handling requests with different configurations or states.
- Don’t trust service managers blindly - `systemctl status haproxy` showed “active (running)” but didn’t reveal the duplicate process. Always verify with `ps aux` and `ss -tlnp`.
- Port conflicts can be hidden - Both processes binding to the same port didn’t cause an obvious error because of `SO_REUSEPORT`. This socket option allows multiple processes to share a port, which is useful for zero-downtime restarts but can mask issues like this.
- Check process start times - The `ps aux` output showed one process had been running for days (Nov17), while the other was just hours old. This time discrepancy was the smoking gun.
- Cached state isn’t always on disk - I spent a lot of time searching filesystems and clearing cache directories, but the “cached” expired certificate existed only in the old process’s memory.
- Systematic testing reveals patterns - Running 20 tests in a row and logging the results revealed the 50/50 pattern, which was the key insight that led me to suspect multiple processes.
## Prevention
I’ve added monitoring to alert if multiple HAProxy processes are detected:
```bash
#!/bin/bash
# Check for multiple HAProxy worker processes
COUNT=$(pgrep -c haproxy)
if [ "$COUNT" -gt 2 ]; then  # Master + 1 worker is normal
  echo "WARNING: Multiple HAProxy processes detected!"
  ps aux | grep haproxy
fi
```

This runs every 5 minutes via cron and will send an alert if it detects the issue.
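For reference, a cron entry along these lines runs it every five minutes. The script path and the `logger` call are my assumptions rather than details from my actual setup; wire the output into whatever alerting you already have.

```
# /etc/cron.d/haproxy-check (hypothetical path and script location)
*/5 * * * * root /usr/local/bin/check_haproxy_procs.sh 2>&1 | logger -t haproxy-check
```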
## Technical Environment
- Load balancer: HAProxy 2.x
- Backend: Nginx with self-signed certificates (internal communication)
- Frontend: Let’s Encrypt certificates (public-facing)
- OS: Ubuntu Server 24.04
- Architecture: HAProxy → 2 backend Nginx servers
## Final Thoughts
This was one of those bugs that taught me more than a hundred successful deployments. The solution was simple (one kill command), but finding it required:
- Systematic testing to identify the pattern
- Questioning assumptions (like “systemctl restart works correctly”)
- Checking low-level details (process lists, port bindings)
- Persistence (3 hours of debugging)
Sometimes the hardest bugs to debug are the ones where your tools are lying to you - not because they’re broken, but because you’re asking them the wrong questions.
Have you ever encountered a similar issue with duplicate processes? Or spent hours debugging something that turned out to be embarrassingly simple? Share your stories in the comments - misery loves company!