Multiple PeopleFluent systems down

Minor incident PeopleFluent Hosted LMS
07-05-2025 13:16 CEST · 4 weeks, 20 hours, 10 minutes

Updates

Resolved

As previously communicated, the systems have been fully operational for some time now. Below is an update regarding the root cause of the issue and the measures taken to prevent recurrence.

Network Incident – Datacenter, May 6 & 15, 2025

Incident

During scheduled maintenance on May 6, 2025, at 11:00 PM CEST in the Rotterdam datacenter, network issues arose while upgrading a router. This resulted in degraded network performance and reduced service availability in general, with a specific impact on the MSSQL failover cluster for Courseware.

Impact

  • First disruption: May 6, 2025, lasting approximately 6 hours.
  • Second disruption: May 15, 2025, lasting approximately 4 hours.
  • Specific to Courseware: Due to a failure in the MSSQL failover cluster, the system became completely inaccessible, leading to a second outage period. The critical downtime lasted approximately 4 hours.

Resolution and Recovery

The network issues were resolved by manually clearing the ARP tables (the network’s address books) on the affected switches and restarting the routers. For Courseware, the database environment was manually restored, including the transfer of missing log files and reconfiguration of the failover cluster. No data loss occurred.

Preventive Measures

  • The planned replacement of outdated core routers was expedited and implemented immediately.
  • Maintenance procedures have been updated to include additional checks and rollback steps.
  • For Courseware, the MSSQL failover configuration has been changed to manual to ensure more predictable behavior during future network disruptions.
  • Escalation and communication protocols have been tightened to ensure timely customer notifications in the event of incidents.

05-06-2025 · 09:08 CEST
Investigating

Last night we were hit several times by problems with the database cluster. Our hosting provider is doing its utmost to establish the root cause but has not yet succeeded. Meanwhile, all databases are online again and, according to our monitoring, your environment is running as expected. Should you nonetheless experience any problems, please submit a ticket via service.courseware.nl.

14-05-2025 · 09:34 CEST
Monitoring

All sites have been online since yesterday afternoon. The cluster was successfully replaced, and the investigation into the root cause is ongoing.

08-05-2025 · 10:22 CEST
De-escalate

Sites are becoming available again.

Background:
The database has been difficult to reach since the database cluster broke down last night. Resolving this requires a restart, which is currently planned for tonight at 20:00 CEST. If new problems arise before then, the restart will be carried out sooner.

07-05-2025 · 13:28 CEST
Investigating

Multiple PeopleFluent systems are unavailable at the moment. Our hosting provider is investigating. We will update this information as soon as possible.

07-05-2025 · 13:16 CEST
