24 November 2024 Incident Post-Mortem

A severe power outage caused by a power surge occurred on 24 November 2024 at 10:21 UTC. It took down my OPNsense router, which left the API server unreachable, so I could not perform a graceful shutdown in time.

This is the most severe power incident I have experienced so far: the outage lasted around 5 hours before the power supply was restored. The surge blew the fuse at the electric meter, one Ethernet port on the router that was connected to the UPS (and through it, indirectly, to the ISP modem, which was not affected), and many other unrelated appliances.

Graphic Content Warning

The worst part was the living room power plugs, which went up in flames, frying a TV, an HDMI cable and a GT 1030 that was connected to it. The rest of the computer was fine and everyone here is safe.

Cause (and fix)

Investigation pointed to faulty earthing equipment on-premises, which was likely also the cause of most previous (less severe) incidents. This issue was rectified a few days ago.

Affected services and recovery

The techcoderx.com API node and the VSC-HAF API powering VSC Blocks were the services primarily affected by this incident. As I was unable to perform a graceful shutdown in time and move the shared_memory.bin file from RAM to disk, the node required a replay. All apps (excluding Hivemind) have been replayed and are in sync.

While at it, I updated the stack to v1.27.6rc9 with the latest REST APIs. The HFM upgrade script in this RC has an issue that breaks the HAF database and prevents apps from syncing. I worked around it by commenting out the script in the Docker entrypoint within the running container, then running docker commit to save an image containing the change. This must be done during the first run, before any restart, so that the problematic script never executes. The workaround is needed until the fix is rolled out in the next release.

The witness node was unaffected as it was running on a lite node on a cheap VPS. It missed a few blocks over the multiple weeks it was running there, but that is better than it being caught in the outage and missing more blocks before I could disable it while fixing things. The witness is now running on-premises again on a different hived instance, and the watcher script will be deployed for failover in case of any future issues.

The WAN interface on the router has been re-assigned and connected directly to the modem instead of going through the UPS.

Next steps

I will update this post with the usual statistics below once Hivemind finishes syncing. Hivemind sync was only started after all other apps had finished syncing, to speed those up and restore the VSC-HAF API and HAF block explorer as soon as possible.

Maintaining an offline copy of the HAF database snapshot is infeasible as it would need to be updated constantly for incoming blocks and HAF updates. The time taken to export the snapshot may not be worth the potential downtime savings. The alternative would be to sync another copy of the HAF node with the same apps on another server, which would add redundancy at the cost of additional hardware.


Witness performance

Current rank: 27th
Votes: 64,554 MVests
Voter count: 180

Producer rewards (7 days): 462.395 HP
Producer rewards (30 days): 2,000.228 HP
Missed blocks (all-time): 37

Server resource statistics

hived (v1.27.6rc9)

block_log file size (compressed): 493 GB
block_log.artifacts file size: 2.1 GB
shared_memory.bin file size: 23 GB

HAF db

Each HAF app has its own schema in a single PostgreSQL database, alongside the hafd schema, which holds data such as app state providers. This section shows the size of each schema in the database using the following query:

SELECT schemaname,
    pg_size_pretty(SUM(pg_total_relation_size(relid))) AS total_size,
    pg_size_pretty(SUM(pg_table_size(relid))) AS table_size,
    pg_size_pretty(SUM(pg_indexes_size(relid))) AS indexes_size
FROM pg_catalog.pg_statio_user_tables
GROUP BY schemaname;

Output

          schemaname          | total_size | table_size | indexes_size 
------------------------------+------------+------------+--------------
 hafbe_bal                    | 88 GB      | 66 GB      | 22 GB
 hafd                         | 3001 GB    | 1911 GB    | 1090 GB
 reptracker_app               | 17 GB      | 17 GB      | 462 MB
 hivemind_app                 | 753 GB     | 412 GB     | 340 GB
 hivemind_postgrest_utilities | 64 kB      | 48 kB      | 16 kB
 hafah_python                 | 16 kB      | 16 kB      | 0 bytes
 btracker_account_dump        | 8192 bytes | 0 bytes    | 8192 bytes
 vsc_app                      | 54 MB      | 46 MB      | 9016 kB
 reptracker_account_dump      | 8192 bytes | 0 bytes    | 8192 bytes
 hafbe_app                    | 839 MB     | 610 MB     | 230 MB
 hafbe_backend                | 32 kB      | 16 kB      | 16 kB
 cron                         | 320 kB     | 272 kB     | 48 kB
(12 rows)

Disk usage

Compressed tablespace: 1.8 TB
Compression ratio: 2.11x
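
As a rough cross-check of the compression ratio, the per-schema query above can be collapsed into a single total. This is only a sketch: it assumes the ratio compares the size PostgreSQL reports against the compressed size on disk, which the figures here do not confirm, and it counts user tables only (with their indexes and TOAST data), so it will not match the filesystem's own accounting exactly.

```sql
-- Total size of all user tables across every schema, as reported by
-- PostgreSQL (tables, indexes and TOAST data combined).
SELECT pg_size_pretty(SUM(pg_total_relation_size(relid))) AS total_size
FROM pg_catalog.pg_statio_user_tables;
```

Dividing this total by the compressed tablespace size gives an approximate ratio to compare with the reported figure.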


Sorry to hear this, happy you're all safe though. Electric fires are a nightmare.

Indeed, it was a big mess in that area and most of the unrelated stuff isn't mine (I don't use that space most of the time). This incident has been fully resolved.
