Quick hack to monitor the health of a collection of AWS services

My SaaS app WP Lookout uses Amazon Web Services for hosting. I’ve had plenty of monitoring set up using both internal and external tools notifications, and these mostly output to a combination of a dedicated Slack channel and/or a Pushover notification that goes to my mobile device.

The one thing I didn’t have in place until recently was monitoring for the health of the specific group of AWS services the app depends on. With AWS offering many services spread across many parts of the world (as illustrated in the Very Long Service Health Dashboard), I wasn’t about to start monitoring all of the AWS infrastructure just to see if my particular app was affected.

They do offer a “Personal Health Dashboard” that’s supposedly more tailored to incidents affecting my AWS account, but I find it unintuitive and confusing to use, especially when it comes to setting up notifications to external services like Slack. In my online searching I found various services and tools that supposedly integrated the PHD into more standard monitoring and alerting options, but they were also either too complicated or seemed to require additional levels of paid AWS support plans that I didn’t want to take on.

So I decided to keep things simple and use the tried and true technology for polling information about external resources: RSS!

Continue reading Quick hack to monitor the health of a collection of AWS services

Getting alerts about outdated Time Machine backups

Our household uses Apple’s Time Machine backup system as one of our backup methods. Various Macs are set to run their backups at night (we use TimeMachineEditor to schedule backups so they don’t take up local bandwidth or computing resources during the day) and everything goes to a Synology NAS (basically following the steps in this great guide from 9to5Mac).

But, Time Machine doesn’t necessarily complete a successful backup for every machine every night. Maybe the Mac is turned off or traveling away from home. Maybe the backup was interrupted for some reason. Maybe the backup destination isn’t on or working correctly. Most of these conditions can self-correct within a day or two.

But, I wanted to have a way to be notified if any given Mac had not successfully completed a backup after a few days so that I could take a look. If, say, three days go by without a successful backup, I start to get nervous. I also didn’t want to rely on any given Mac’s human owner/user having to notice this issue and remember to tell me about it.

I decided to handle this by having a local script on each machine send a daily backup status to a centralized database, accomplished through a new endpoint on my custom webhook server. The webhook call stores the status update locally in a database table. Another script runs daily to make sure every machine is current, and sends a warning if anyone is running behind.

Here are the details in case it helps anyone else.

Continue reading Getting alerts about outdated Time Machine backups

Monitoring WordPress events and status with a custom API endpoint

Let’s say your WordPress site has some set of custom functionality that is important to the overall operation of the site, and you want to know right away if it’s not working as expected, even if the site is otherwise “up” and working fine. There could be anywhere from 0 to many things needing attention at any given time, and you don’t want to receive a flood of emails or Slack pings that you have to sort through, you just want a single alert that things are off track, and another notice when they’re back to being in good shape.

I recently handled this case using transients, a custom REST API endpoint, and the service UptimeRobot. The context for me was a set of functions that regularly retrieve information from a variety of third-party sources; most of the time it goes fine, but between network issues, changes in third-party API endpoints or HTML source code and other possible errors, occasionally these functions would break and need updating.

First, I established an error function that was called any time some aspect of my site’s custom functionality encountered a problem that might need my attention.

public static function record_event_fetch_failure( $source = null, $message = null ) {

	if ( empty( $source ) ) {
		return false;
	}

	$source         = esc_attr( $source );
	$transient_name = 'event_fetch_failure_' . $source;

	$failure_data = array(
		'count'                => 1,
		'last_error_timestamp' => gmdate( 'Y-m-d H:i:sP', time() ),
		'last_error_message'   => esc_html( $message ),
	);

	// If the transient is already there, update it.
	$event_failure_counter = get_transient( $transient_name );
	if ( false !== $event_failure_counter ) {
		$failure_data['count'] = $event_failure_counter['count'] + 1;
	}
	set_transient( $transient_name, $failure_data, 24 * HOUR_IN_SECONDS );

}

When called, this function increases the counter of the number of errors interacting with a passed third party data source, storing that counter in a transient. As my custom functions run, any failures will be recorded for up to 24 hours. If there are no additional failures to increase the counter and extend the transient expiration time, then the failure data will go away with the assumption that things are back to normal now. (You may need to adjust these assumptions for your use case.)

Then, I create a REST API endpoint on the site that allows me to monitor that failure data externally.

add_action( 'rest_api_init', function() {
			register_rest_route(
				'mysite/v1',
				'/event_fetch_status',
				array(
					'methods'  => 'GET',
					'callback' => array( $this, 'mysite_event_fetch_status' ),
				)
			);
		} );

And then a callback function to determine the content of that API endpoint:

public static function mysite_event_fetch_status() {

	$error_count = 0;

	foreach ( array( 'facebook', 'eventbrite', 'googlecal' ) as $event_source ) {
		$fail_data = get_transient( 'event_fetch_failure_' . $event_source );
		if ( false !== $fail_data ) {
			$error_count += $fail_data['count'];
		}
	}

	if ( 0 < $error_count ) {
		echo sprintf( 'There have been %d recent event fetch errors.', (int) $error_count );
	} else {
		echo 'OK';
	}
}

Now, I have a REST API endpoint available at https://example.com/wp-json/mysite/v1/event_fetch_status that will either return OK if there have been no recent problems, or an error message with a count of recent issues. I could expand that output to include more detail about which third-party services are having issues and what those issues are, but for the purposes of a red versus green monitoring setup, the basics are fine and I can look into the details when I investigate.

Finally, I set up a monitor in UptimeRobot to check that endpoint on a regular basis and notify me if there’s a problem:

UptimeRobot monitor screenshot
UptimeRobot monitor screenshot

Just for good measure, I also create an admin notice in the WordPress admin area with a little more detail about what is failing:

public static function mysite_event_fetch_admin_notice() {

	$error_count    = 0;
	$error_messages = array();

	foreach ( array( 'facebook', 'eventbrite', 'googlecal' ) as $event_source ) {
		$fail_data = get_transient( 'event_fetch_failure_' . $event_source );
		if ( false !== $fail_data ) {
			$error_count     += $fail_data['count'];
			$error_messages[] = $fail_data['last_error_timestamp'] . ': ' . $fail_data['last_error_message'];
		}
	}

	if ( 0 < $error_count ) {
		echo '<div class="notice notice-warning">';
		echo sprintf( '<p>There have been %d recent event fetch errors.</p>', (int) $error_count );
		echo '<ul>';
		foreach ( $error_messages as $message ) {
			echo sprintf( '<li>%s</li>', esc_html( $message ) );
		}
		echo '</ul>';
		echo '</div>';
	}
}

add_action( 'admin_notices', array( $this, 'mysite_event_fetch_admin_notice' ) );

All put together, I will now receive alerts as configured in UptimeRobot when my custom functions have issues.

You could go the typical route of generating an email or Slack message about each problem, but in my experience this can quickly create a lot of one-off monitoring and alerting configurations in your life, and that can lead to you missing important information or being desensitized to the notices. Instead, I find it’s worth trying to manage all of my time-sensitive notifications across all of my various projects and services in one place where possible, and UptimeRobot or similar services offer a lot of flexibility for that.