Last Thursday (10 Jan 2019), starting at 02:30, we experienced an issue that caused a full downtime of ~12 hours and intermittent issues for even longer than that afterwards.
First of all, we are so, so sorry about this. And, in short, it was totally our fault.
Uptime Robot has been available since Jan 2010 and this is the first time we have had such a major problem.
We would like to share what happened and what we’ll be doing to prevent it from repeating:
- Our main DB server became unreachable. We first thought it was a network issue, then discovered that it wasn’t able to boot and later confirmed that the hard disk had problems.
- We were OK as we had the replica DB server, and we decided to make it the master DB server. We couldn’t connect to this server at first, made a power reboot, then connected and made a huge personal mistake here. Before starting the (MySQL) DB server after the reboot, we had to change several of its settings so that it was ready for the live load. Besides a few my.ini changes, we removed the InnoDB logs so that they would be re-created with the right settings. Started the server, all good.. and it stopped by itself. Checked the MySQL error logs and saw that there were sync problems with MySQL’s log sequence number being in the future. The problem is that, with the power reboot, the DB server had been shut down unexpectedly, so we should have started it with the original settings first, then stopped it normally and made the changes afterwards. A simple yet huge mistake.
- After lots of retries with different options (including forcing innodb recovery), some major tables didn’t recover.
- And, we decided to make a full restore from the backups. We take very regular backups. We have 2 types of data:
  - the account settings, monitors, alert contacts.. (backups taken directly to the backup server every 1 hour)
  - and the logs (this data is pretty huge, backups are taken every day to the local server at first so that it is faster, automatically zipped and moved to the backup server afterwards)
- The latest backup of the first type (account settings, monitors, alert contacts) had been taken ~23 minutes before the incident. We restored it.
- The latest logs backup had been taken ~7 hours before the incident. Yet, the zip file was corrupt. So were several of the most recent backup files. The latest healthy logs backup had been taken 7 days earlier.
- We tried to reach the contents of the corrupted backup files with several methods/tools but failed (this process took most of the hours as we wanted to re-enable the DB with the latest logs backup). So, we restored the backup taken 7 days earlier (since that day, we have tried many more tools, suggestions, etc., yet we are convinced that those files are corrupt at their core).
- We made the site live after the restore process but realized that there were many inconsistencies due to the date differences between the backup files used. We worked on creating a tool to remove those inconsistencies, paused the system for another 3 hours the next day, ran this tool to fix all the inconsistencies and made the system live again.
- Looking back at the event calmly, the most logical explanation is that the hard disk had been having issues for several days before failing completely, corrupting the local backups we had taken on it (which we then moved to the backup server already corrupted).
- And, we couldn’t restore the log (up-down) data between 03 Jan and 10 Jan.
This is actually a short summary of the issue we experienced. We made various mistakes:
- Not using RAID (this was due to a negative experience we had with RAID in the past but, on second thought, it would still have been better than having a single corrupted hard disk).
- Handling the promotion of the replica to master badly. We should have had more detailed internal documentation about this process.
- Taking the larger backups locally and then moving them to the backup server.
- Also, we didn’t have a communication tool in place when the system was fully down and user data was unreachable.. which is so wrong.
We are taking several actions to make sure that such a downtime never repeats and any such issue is handled much better:
- The backup procedures have already been changed to include verification of each backup file.
- Getting ready to move all critical servers to RAID setups (will share a scheduled maintenance for it soon).
- Already updated our recovery documentation accordingly and will be documenting such cases in more detail from now on.
- Working on creating a better communication channel that is not tied to our infrastructure.
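As a rough illustration of what verifying a zipped backup file can involve, here is a minimal sketch using only Python's standard `zipfile` module. It is not the actual backup tooling: the function name `verify_zip_backup` and the file names are hypothetical, and the "failing disk" is simulated by truncating a healthy archive.

```python
import os
import tempfile
import zipfile
import zlib


def verify_zip_backup(path):
    """Return True only if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the first bad name, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, OSError):
        return False


# Build one healthy and one truncated archive (hypothetical file names).
workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "logs-good.zip")
bad = os.path.join(workdir, "logs-bad.zip")

with zipfile.ZipFile(good, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("logs.sql", "INSERT INTO logs VALUES (1);\n" * 1000)

with open(good, "rb") as src, open(bad, "wb") as dst:
    data = src.read()
    dst.write(data[: len(data) // 2])  # simulate a backup cut short on disk

good_ok = verify_zip_backup(good)
bad_ok = verify_zip_backup(bad)
print(good_ok, bad_ok)  # True False
```

Running a check like this before a file is moved to the backup server means a corrupted archive is noticed on the day it is created rather than on the day it is needed.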
Very sorry for the trouble again. We learned a lot from it and we can’t thank all Uptime Robot users enough for supporting and helping us during the issue.
This entry was posted on by Umut Muhaddisoglu.
Uptime Robot is already integrated with the major team communication apps and here is another addition: Google Hangouts Chat.
If you already use Hangouts Chat (which is part of Google’s G Suite), the integration can be set up with just these few steps:
- Inside Hangouts Chat, create a new web-hook URL in Room menu>Configure webhooks>Add new.
- Inside Uptime Robot, create a new alert contact in My Settings>Alert Contacts>Add new>Google Hangouts Chat using the previously created Hangouts Chat web-hook URL.
- Attach this new alert contact to the monitors of your choice from add/edit monitor dialogs.
- That is it.
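For the curious, a Hangouts Chat incoming webhook accepts an HTTP POST with a JSON body containing a `text` field. The sketch below only builds such a request and does not send it; the webhook URL is a placeholder, the message wording is made up, and the payload Uptime Robot actually sends may differ.

```python
import json
import urllib.request

# Placeholder: paste the URL generated under "Configure webhooks" here.
WEBHOOK_URL = "https://chat.googleapis.com/v1/spaces/EXAMPLE/messages?key=KEY&token=TOKEN"


def build_chat_request(webhook_url, text):
    """Build (but do not send) a POST request carrying a simple text message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )


request = build_chat_request(WEBHOOK_URL, "Monitor is DOWN: example.com")
# urllib.request.urlopen(request) would deliver the message to the room.
```

Sending a test message like this by hand is a quick way to confirm that the web-hook URL works before attaching the alert contact to monitors.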
Google announced that Chrome would begin distrusting certificates issued by Symantec Corporation’s PKI, and other major browsers have followed this decision.
These are the certificates from Thawte, VeriSign, Equifax, GeoTrust, and RapidSSL that were issued before 1 December 2017.
And, as the distrust is now in effect and visitors are shown an error on these websites (like ERR_CERT_SYMANTEC_LEGACY), Uptime Robot’s SSL monitoring feature now also considers these errors as a reason for downtime.
Such downtimes are displayed as “Distrusted Certificate” in the dashboard and the feature is live in the Pro Plan.
Uptime Robot sends HEAD requests for HTTP monitors and GET requests for keyword monitors by default (and this is a good default setting for most monitors).
On the other hand, there are cases when a customization may be needed, like:
- Checking if a form in the website works as expected
- Monitoring your APIs which expect specific methods (a perfect match together with the custom HTTP headers feature).
So, here comes the HTTP method selection, which enables us to choose the method, set the parameters to be sent (if needed) and also decide whether the data will be sent as JSON or not.
The feature can be reached from Add/Edit Monitor dialogs>Advanced Settings>HTTP Method and also through the API.
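To illustrate what the method and "send as JSON" choices mean on the wire, here is a rough sketch using Python's standard library. The helper `build_check_request`, the URLs and the parameters are all hypothetical; it shows the general difference between form-encoded and JSON bodies, not how Uptime Robot implements its checks.

```python
import json
import urllib.parse
import urllib.request


def build_check_request(url, method="HEAD", params=None, as_json=False):
    """Build (but do not send) a request with the chosen method and body encoding."""
    data, headers = None, {}
    if params:
        if as_json:
            data = json.dumps(params).encode("utf-8")
            headers["Content-Type"] = "application/json"
        else:
            data = urllib.parse.urlencode(params).encode("ascii")
            headers["Content-Type"] = "application/x-www-form-urlencoded"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Hypothetical endpoints and parameters, for illustration only.
default_check = build_check_request("https://example.com/")
form_check = build_check_request(
    "https://example.com/contact", method="POST", params={"email": "a@example.com"}
)
api_check = build_check_request(
    "https://example.com/api/items", method="PUT", params={"id": 1}, as_json=True
)
```

The first request mirrors the default HEAD check, the second resembles a form submission, and the third resembles an API call that expects a JSON body.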
P.S.> HTTP method selection is a Pro-Plan only feature.
There is never “too much” when it comes to the ways of getting notified about an emergency. And, in some cases, this emergency can be a website or server going down.
Today, voice call notifications are added to Uptime Robot and they can be very helpful when we want to make sure that a notification is heard :).
How to add voice call alert contacts?
They are added just like other alert contacts with the steps:
- My Settings>Alert Contacts>New>Voice Call
- Once the number is added, an automated call will be placed instantly to deliver an activation code.
- Click the lock icon beside this newly created alert contact and enter the activation code received in the automated call.
- And, attach this alert contact to the monitors of your choice from the add/edit monitor dialogs.
No one likes to be disturbed and there are ways to make sure the voice calls are received only when there is an important downtime.
Uptime Robot has an advanced notifications feature (in the Pro Plan) to get notified only when the downtime is longer than x minutes and using this feature together with the voice calls will be a smart choice.
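The idea behind "notify only when the downtime is longer than x minutes" can be sketched as follows. The function name, the timestamps and the 10-minute threshold are assumptions for illustration, not the service's internal logic.

```python
from datetime import datetime, timedelta


def should_place_call(down_since, now, threshold_minutes):
    """Return True once the monitor has been down for longer than the threshold."""
    return now - down_since > timedelta(minutes=threshold_minutes)


down_since = datetime(2018, 6, 1, 2, 30)  # arbitrary example timestamp
short_blip = should_place_call(down_since, down_since + timedelta(minutes=3), 10)
long_outage = should_place_call(down_since, down_since + timedelta(minutes=12), 10)
print(short_blip, long_outage)  # False True
```

With a threshold like this, a 3-minute blip stays silent while a 12-minute outage triggers the call.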
Additionally, we may prefer to get voice calls only for the down notifications. A very recently introduced feature enables that too.
The feature is priced just the same as SMS messages (and a call is considered successful only when the call is answered).
P.S.> As a reminder, it is now possible to get 2x SMS or voice calls for the same price (more details).
Uptime Robot sends both down and up notifications by default for each alert contact type.
However, there can be cases where you may only need the down or up notifications like:
- minimizing the SMS use
- handling only the up or down events via web-hooks
- getting only the SSL expiry notifications but ignoring up/down events
There is now an option inside the new/edit alert contact dialog where we can choose to disable down or up notifications for a given alert contact.
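Conceptually, the option works like a per-contact switch for each event type. Here is a small sketch of that idea; the `AlertContact` class and its field names are hypothetical and are not part of the Uptime Robot API.

```python
from dataclasses import dataclass


@dataclass
class AlertContact:
    """Hypothetical model of an alert contact with per-event switches."""
    name: str
    notify_down: bool = True
    notify_up: bool = True

    def wants(self, event):
        """Return True if this contact should be notified about 'down' or 'up'."""
        if event == "down":
            return self.notify_down
        if event == "up":
            return self.notify_up
        raise ValueError(f"unknown event: {event!r}")


contacts = [
    AlertContact("email"),                       # default: both events
    AlertContact("sms", notify_up=False),        # down only, to save SMS credits
    AlertContact("webhook", notify_down=False),  # up only
]

recipients_down = [c.name for c in contacts if c.wants("down")]
recipients_up = [c.name for c in contacts if c.wants("up")]
print(recipients_down, recipients_up)  # ['email', 'sms'] ['email', 'webhook']
```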
Simple..but it can be powerful :).
Uptime Robot supports multiple methods to get notified about downtimes on mobile (SMS, push notifications via the mobile app, Pushbullet, Pushover or Boxcar).
And, SMS is one of the most reliable notification methods, especially when no data plan exists.
Also, many users prefer to configure their SMS alert contacts as:
- “alert if down for x minutes” (where x is usually 10+) (check this feature)
to make sure that they are notified of longer/important downtimes in case there is no data connectivity.
And, the users can now add 2x more SMS messages for the same price and here are the updated prices:
- 100 SMS – $15
- 200 SMS – $25
- 500 SMS – $55
- 1000 SMS – $100
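As a quick check of what each package works out to per message, the unit prices can be computed directly from the list above (the package sizes and prices come from the list; the variable names are just for this example).

```python
packages = {100: 15, 200: 25, 500: 55, 1000: 100}  # SMS count -> price in USD

unit_prices = {count: price / count for count, price in packages.items()}

for count, unit in unit_prices.items():
    print(f"{count} SMS: ${unit:.3f} per message")
```

This gives $0.150, $0.125, $0.110 and $0.100 per message respectively, so the larger packages are cheaper per SMS.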
Also, the number of SMS messages included in each Pro Plan is doubled as well.
We’ll also be introducing a few new features (very soon) to make sure you can use the SMS messages more effectively. Yay!
Getting the notifications inside the team communication app you use can make things easier, as this may be where you’ll be discussing how to get that website/server back online.
Besides the Slack integration, Uptime Robot now has support for Microsoft Teams too.
It simply works by creating an incoming webhook URL in the Teams app and creating a new alert contact in Uptime Robot (My Settings>Alert Contacts>New) using this webhook URL.
After that, just attach this alert contact to the monitors of your choice and the notifications will be delivered to the preferred team.
In case you haven’t seen or used it, there is a “Bulk Actions link” just under the “Add Monitor button” in the left sidebar.
It simply opens the “Bulk Actions dialog” and presents a set of actions that can be applied to monitors in bulk.
This feature is now more powerful with a few important additions including:
- support for maintenance windows
- support for SSL settings
- applying the actions only to selected monitors (besides all monitors)
- choosing to overwrite or apply by preserving the previous settings
Hope that they will help and we are already working on the expected addition.. which is “bulk importing monitors” :).
Thanks to Telegram’s bot framework and API support, it is now possible to get down/up/SSL notifications via the Telegram messaging app.
The usage is pretty simple:
- go to My Settings>Alert Contacts>New>Telegram
- create the alert contact
- click the unique Telegram link created
- press /start button displayed in the Telegram dialog
- and, you are all set, just attach this alert contact to the monitors of your choice via add/edit monitor dialogs.
Hope that this new feature helps you get better notifications.