Here is a small, handy addition that gives accounts an extra level of security.
Two-factor authentication is a widely used mechanism for making sure that an account is only reachable by its owner, and Uptime Robot now supports it.
The feature can be found in the My Settings>Two-Factor Authentication (2FA) menu, and you can activate it with your favorite authenticator app, such as Google Authenticator or Authy.
Once activated, the login pages will ask for an authentication code in addition to the password. The code can only be generated with your authenticator app. We are also working on adding this check to various other actions besides login.
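For readers curious about what the authenticator app does, here is a minimal sketch of the time-based one-time password (TOTP, RFC 6238) calculation that apps such as Google Authenticator implement. It is an illustration only: this post does not describe Uptime Robot's server-side implementation, the function name `totp` is made up for the example, and the secret below is the RFC's published test value rather than a real account secret.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Return the TOTP code (RFC 6238, HMAC-SHA1) valid at the given moment."""
    now = time.time() if at is None else at
    counter = int(now // step)  # number of 30-second steps since the Unix epoch
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, as defined in RFC 4226
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # RFC 6238 test secret; a real secret comes from the QR code shown at setup.
    print(totp(b"12345678901234567890"))
```

Because both sides derive the code from a shared secret and the current time, the login page can verify it without the app ever contacting the server.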
Hope it makes for a better experience :).
Uptime Robot can already check the status of servers/devices that have public IPs with its ping and port monitoring features.
Yet, there are many other servers/computers/devices that are inside an intranet (but connected to the internet) and need to be monitored.
It is now possible to monitor such endpoints using heartbeat monitoring.
The feature works the opposite way from other monitoring types.
Uptime Robot provides a unique URL for each heartbeat monitor created and expects the monitored item to send regular requests to this URL.
If an expected request doesn’t arrive on time, the monitor is marked as down.
Heartbeat monitoring is not only ideal for monitoring servers/computers inside an intranet but also a great fit for monitoring the health of the regular/cron jobs your website/app may be performing.
As an example, if your app runs a cron job that deletes old logs every 10 minutes, you can update the code to send an HTTP request to the heartbeat monitor’s URL each time that cron job has run. If the heartbeat monitor goes down, you know that the cron job may be having problems.
How to use the feature?
Heartbeat monitoring is available in the Pro Plan and it works with these steps:
- Create a new heartbeat monitor using the Add New Monitor dialog
- Get the URL of the heartbeat monitor created in the same dialog
- Set up a cron job (or a scheduled task in Windows) that sends an HTTP request to this heartbeat URL every x minutes (where x is the interval selected for the monitor)
- That is it.
For more details, please check the docs for creating cron jobs in Unix/Linux and scheduled tasks in Windows.
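As a rough illustration of the steps above, the job itself can send the heartbeat once its work is done. The sketch below is only one way to do it: `HEARTBEAT_URL` is a placeholder for the URL shown in the Add New Monitor dialog, and `delete_old_logs` and `run_job_then_ping` are hypothetical names used for the example.

```python
import urllib.request

# Placeholder: replace with the URL shown in the Add New Monitor dialog.
HEARTBEAT_URL = "https://example.com/heartbeat/your-unique-id"


def delete_old_logs() -> None:
    """Stand-in for the real work of the scheduled job."""


def run_job_then_ping(job, url: str, opener=urllib.request.urlopen) -> bool:
    """Run the job and ping the heartbeat URL only if the job did not raise."""
    try:
        job()
    except Exception:
        # No ping is sent, so the monitor goes down once the interval passes.
        return False
    try:
        with opener(url, timeout=10):
            pass
    except OSError:
        # The job succeeded but the heartbeat request could not be delivered.
        return False
    return True
```

Scheduling a script that calls `run_job_then_ping(delete_old_logs, HEARTBEAT_URL)` at the monitor's interval means a failing or missing job shows up as a missed heartbeat.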
P.S. The feature is in beta and we look forward to any feedback/suggestions.
A status page is an easy-to-set-up, nice and automated way to share the status of websites/servers with visitors, users and teammates.
And, the ability to share additional info with users, like current issues or upcoming maintenance, can only make it better.
We have just added this feature (in the Pro Plan): it is now possible to add announcements to the status pages, and the feature can be viewed in action at Uptime Robot’s status page.
How to use it?
The feature can be reached from My Settings>Public Status Pages>Any status page>Announcement icon.
There are currently 3 announcement types, which cover almost all use cases, and we are open to suggestions for additional ones.
An announcement can be set to be published and auto-resolved/archived at a future date.
We hope it helps you communicate better with users and teammates.
Uptime Robot has so far provided 2 api_key types: a master api_key and monitor-specific api_keys.
The master api_key can be used to perform almost every action that exists in the dashboard, and it must not be revealed, for the security of the account.
Yet, there are cases where an api_key may need to be revealed, like using it in client-side code or sharing it with customers. For this reason, monitor-specific api_keys (which can only use the getMonitors method for the given monitor) were introduced, and they help a lot.
A new api_key type is being added today to simplify usage further: the Read-Only Api-Key.
Similar to monitor-specific api_keys, it can only use the getMonitors method. Yet, it supports fetching data for all the monitors, which makes it ideal for sharing with teammates or for client-side code that needs to deal with the data of multiple monitors.
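As a hedged sketch, a getMonitors call with a read-only api_key might look like the following. It assumes the v2 API endpoint and its form-encoded `api_key` and `format` parameters; the helper names are made up for the example, so please check the API docs for the exact request and response shape.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the v2 API endpoint for the getMonitors method.
API_URL = "https://api.uptimerobot.com/v2/getMonitors"


def build_get_monitors_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a getMonitors request for the given api_key."""
    body = urllib.parse.urlencode({"api_key": api_key, "format": "json"}).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_monitors(api_key: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_get_monitors_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Since the read-only key cannot change anything, exposing it in this kind of client code carries far less risk than exposing the master api_key.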
Hope it helps :).
Caching is a great way to improve website performance and minimize the load.
An ideal cache displays the cached version until the content changes and flushes the cache when there is a change. Yet, there may be cases where the cached version is not the most up-to-date one (for example, if there is a DB error on the site or the caching is only time-based).
And, we may want Uptime Robot to load the non-cached version on each request to make sure that the uptime/downtime is decided accordingly.
Here is a tiny feature (a pro tip) that can help bypass the cache.
Uptime Robot will auto-replace the string:
in the querystring with a unique timestamp every time so that each request is unique.
As an example, if the website to be monitored is:
We can use the URL as:
and the request will have a different querystring each time.
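The general idea can be sketched in a few lines. Note that the placeholder token below, `{TIMESTAMP}`, and the function name are hypothetical and chosen only for illustration; they are not the actual string Uptime Robot replaces.

```python
import time
from typing import Optional

# Hypothetical token for illustration; not Uptime Robot's actual string.
PLACEHOLDER = "{TIMESTAMP}"


def with_unique_timestamp(url_template: str, now: Optional[float] = None) -> str:
    """Replace the placeholder with the current time in milliseconds."""
    moment = time.time() if now is None else now
    return url_template.replace(PLACEHOLDER, str(int(moment * 1000)))
```

Because the querystring differs on every request, a cache keyed on the full URL has no stored copy to serve, and the request reaches the origin server.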
Hope this helps for a better uptime.
All requests sent from Uptime Robot use pre-defined IPs so that the source of the requests is always known.
A new IP block is being added to the system:
If you maintain a whitelist, please make sure the new block is whitelisted in your firewalls.
Also, the full list of the IPs used can be found here.
P.S. If you have never needed to whitelist or take any action regarding Uptime Robot’s IPs in the past, then you probably don’t need to take any action now and can ignore this info.
Uptime Robot treats all HTTP statuses equally. They mean either up or down… except HTTP 401.
HTTP 401 is expected in some situations and not expected in others. Currently, HTTP 401 is handled as:
- if auth info is set in the monitor’s settings but HTTP 401 is returned, the monitor is marked as down
- if no auth info is set but HTTP 401 is returned, it is marked as up
which looked like the best approach in the early days of Uptime Robot.
Yet, there are edge cases in both scenarios. For example, “a monitor with no auth info returning HTTP 401” may also mean that the site/server is experiencing configuration issues, and this must be detected as down.
As there is now a Pro Plan feature to customize HTTP statuses, Uptime Robot will start treating HTTP 401 just like other HTTP statuses that are greater than or equal to 400:
- it will be considered as down by default, whether or not auth info exists
- if needed, it will be customizable with the HTTP status customization feature.
This change leaves room for handling this HTTP status however preferred, and it is planned to go live on 1 March 2019.
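The difference between the two rules can be summarized in a small sketch. The function names are hypothetical, and treating every status below 400 as up is an assumption made to keep the example short.

```python
from typing import Optional, Set


def is_up_legacy(status: int, has_auth_info: bool) -> bool:
    """Rule before the change: the HTTP 401 result depends on auth info."""
    if status == 401:
        return not has_auth_info
    return status < 400  # assumption: statuses below 400 count as up


def is_up(status: int, up_statuses: Optional[Set[int]] = None) -> bool:
    """Rule after the change: 401 is handled like any other status >= 400."""
    if up_statuses is not None and status in up_statuses:
        return True  # customized with the HTTP status customization feature
    return status < 400  # assumption: statuses below 400 count as up
```

With the new rule, `is_up(401)` is down by default, while `is_up(401, up_statuses={401})` mirrors a monitor where HTTP 401 has been customized to count as up.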
Last Thursday (10 Jan 2019), starting at 02:30, we experienced an issue that caused a full downtime of ~12 hours and intermittent issues for even longer afterwards.
First of all, we are so, so sorry about this. In short, it was entirely our fault.
Uptime Robot has been available since Jan 2010 and this is the first time we have had such a major problem.
We would like to share what happened and what we’ll be doing to prevent it from repeating:
- Our main DB server became unreachable. We first thought it was a network issue, then discovered that it wasn’t able to boot and later confirmed that the hard disk had problems.
- We were OK, as we had the replica DB server, and decided to make it the master DB server. We couldn’t connect to this server at first, did a power reboot, then connected, and made a huge personal mistake here. Before starting the (MySQL) DB server after the reboot, we had to change several of its settings so that it was ready for the live load. Besides a few my.ini changes, we removed the InnoDB logs so that they would be re-created with the right settings. We started the server, all good... and then it stopped by itself. We checked the MySQL error logs and saw that there were sync problems, with MySQL’s log sequence number being in the future. The problem is that, with the power reboot, the DB server had been shut down unexpectedly; we should have started it with the original settings, stopped it normally and only then made the changes. A simple yet huge mistake.
- After lots of retries with different options (including forcing innodb recovery), some major tables didn’t recover.
- And, we decided to make a full restore from the backups. We take very regular backups. We have 2 types of data:
- the account settings, monitors, alert contacts, etc. (backups taken directly to the backup server every hour)
- and the logs (this data is pretty huge, backups are taken every day to the local server at first so that it is faster, automatically zipped and moved to the backup server afterwards)
- The latest backup had been taken ~23 minutes before the incident. We restored it.
- The latest logs backup had been taken ~7 hours before the incident. Yet, the zip file was corrupt, and so were several of the most recent backup files. The latest healthy logs backup had been taken 7 days earlier.
- We tried to reach the contents of the corrupted backup files with several methods/tools but failed (this process took most of the hours, as we wanted to re-enable the DB with the latest logs backup). So we restored the backup taken 7 days earlier (since that day, we have tried many more tools, suggestions, etc., yet we are convinced that those files are corrupt at their cores).
- We made the site live after the restore process but realized that there were many inconsistencies due to the date differences between the backup files used. We worked on creating a tool to remove those inconsistencies, paused the system for another 3 hours the next day, ran this tool to fix all the inconsistencies and made the system live again.
- After the event, when looking at it calmly, the most logical explanation is that the hard disk had been having an issue for several days before going down completely, corrupting the local backups we had taken on it (which we then moved to the backup server already corrupted).
- And, we couldn’t restore the log (up-down) data between 03 Jan and 10 Jan.
This is actually a short summary of the issue we experienced. We made various mistakes:
- Not using RAID (this was due to a negative experience we had with RAID in the past but, thinking twice, it would still have been better than having a single corrupted hard disk).
- Handling the promotion of the replica to master badly. We should have had more detailed internal documentation about this process.
- Taking larger backups locally and then moving to the backup server.
- Also, we didn’t have a communication tool in place for when the system was fully down and user data was unreachable, which is so wrong.
We are taking several actions to make sure that such a downtime never repeats and any such issue is handled much better:
- The backup scenarios have already been changed to include verification of each backup file.
- Getting ready to move all critical servers to RAID setups (will share a scheduled maintenance for it soon).
- Already updated our recovery documentation accordingly and will be documenting such cases in more detail from now on.
- Working on creating a better communication channel that is not tied to our infrastructure.
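To give an idea of what verifying a backup file can involve, here is a minimal sketch for zip archives using Python's standard `zipfile` module. It is not a description of our actual backup scripts, and the function name is made up for the example.

```python
import zipfile


def zip_backup_is_healthy(path: str) -> bool:
    """Return True only if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        # Not a readable zip file at all (truncated, missing, unreadable disk).
        return False
```

Running a check like this right after each backup is written, and again after it is moved, would flag a corrupt archive on the day it is created rather than on the day it is needed.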
Very sorry for the trouble again. We learned a lot from it, and we can’t thank all Uptime Robot users enough for supporting and helping us during the issue.
This entry was posted on by Umut Muhaddisoglu.
Uptime Robot is already integrated with the major team communication apps and here is another addition: Google Hangouts Chat.
If you already use Hangouts Chat (which is part of Google’s G Suite), the integration can be set up with just these few steps:
- Inside Hangouts Chat, create a new web-hook URL in Room menu>Configure webhooks>Add new.
- Inside Uptime Robot, create a new alert contact in My Settings>Alert Contacts>Add new>Google Hangouts Chat using the previously created Hangouts Chat web-hook URL.
- Attach this new alert contact to the monitors of your choice from add/edit monitor dialogs.
- That is it.
Google announced that Chrome would begin distrusting certificates issued by Symantec Corporation’s PKI, and other major browsers have followed the decision.
These are the certificates by Thawte, VeriSign, Equifax, GeoTrust, and RapidSSL that were issued before 1 December 2017.
And, as the distrust is in effect and visitors are shown an error on these websites (like ERR_CERT_SYMANTEC_LEGACY), Uptime Robot’s SSL monitoring feature now also considers these errors as a reason for downtime.
Such downtimes are displayed as “Distrusted Certificate” in the dashboard and the feature is live in the Pro Plan.