A Downtime, What Happened and… Very Sorry.

Last Thursday (10 Jan 2019), starting at 02:30, we experienced an issue that caused a full downtime of ~12 hours, followed by intermittent issues that lasted even longer.

First of all, we are so, so sorry about this. And, in short, it was totally our fault.

Uptime Robot has been available since Jan 2010, and this is the first time we have had such a major problem.

We would like to share what happened and what we’ll be doing to prevent it from repeating:

  • Our main DB server became unreachable. We first thought it was a network issue, then discovered that it wasn’t able to boot, and later confirmed that the hard disk had problems.
  • We were OK, as we had the replica DB server, and decided to make it the master DB server. We couldn’t connect to this server at first, did a power reboot, then connected and made a huge personal mistake. Before starting the (MySQL) DB server after the reboot, we had to change several of its settings so that it was ready for the live load. Besides a few my.ini changes, we removed the InnoDB logs so that they would be re-created with the right settings. We started the server, all good… and then it stopped by itself. We checked the MySQL error logs and saw sync problems, with MySQL’s log sequence number being in the future. The problem is that the power reboot had shut the DB server down uncleanly, so we should have started it with the original settings first, stopped it normally, and only then made the changes. A simple yet huge mistake.
  • After lots of retries with different options (including forcing innodb recovery), some major tables didn’t recover.
  • So we decided to do a full restore from the backups. We take very regular backups. We have 2 types of data:
    • the account settings, monitors, alert contacts… (backups taken directly to the backup server every hour)
    • and the logs (this data is pretty huge, so backups are taken every day to the local server first because it is faster, then automatically zipped and moved to the backup server)
      • The latest account settings backup had been taken ~23 minutes before the incident. We restored it.
      • The latest logs backup had been taken ~7 hours before the incident. Yet, the zip file was corrupt, and so were several of the other recent backup files. The latest healthy logs backup had been taken 7 days earlier.
  • We tried to read the contents of the corrupted backup files with several methods/tools but failed (this process took most of the hours, as we wanted to re-enable the DB with the latest logs backup). In the end, we restored the backup taken 7 days earlier (since that day, we have tried many more tools and suggestions, yet we are convinced that those files are corrupt at their core).
  • We made the site live after the restore process but realized that there were many inconsistencies due to the date differences between the backup files used. We worked on a tool to remove those inconsistencies, paused the system for another 3 hours the next day, ran this tool to fix all the inconsistencies and made the system live again.
  • Looking at the event calmly afterwards, the most logical explanation is that the hard disk had been having issues for several days before going down completely, and that it corrupted the local backups we had taken on it (which we then moved to the backup server already corrupted).
  • As a result, we couldn’t restore the log (up-down) data between 03 Jan and 10 Jan.
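
The ordering problem described in the replica promotion step above can be illustrated with a small guard. This is only a sketch in Python under stated assumptions: `ReplicaState`, `promote` and the step names are hypothetical and model the manual procedure, not any real MySQL API or the tooling actually used during the incident.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaState:
    """Hypothetical model of a replica that is about to be promoted."""
    clean_shutdown: bool              # False after a power reboot or crash
    running: bool = False
    log: list = field(default_factory=list)

    def start(self) -> None:
        # Starting with the ORIGINAL settings lets crash recovery run
        # against redo logs that still match the data files.
        self.running = True
        self.log.append("start")

    def stop_normally(self) -> None:
        # A normal stop leaves the data files consistent on disk.
        self.running = False
        self.clean_shutdown = True
        self.log.append("stop")

    def change_settings_and_remove_redo_logs(self) -> None:
        # Guard: only safe when the server is stopped AND the last
        # shutdown was clean.
        if self.running or not self.clean_shutdown:
            raise RuntimeError(
                "unsafe: run a start/stop cycle with original settings first"
            )
        self.log.append("reconfigure")


def promote(replica: ReplicaState) -> list:
    """Return the ordered steps taken to prepare the replica for live load."""
    if not replica.clean_shutdown:
        replica.start()          # recover with the original settings
        replica.stop_normally()  # then shut down cleanly
    replica.change_settings_and_remove_redo_logs()
    replica.start()              # start again with the new settings
    return replica.log
```

Calling `promote(ReplicaState(clean_shutdown=False))` goes through the extra start/stop cycle before reconfiguring, which is the step that was skipped during the incident.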

This is actually a short summary of the issue we experienced. We made various mistakes:

  • Not using RAID (this was due to a negative experience we had with RAID in the past but, on reflection, it would still have been better than having a single corrupted hard disk).
  • Handling the promotion of the replica to master badly. We should have had more detailed internal documentation of this process.
  • Taking larger backups locally and then moving to the backup server.
  • Also, we didn’t have a communication tool in place when the system was fully down and user data was unreachable.. which is so wrong.

We are taking several actions to make sure that such a downtime never repeats and any such issue is handled much better:

  • The backup procedures have already been changed and now include verification of each backup file.
  • Getting ready to move all critical servers to RAID setups (will share a scheduled maintenance for it soon).
  • Already updated our recovery documentation accordingly and will be documenting such cases in more detail from now on.
  • Working on creating a better communication channel that is not tied to our infrastructure.
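
As an illustration of what per-file backup verification could look like, here is a minimal sketch in Python. It is not the actual procedure now in place: the function names are hypothetical, and `remote_dir` is a plain directory standing in for the backup server. It tests the zip archive before shipping it, so a backup corrupted on the local disk is rejected instead of being moved as-is, and then compares checksums of the source and the copy.

```python
import hashlib
import shutil
import zipfile
import zlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def zip_is_healthy(path: Path) -> bool:
    """True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the first corrupt member name, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, OSError):
        return False


def ship_backup(local: Path, remote_dir: Path) -> Path:
    """Verify a local zip, copy it, and confirm the copy is identical."""
    if not zip_is_healthy(local):
        raise ValueError(f"refusing to ship corrupt backup: {local}")
    target = remote_dir / local.name
    shutil.copy2(local, target)
    if sha256_of(local) != sha256_of(target):
        raise ValueError(f"copy does not match source: {target}")
    return target
```

A check like this only proves that the archive is readable. A periodic test restore is still the stronger guarantee that a backup is usable.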

Very sorry again for the trouble. We learned a lot from it, and we can’t thank all Uptime Robot users enough for supporting and helping us during the issue.

Comments

  1. Stephane Lavergne

    Thank you for this detailed and honest account of events. Silent data corruption is a big headache; glad to see you’ll be validating backups more actively going forwards. If your setup allows it, you might even want to go as far as switching to your auxiliary DB voluntarily once in a while, in production; no better way to keep on top of surprises.

    1. Umut Muhaddisoglu Post author

      Hi Stephane,

      Our current setup doesn’t allow it (without downtime) because it is master-slave rather than master-master. Yet, we totally hear the point and are now working on a more redundant structure that will allow us to do that.

      1. Ian McGowan

        I am a new customer, have been in your shoes many times, and appreciate the writeup and the honesty. A lot of this is only learned from bitter experience.

        So let me share my bitter experience – if you’re not actively using your backup/DR environment, there’s no way to really know it’s going to work when disaster strikes. The ideal picture is active/active, but that’s really very difficult to accomplish. Failing that, active/passive with a 6-month or annual switch of A -> P is one way to get good at failing over and get a comfort level that DR will work.

  2. moshe

    Can you please provide plans for handling missed checks as unknown instead of up? During the longer outage after recovery checks were sporadic at best. There was no indication of the issue as monitors showed as up.

    1. Umut Muhaddisoglu Post author

      We’ll definitely be communicating better both with e-mails and also visually so that the status of the system/checks will be more visible.

  3. Marco

    Thank you very much for your service and this post on the blog.

    I am very happy with your service and at the moment can only continue to suggest to use uptimerobot :)

    Thanks again for all. See you!

  4. Alex

    Hey guys, I wanted to use your site for tracking my projects.

    But how can I trust you if your service is down and glitching right now?

    If you cannot promise that this problem will not happen again tomorrow, please refund the money that I paid for the annual PRO.

    1. Umut Muhaddisoglu Post author

      We have already applied multiple changes to our setup and will soon announce a planned maintenance to perform a larger change for a complete redundancy. So, we are definitely improving a lot.

  5. Josh

    Very concerning that there was a half-day outage with no communication to customers – I rely on this service to give me an at-a-glance heartbeat of my client websites; it being unreliable without my knowledge is not acceptable. At the very least put out an e-mail letting us know you are having issues!

    That being said – being able to take a sober look at what happened and then publish it publicly like this is good practice. I am grateful that you took the time to outline what happened, the reasoning, and what you intend to do to fix it. It’s the mark of an ethical company and that’s exactly the kind of company I want to be partnered with. Even though there was a communications breakdown, you handled the situation as quickly as you could and immediately recognized the need for improvements. That is all anybody can ask for, so thank you for your honesty and all the good work you did and will do in the future – Uptime Robot is a great service!

    1. Umut Muhaddisoglu Post author

      We definitely should have communicated better. We have now built a system that will allow us to communicate even if the main system goes down. And thanks very much for the understanding.

  6. Gwin

    That’s fine, you recovered fast from the issue.
    I wish all the Uptime Robot team a happy new year, a year full of joy, clients, data and well… less corrupt hard drives haha

    Love from Brazil <3

  7. Jens

    Thank you so much for this write-up. I always love it when great companies like yours share their mistakes so everyone, including you guys, can learn from it!

  8. Djun

    Hahaha, I thought I’d caused the outage, because it happened just as I hit the “Resend verification email” button.

    Seriously though, I’m very relieved to see that you’ve recovered, and are learning from the situation. Uptimerobot is a simply wonderful service in design and execution.

    Looking forward to hearing more about your steps to make your databases more resilient.

  9. Charles Butcher

    As Stephane said, I appreciate this frank account and I sympathise with your troubles. I’m sure you can use it as a learning experience. The irony is that your stats show how badly my own sites are doing at the moment, so you give me a stick with which to beat my hosting provider. Thank you for the excellent service.

  10. Clint

    As a developer of a SaaS app, I feel your pain. You do all you can to make sure everything is always running smoothly as possible. Especially when it is your business. However as anyone knows, there will be days like this, it happens. I find it best to be upfront and honest. Most people will understand and get over it. The few that don’t, well that just comes with the territory. Been using you guys for years and have no plans to change.

  11. Kağan

    In my eyes, the most functional, modern and uninterrupted service monitoring software. It is software I use in all my projects, and I’m glad you solved the problem. I hope you never run into problems like this again. Best of luck, Mr. Umut and the whole Uptimerobot team. :)

  12. Andy Boundy

    I’m a long-time client and paid user. I wish all providers would let us know (how, why, when) in such detail when there are issues. Your service assists when other services go down – proving that nobody is perfect!

    Only thing I ask for is a way (just a slide button) to turn-off email notifications. I know there’s a maintenance window but I want to stop emails, not monitoring. Would be handy.

    Thanks again – love the service.

    AJB

    1. Umut Muhaddisoglu Post author

      Thanks very much for the understanding.

      We have been preparing for a better structure to minimize such risks and will be switching to it very soon (will be sharing a planned maintenance).

      Regarding notifications, pausing the alert contacts from My Settings is the easiest way. Yet, if there are many alert contacts, this can become hard. We have noted this as a great suggestion.

  13. Bram

    We just went for a paid subscription because of your honesty!

  14. Jason

    This checks out. Nobody died. A very refreshing (honest) explanation.

  15. Glen Cooper

    Wow, nicely worded status update. I totally get… been there, done that… well, not exactly that, but I’ve had many-o-snafus just as nightmarish… nice job on EVERYTHING. UptimeRobot Rocks. You rock.
    Appreciatively signed,
    GlenBot2

  16. Alex Schittko

    Thank you for the detailed and honest account of the events.
    I definitely recommend failover testing on a quarterly basis, as I do currently with my DR environments.
    Might I recommend statuspage.io as an external tool, not tied to any of your inf, that would allow you to communicate issues in real time? I use it for my customers.

    Uptime Robot has been a long standing part of our org, and reasons like this are why we continue.
    Don’t worry about the perfectionists flaming you, I’m sure they’ve had a DR or two go badly as well. Everyone has growing pains, the key is acknowledging them, and using them as learning experiences to not repeat the same mistakes in the future.

    Thank you for your honesty and commitment.
    We’ll probably be upping our license soon thanks to the growth our org is seeing.

    Happy Friday folks!

  17. Christopher Quinn

    This has to be one of the most honest downtime notifications/explanations I have ever seen from a company. You guys rock. You admitted your fault, and you learned from your mistakes, everyone makes them, but not everyone admits to them, especially in the detail that you did.

    Honestly, I could feel your pain when I read this, but you came out triumphant. Keep up the good work, honesty is the best policy, and good karma will come from being so transparent.

  18. Carlton McFarlane

    We have just started evaluating Uptime Robot with a view to using it long term. This post is exactly the kind of thing I needed to read from a service, to convince me to sign up to the pro plans and begin relying on you.

    Honest, humble and open. And clearly caring about the ongoing quality of the service. We will definitely be subscribing for the paid plan and look forward to recommending that others check Uptime Robot out!

    Thank you!

  19. Badar Jamal

    That’s very nice to give a full account of your mistakes, and it is a lesson for the readers as well to take appropriate measures to safeguard their data. Personally, I only consider a backup reliable if I test it by a restore emulation, as @Stephane Lavergne suggested.

  20. Jamie

    As a computer engineer, I loved reading your detailed debrief on the tech problems and solutions. I have been burned on backups before and can only sympathize. I have no idea of the complexity of how you do these tests, etc, but perhaps this might be an idea:

    Run 2 parallel live systems – no replicating – but actually 2 systems. They could even be in 2 data centers on different coasts. Users could pick their prime center (maybe getting one closer to customers, etc etc). All settings would be done on the prime, and only those would get copied to the other system. So they are both checking same basic sites, schedule, notifies.

    But when a user logs in, there would be a tab (like the TV mode) that would allow you to see the other site results. Would be slightly different timings, etc. But a full up redundancy.

    There could even be a pro upgrade where the dashboard would show differentials or other cool data across the geographic centers.

    The good thing would be if one failed, the second (or you could even have more across the globe?) would still be going and users would failover seamlessly to that dash.

    God, I know you probably are tired of unsolicited advice, but I couldn’t help myself LOL ;)

    BTW, just started evaluating your system for my work sites (a half dozen of old stuff we are working to replace plus some new web apps) – man I love your UI and the work flow is terrific. VERY nice job.
    Peace, Jamie

    PS Jezzz! I pressed submit and didn’t see the current year thing LOL. But didn’t lose the novel I write :)

  21. Tom F

    THANK YOU for your detailed discussion of what happened.

    Your honesty and openness are refreshing.

  22. Joel Peeples

    I love the honesty and detailed report! We’ve been a client for many years now and we LOVE the service. I’m especially loving the integration with Microsoft Teams.

