In an incident report published on Friday, Google said that a Google Voice outage affecting a majority of the telephone service’s users earlier this month was caused by expired TLS certificates.
This worldwide outage prevented most Google Voice users from logging into their accounts and using the service for more than four hours between February 15th and February 16th, 2021.
“Google Voice users experienced an issue in which some new inbound or outbound Voice over Internet Protocol (VoIP) calls failed to connect, for a total duration of 4 hours 22 minutes,” the incident report reads.
“Peak impact occurred at approximately 03:00, at which time mitigation efforts began to reduce failure rates.”
During regular operation, voice calls made through Google Voice are controlled using the Session Initiation Protocol (SIP), with client devices immediately retrying their connection to the service once it breaks.
Transport Layer Security (TLS) certificates used to encrypt all Google Voice traffic are also rotated regularly to keep the connections and traffic secure.
Google Voice outage root cause and impact
“Due to an issue with updating certificate configurations, the active certificate in Google Voice frontend systems inadvertently expired at 2021-02-15 23:51:00, triggering the issue,” Google explained.
“During the impact period, any clients attempting to establish or reestablish an SIP connection were unable to do so.”
After the expired certificates triggered the outage, users could not access the Google Voice service to make or receive VoIP calls.
However, client devices that already had an active SIP connection before the incident were unaffected during the outage (as long as the connection was not interrupted).
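The split between failing new connections and surviving existing ones follows from when a certificate is checked: during the handshake, not continuously afterward. The toy model below illustrates that behavior. It is a sketch rather than Google's implementation; the `Frontend` class, its method names, and the client IDs are hypothetical, and UTC is assumed for the expiry timestamp because the time zone is not stated here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Frontend:
    """Toy frontend whose certificate is only checked during a new handshake."""
    cert_not_after: datetime
    sessions: set = field(default_factory=set)

    def handshake(self, client_id: str, now: datetime) -> bool:
        # A new handshake fails once the certificate's validity window has passed.
        if now > self.cert_not_after:
            return False
        self.sessions.add(client_id)
        return True

    def is_connected(self, client_id: str) -> bool:
        # Established sessions are not re-validated in this model.
        return client_id in self.sessions


# Expiry time taken from the incident report; UTC is an assumption.
expiry = datetime(2021, 2, 15, 23, 51, 0, tzinfo=timezone.utc)
frontend = Frontend(cert_not_after=expiry)

before = datetime(2021, 2, 15, 23, 0, 0, tzinfo=timezone.utc)
after = datetime(2021, 2, 16, 0, 30, 0, tzinfo=timezone.utc)

early_ok = frontend.handshake("client-a", before)  # connects before expiry
late_ok = frontend.handshake("client-b", after)    # attempted after expiry
still_up = frontend.is_connected("client-a")       # earlier session persists
```

In this model, `client-a` stays connected while `client-b` cannot get in, which matches the pattern described in the report.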
“After investigating, the engineering team determined that certificate configuration was the root cause,” Google added. “The team generated updated certificates and configuration information and began an emergency rollout of this data to frontend systems.”
Once the mitigation was rolled out, affected Google Voice SIP clients regained functionality after retrying their connection to the service.
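Client-side reconnection of this kind can be sketched as a retry loop. The example below is a generic illustration, not Google Voice's client logic: `connect_with_retry`, its parameters, and the exponential backoff between attempts are all assumptions added for the sketch, and the report does not describe the actual retry schedule.

```python
import time
from typing import Callable


def connect_with_retry(
    connect: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() until it succeeds and return the number of attempts used.

    Raises ConnectionError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if connect():
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. A client using a loop like this would start succeeding on its own once valid certificates reach the frontend.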
Measures to prevent future outages
The Google engineering team is taking several actions designed to prevent a similar issue from occurring again and decrease the impact of future outages.
According to the Google Workspace team, which published the incident report, engineers are taking the following measures:
- Configure additional proactive alerting for upcoming certificate expiration events.
- Configure additional reactive alerting for TLS errors in Google Voice frontend systems.
- Improve automated tooling for certificate rotation and configuration updates.
- Utilize more flexible infrastructure for rapid deployment of configuration changes.
- Update resource allocation systems to more efficiently provision emergency resources during incidents.
- Develop training and practice scenarios for emergency rollouts of Google Voice frontend systems and configurations.
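The first measure in the list above, proactive alerting for upcoming certificate expiration, can be sketched as a check of a certificate's `notAfter` timestamp against a warning window. This is a minimal illustration using Python's standard `ssl` module, not Google's tooling: the function names and the 14-day threshold are assumptions, and the sample timestamp treats the expiry time from the report as GMT, which the report excerpt does not confirm.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    not_after uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Feb 15 23:51:00 2021 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def should_alert(not_after: str, now: datetime, warn_days: float = 14.0) -> bool:
    # Fire when the certificate is inside the warning window or already expired.
    return days_until_expiry(not_after, now) <= warn_days


# Sample value only; GMT is an assumption.
NOT_AFTER = "Feb 15 23:51:00 2021 GMT"
```

In practice, a monitoring job would read `notAfter` from the live certificate (for instance, from the dictionary returned by `getpeercert()`) and run a check like this on a schedule, so an alert fires well before new handshakes begin to fail.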
In December 2020, Google suffered a global authentication system outage that affected most of its consumer-facing services, including Gmail, YouTube, Google Drive, Google Maps, and Google Calendar.
As Google explained later that month, that incident was caused by a bug in the automated quota management system, which blocked users from logging into their accounts and authenticating to Google Cloud services.