Gmail update
Someone (who can self-identify if desired) shared Google's summary of the recent email outages (PDF). This is the outage that caused my address (and many others) to start sending permanent bounce messages.
Background: The Gmail SMTP inbound service uses a configuration system that allows specific service options and flags to be changed while the service is already deployed in production. The "gmail.com" domain name is specified as one of these configuration options. An ongoing migration was in effect to update this underlying configuration system to meet Google internal best practices.
A configuration change during this migration shifted the formatting behavior of a service option so that it incorrectly provided an invalid domain name, instead of the intended "gmail.com" domain name, to the Google SMTP inbound service. As a result, the service incorrectly transformed lookups of certain email addresses ending in "(at)gmail.com" into non-existent email addresses. When the Gmail user accounts service checked each of these non-existent email addresses, the service could not detect a valid user, resulting in SMTP error code 550.
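A minimal sketch of that failure mode, in Python. Everything in it is an assumption made for illustration: the report does not say what the invalid domain value looked like or how the lookup is implemented, so the stray quotes, the user set, and the function names below are all hypothetical.

```python
# Hypothetical illustration only; none of these names or values come from the report.
KNOWN_USERS = {"alice@gmail.com", "bob@gmail.com"}


def canonical_address(local_part: str, configured_domain: str) -> str:
    """Rebuild the lookup address from the local part and the configured domain."""
    return f"{local_part}@{configured_domain}"


def smtp_rcpt_status(recipient: str, configured_domain: str) -> int:
    """Return 250 if the rewritten address maps to a known user, else 550."""
    local_part, _, _ = recipient.partition("@")
    lookup = canonical_address(local_part, configured_domain)
    return 250 if lookup in KNOWN_USERS else 550


# Correct option value: the address resolves to a real user.
ok = smtp_rcpt_status("alice@gmail.com", "gmail.com")
# Assumed misformatted option value (stray quotes): the same address now looks nonexistent.
broken = smtp_rcpt_status("alice@gmail.com", '"gmail.com"')
```

With the correct value, the lookup key is `alice@gmail.com` and the sketch returns 250. With the misformatted value, the key no longer matches any user and the same valid recipient gets a permanent 550.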
[...]
To guard against the issue recurring and to reduce the impact of similar events, we are taking the following actions:
- Update the existing configuration difference tests to detect unexpected changes to the SMTP service configuration before applying the change.
- Improve internal service logging to allow more accurate and faster diagnosis of similar types of errors.
- Implement additional restrictions on configuration changes that may affect production resources globally.
- Improve static analysis tooling for configuration differences to more accurately project differences in production behavior.
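The first action on that list, a configuration difference test that runs before a change is applied, might look roughly like the Python sketch below. It is not Google's tooling: the `inbound_domain` key, the `check_change` helper, and the domain pattern are all hypothetical, chosen only to show the idea of flagging unexpected differences and validating the resulting value.

```python
import re

# Hypothetical validity rule: lowercase dot-separated labels ending in an alphabetic TLD.
_DOMAIN_RE = re.compile(r"([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}")


def diff_config(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


def check_change(old: dict, new: dict, expected_keys: set) -> list:
    """List problems that should block the change from being applied."""
    problems = []
    for key, (before, after) in sorted(diff_config(old, new).items()):
        if key not in expected_keys:
            problems.append(f"unexpected change to {key!r}: {before!r} -> {after!r}")
    # "inbound_domain" is an assumed option name, not one taken from the report.
    domain = new.get("inbound_domain")
    if not isinstance(domain, str) or not _DOMAIN_RE.fullmatch(domain):
        problems.append(f"invalid inbound_domain: {domain!r}")
    return problems


old = {"inbound_domain": "gmail.com", "max_connections": 100}
new = {"inbound_domain": '"gmail.com"', "max_connections": 100}
# A migration that is supposed to change nothing passes an empty expected_keys set.
problems = check_change(old, new, expected_keys=set())
```

Here `problems` contains two entries: one because `inbound_domain` changed when no change was expected, and one because the new value is not a valid domain name. Either would be enough to stop the rollout in this sketch.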
Ouch.
Fixing things in production systems is hard. I've been there; things can go wrong, sometimes badly wrong. I'm used to thinking of Google as having near-infinite resources, including a replica of their production system to test changes on. Perhaps that's unrealistic.
no subject
So it was an internal error after all, one that made the email addresses appear to be nonexistent. That's the kind of simple but devastating mistake every developer lives in fear of making. It's surprising that their tests didn't catch it.
no subject
no subject
Gah, fat-fingered that -- I thought I'd copied the link from the post rather than the URL from my browser tab. Fixed now; thanks for letting me know.
I'm surprised they haven't run into this (and thus added a test case) before now too, but every team and every developer has a first time for head-smacking "should have caught that" errors, so I guess it was theirs.
no subject
(But then we don't hear about the 99.9999% of the time Gmail doesn't fall over in a big heap.)
no subject
Really not -- that's simply best practice nowadays for responsible enterprises, and I'd be surprised if Google doesn't have one.
The implication here (assuming they do have an acceptance-testing environment) is that either that replica isn't sufficiently accurate (a technical problem), or that somebody just made the change in production instead of trying it there first (a process one), or that they did make the change in the test environment but then didn't test it deeply enough (also a process one).