I'm a few days away from hitting 6 years at Canonical and I've ended up doing a lot more management than anything else in that time. Before that I did a solid 8 years at my own company, doing anything from developing, project managing, product managing, engineering managing, sales and accounting.
This time of the year is performance review time at Canonical, so it's gotten me thinking a lot about my role and how my view on engineering management has evolved over the years.
A key insights I've had from a former boss, Elliot Murphy, was viewing it as a support role for others to do their job rather than a follow-the-leader approach. I had heard the phrase "As a manager, I work for you" a few times over the years, but it rarely seemed true and felt mostly like a good concept to make people happy but not really applied in practice in any meaningful way.
Of all the approaches I've taken or seen, a role where you're there to unblock developers more than anything else, I believe is the best one. And unless you're a bit power-hungry on some level, it's probably the most enjoyable way of being a manager.
It's not to be applied blindly, though, I think a few conditions have to be met:
1) The team has to be fairly experienced/senior/smart, I think if it isn't it breaks down to often
2) You need to understand very clearly what needs doing and why, and need to invest heavily and frequently in communicated it to the team, both the global context as well as how it applies to them individually
3) You need to build a relationship of trust with each person and need to trust them, because trust is always a 2-way street
4) You need to be enough of an engineer to understand problems in depth when explained, know when to defer to other's judgments (which should be the common case when the team generally smart and experienced) and be capable of tie-breaking in a technical-savvy way
5) Have anyone who's ego doesn't fit in a small, 100ml container, leave it at home
There are many more things to do, but I think if you don't have those five, everything else is hard to hold together. In general, if the team is smart and experienced, understands what needs doing and why, and like their job, almost everything else self-organizes.
If it isn't self-organizing well enough, walk over those 5 points, one or several must be mis-aligned. More often than not, it's 2). Communication is hard, expensive and more of an art than a science. Most of the times things have seemed to stumble a bit, it's been a failure of how I understood what we should be doing as a team, or a failure on how I communicated it to everyone else as it evolved over time.
Second most frequent I think is 1), but that may vary more depending on your team, company and project.
Oh, and actually caring about people and what you do helps a lot, but that helps a lot in life in general, so do that anyway regardless of you role
Now that all the responsible disclosure processes have been followed through, I’d like to tell everyone a story of my very bad week last week. Don’t worry, it has a happy ending.
Part 1: Exposition
On May 5th we got a support request from a user who observed confusing behaviour in one of our systems. Our support staff immediately escalated it to me and my team sprung into action for what ended up being a 48-hour rollercoaster ride that ended with us reporting upstream to Django a security bug.
The bug, in a nutshell, is that when the following conditions lines up, a system could end up serving a request to one user that was meant for another:
- You are authenticating requests with cookies, OAuth or other authentication mechanisms
- The user is using any version of Internet Explorer or Chromeframe (to be more precise, anything with “MSIE” in the request user agent)
- You (or an ISP in the middle) are caching requests between Django and the internet (except Varnish’s default configuration, for reasons we’ll get to)
- You are serving the same URL with different content to different users
We rarely saw this combination of conditions because users of services provided by Canonical generally have a bias towards not using Internet Explorer, as you’d expect from a company who develops the world’s most used Linux distribution.
Part 2: Rising Action
Now, one may think that the bug is obvious, and wonder how it went unnoticed since 2008, but this really was one was one of those elusive “ninja-bugs” you hear about on the Internet and it took us quite a bit of effort to track it down.
In debugging situations such as this, the first step is generally to figure out how to reproduce the bug. In fact, figuring out how to reproduce it is often the lion’s share of the effort of fixing it. However, no matter how much we tried we could not reproduce it. No matter what we changed, we always got back the right request. This was good, because it ruled out a widespread problem in our systems, but did not get us closer to figuring out the problem.
Putting aside reproducing it for a while, we then moved on to combing very carefully through our code, trying to find any hints of what could be causing this. Several of us looked at it with fresh eyes so we wouldn’t be tainted by having developed or reviewed the code, but we all still came up empty each and every time. Our code seemed perfectly correct.
We then went on to a close examination of all related requests to get new clues to where the problem was hiding. But we had a big challenge with this. As developers we don’t get access to any production information that could identify people. This is good for user privacy, of course, but made it hard to produce useful logs. We invested some effort to work around this while maintaining user privacy by creating a way to anonymise the logs in a way that would still let us find patterns in them. This effort turned up the first real clue.
We use Squid to cache data for each user, so that when they re-request the same data, it’s queued up right in memory and can be quickly served to them without having to recreate the data from the databases and other services. In those anonymized Squid logs, we saw cookie-authenticated requests that didn’t contain an HTTP Vary header at all, where we expected it to have at the very least “Vary: Cookie” to ensure Squid would only serve the correct content all the time. So we then knew what was happening, but not why. We immediately pulled Squid out of the middle to stop this from happening.
Why was Squid not logging Vary headers? There were many possible culprits for this, so we got a *lot* of people were involved searching for the problem. We combed through everything in our frontend stack (Apache, Haproxy and Squid) that could sometimes remove Vary headers.
This was made all the harder because we had not yet fully Juju charmed every service, so could not easily access all configurations and test theories locally. Sometimes technical debt really gets expensive!
After this exhaustive search, we determined that nothing our code removed headers. So we started following the code up to Django middlewares, and went as far as logging the exact headers Django was sending out at the last middleware layer. Still nothing.
Part 3: The Climax
Until we got a break. Logs were still being generated, and eventually a pattern emerged. All the initial requests that had no Vary headers seemed for the most part to be from Internet Explorer. It didn’t make sense that a browser could remove headers that were returned from a server, but knowing this took us to the right place in the Django code, and because Django is open source, there was no friction in inspecting it deeply. That’s when we saw it.
In a function called fix_IE_for_vary, we saw the offending line of code.
We finally found the cause.
It turns out IE 6 and 7 didn’t have the HTTP Vary header implemented fully, so there’s a workaround in Django to remove it for any content that isn’t html or plain text. In hindsight, if Django would of implemented this instead as a middleware, even if default, it would have been more likely that this would have been revised earlier. Hindsight is always 20/20 though, and it easy to sit back and theorise on how things should have been done.
So if you’ve been serving any data that wasn’t html or plain text with a caching layer in the middle that implements Vary header management to-spec (Varnish doesn’t trust it by default, and checks the cookie in the request anyway), you may have improperly returned a request.
Newer versions if Internet Explorer have since fixed this, but who knew in 2008 IE 9 would come 3 years later?
Part 4: Falling Action
We immediately applied a temporary fix to all our running Django instances in Canonical and involved our security team to follow standard responsible disclosure processes. The Canonical security team was now in the driving seat and worked to assign a CVE number and email the Django security contact with details on the bug, how to reproduce it and links to the specific code in the Django tree.
The Django team immediately and professionally acknowledged the bug and began researching possible solutions as well as any other parts of the code where this scenario could occur. There was continuous communication among our teams for the next few days while we agreed on lead times for distributions to receive and prepare the security fix,
Part 5: Resolution
I can’t highlight enough how important it is to follow these well-established processes to make sure we keep the Internet at large a generally safe place.
To summarise, if you’re running Django, please update to the latest security release as quickly as possible, and disable any internal caching until then to minimise the chances of hitting this bug.
If you're running squid and want to check if you could be affected, here's a small python script to run against your logs we put together you can use as a base, you may need to tweak it based on your log format. Be sure to run it only against cookie-authenticated URLs, otherwise you will hit a lot of false positives.