Now that we've open sourced the code for Ubuntu One filesync, I thought I'd highlight some of the interesting challenges we had while building and scaling the service to several million users.
The teams that built the service were roughly split into two: the foundations team, which was responsible for the lowest levels of the service (storage and retrieval of files, data model, client and server protocol for syncing), and the web team, focused on user-visible services (website to manage files, photos, music streaming, contacts and Android/iOS equivalent clients).
I joined the web team early on and stayed with it until we shut it down, so that's where a lot of my stories will be focused.
Today I'm going to focus on the challenge we faced when launching the Photos and Music streaming services. Given that by the time we launched them we had a few years of experience serving files at scale, our challenge turned out to be presenting and manipulating the metadata quickly for each user, and being able to show the data in appealing ways (showing music by artist or genre, and searching, for example). Photos was a similar story: people tended to have many thousands of photos and songs, and we needed to extract metadata, parse it, store it and then present it back to users quickly in different ways. Easy, right? It is, until a certain scale.
Our architecture for storing metadata at the time was about 8 PostgreSQL master databases across which we sharded metadata (essentially, your metadata lived on a different DB server depending on your user id), plus at least one read-only slave per shard. These were really beefy servers with a truckload of CPUs, more than 128GB of RAM and very fast disks (when reading this, remember this was 2009-2013; hardware specs seem tiny as time goes by!). However, no matter how big these DB servers got, given how busy they were and how much metadata was stored (for years we didn't delete any metadata, so for every change to every file we duplicated the metadata), after a certain time we couldn't get a simple listing of a user's photos or songs (essentially, some of their files filtered by mimetype) in a reasonable time-frame (less than 5 seconds). As it grew we added caches and indexes and optimized queries and code paths, but we quickly hit a performance wall that left us no choice but a much-feared major architectural change. I say much-feared because major architectural changes come with a lot of risk to running services that have low tolerance for outages or data loss; whenever you change something that's already running in a significant way, you're basically throwing out most of your previous optimizations. On top of that, as users we expect things to be fast; we take it for granted. A 5-person team spending 6 months to make things work the way you expect them to isn't really something you can brag about in the middle of a race with many other companies to capture a growing market.
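To make the sharding described above a bit more concrete, here is a minimal sketch of picking a database server from a user id. The modulo scheme, the host names and the `dsn_for_user` helper are all assumptions for illustration; the post only says that the shard depended on the user id.

```python
NUM_SHARDS = 8  # the post mentions "about 8" master databases


def shard_for_user(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the shard assumed to hold this user's metadata."""
    # Assumption: a plain modulo; the real mapping is not described in the post.
    return user_id % num_shards


def dsn_for_user(user_id: int, read_only: bool = False) -> str:
    """Build a hypothetical connection string for the user's shard."""
    shard = shard_for_user(user_id)
    # Reads that tolerate a little lag could go to the read-only replica.
    role = "replica" if read_only else "master"
    return f"postgresql://metadata-shard{shard}-{role}.example/filesync"
```

With a scheme like this, every query for one user lands on one server, which is why a single user's listing got slow once that server was busy enough.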
In the time since we had started the project, NoSQL had taken off and matured enough to be a viable alternative to SQL, and it seemed to fit many of our use cases much better (webscale!). After some research and prototyping, we decided to generate pre-computed views of each user's data in a NoSQL DB (Cassandra), and we decided to do that by extending our existing architecture instead of revamping it completely. Given that our code was pretty well organized into proper layers of responsibility, we hooked into its lowest layer (database transactions) an async process that would send messages to a queue whenever new data was written or modified. This meant essentially duplicating the metadata we stored for each user, but trading storage for computing is usually a good trade-off to make, both in cost and performance. So now we had a firehose queue of every change that went on in the system, and we could build a separate piece of infrastructure whose only focus would be to provide per-user metadata *fast* for any type of file, so we could build interesting and flexible user interfaces for people to consume their own content. The stated internal goals were: 1) Fast responses (under 1 second), 2) Less than 10 seconds between user action and UI update and 3) Complete isolation from existing infrastructure.
Here's a rough diagram of how the information flowed through the system:
It's a little bit scary when looking at it like that, but in essence it was pretty simple: write each relevant change that happened in the system to a temporary table in PG in the same transaction that it's written to the permanent table. That way you get, for free, transactional guarantees that you won't lose any data on that layer, and you benefit from PG's built-in cache, which keeps recently added records cheaply accessible.
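The idea of writing the change record in the same transaction as the permanent row can be sketched as follows. This is not the code under src/backends/txlog/; it uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere, and the table and column names are made up for illustration.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway database with a permanent table and a change-log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file_metadata (user_id INTEGER, path TEXT, mimetype TEXT)")
    conn.execute("CREATE TABLE txlog (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
    conn.commit()
    return conn


def save_file(conn, user_id, path, mimetype, fail=False):
    """Write the permanent row and its change record in one transaction."""
    change = json.dumps({"user_id": user_id, "path": path, "mimetype": mimetype})
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO file_metadata VALUES (?, ?, ?)", (user_id, path, mimetype))
        if fail:
            # Simulate a crash between the two writes.
            raise RuntimeError("simulated failure mid-transaction")
        conn.execute("INSERT INTO txlog (payload) VALUES (?)", (change,))


def count(conn, table):
    """Return the number of rows in one of the two known tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Because both inserts share a transaction, a failure between them leaves neither row behind, so the change log can never disagree with the permanent table.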
Then we built a bunch of workers that looked through those rows, parsed them and sent them to a persistent queue in RabbitMQ; once a worker got confirmation that a row had been queued, it would delete that row from the temporary PG table.
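A rough sketch of that relay loop is below. The worker code itself is not public, so everything here is an assumption: SQLite stands in for PostgreSQL, and `InMemoryQueue` is a hypothetical stand-in for a RabbitMQ queue whose `publish` returns True only once the broker has confirmed the message.

```python
import sqlite3


class InMemoryQueue:
    """Hypothetical stand-in for a persistent broker queue."""

    def __init__(self):
        self.messages = []
        self.accepting = True  # flip to False to simulate a broker outage

    def publish(self, body: str) -> bool:
        """Return True only when the message has been accepted."""
        if not self.accepting:
            return False
        self.messages.append(body)
        return True


def relay_once(conn, queue, batch_size=100) -> int:
    """Move up to batch_size rows from txlog to the queue; return how many moved."""
    rows = conn.execute(
        "SELECT id, payload FROM txlog ORDER BY id LIMIT ?", (batch_size,)
    ).fetchall()
    relayed = 0
    for row_id, payload in rows:
        if not queue.publish(payload):
            break  # not confirmed: leave the row in place and retry on the next pass
        with conn:
            # Delete only after the queue has confirmed the message.
            conn.execute("DELETE FROM txlog WHERE id = ?", (row_id,))
        relayed += 1
    return relayed
```

Deleting only after confirmation means a crash between the two steps can deliver a change twice but never drop it, so consumers in a design like this need to tolerate duplicates.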
Following that, we took advantage of Rabbit's queue exchange features to build different types of workers that processed the data differently depending on what it was (music was stored differently than photos, for example).
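As an illustration of type-based routing, here is a small sketch in plain Python, with no broker involved. The routing keys, the handler functions and the shape of the change records are all hypothetical, and plain dictionaries stand in for the pre-computed per-user views.

```python
from collections import defaultdict


def routing_key(change):
    """Map a change to a routing key based on the major part of its mimetype."""
    major = change["mimetype"].split("/", 1)[0]
    return {"audio": "music", "image": "photos"}.get(major, "other")


# Per-user pre-computed views; dicts stand in for the NoSQL store.
music_by_artist = defaultdict(lambda: defaultdict(list))
photos = defaultdict(list)


def handle_music(change):
    """Group a user's songs by artist so a listing is a single lookup."""
    artist = change.get("artist", "Unknown")
    music_by_artist[change["user_id"]][artist].append(change["path"])


def handle_photo(change):
    """Keep a flat per-user list of photo paths."""
    photos[change["user_id"]].append(change["path"])


HANDLERS = {"music": handle_music, "photos": handle_photo}


def dispatch(change):
    """Send the change to the handler for its routing key, if there is one."""
    key = routing_key(change)
    handler = HANDLERS.get(key)
    if handler is not None:
        handler(change)
    return key
```

Each handler is free to store its data in whatever shape suits the user interface it feeds, which is the point of having separate worker types.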
Once we completed all of this, accessing someone's photos was a quick and predictable read operation that would give us all their data back in an easy-to-parse format that would fit in memory. Eventually we moved all the metadata accessed from the website and REST APIs to these new pre-computed views, and the result was a significant reduction in load on the main DB servers, along with predictable sub-second request times for all types of metadata in a horizontally scalable system (just add more workers and Cassandra nodes).
All in all, it took about 6 months end-to-end, which included a prototype phase that used memcache as a key/value store.
You can see the code that wrote and read from the temporary PG table if you branch the code and look under: src/backends/txlog/
The worker code, as well as the web UI, is still not available, but it will be in the future once we finish cleaning it up. I decided to write this up and publish it now because I believe the value is more in the architecture than in the code itself.
I'm a few days away from hitting 6 years at Canonical and I've ended up doing a lot more management than anything else in that time. Before that I did a solid 8 years at my own company, doing anything from developing, project managing, product managing, engineering managing, sales and accounting.
This time of the year is performance review time at Canonical, so it's gotten me thinking a lot about my role and how my view on engineering management has evolved over the years.
A key insight I got from a former boss, Elliot Murphy, was viewing it as a support role for others to do their job rather than a follow-the-leader approach. I had heard the phrase "As a manager, I work for you" a few times over the years, but it rarely seemed true and felt mostly like a good concept to make people happy but not really applied in practice in any meaningful way.
Of all the approaches I've taken or seen, I believe the best one is a role where you're there to unblock developers more than anything else. And unless you're a bit power-hungry on some level, it's probably the most enjoyable way of being a manager.
It's not to be applied blindly, though. I think a few conditions have to be met:
1) The team has to be fairly experienced/senior/smart; I think if it isn't, it breaks down too often
2) You need to understand very clearly what needs doing and why, and need to invest heavily and frequently in communicating it to the team, both the global context and how it applies to them individually
3) You need to build a relationship of trust with each person and need to trust them, because trust is always a 2-way street
4) You need to be enough of an engineer to understand problems in depth when explained, know when to defer to others' judgment (which should be the common case when the team is generally smart and experienced) and be capable of tie-breaking in a technically savvy way
5) Have anyone whose ego doesn't fit in a small, 100ml container leave it at home
There are many more things to do, but I think if you don't have those five, everything else is hard to hold together. In general, if the team is smart and experienced, understands what needs doing and why, and likes its job, almost everything else self-organizes.
If it isn't self-organizing well enough, walk through those 5 points; one or several must be misaligned. More often than not, it's 2). Communication is hard, expensive and more of an art than a science. Most of the times things have seemed to stumble a bit, it's been a failure in how I understood what we should be doing as a team, or a failure in how I communicated it to everyone else as it evolved over time.
Second most frequent I think is 1), but that may vary more depending on your team, company and project.
Oh, and actually caring about people and what you do helps a lot, but that helps a lot in life in general, so do that anyway regardless of your role.
Now that all the responsible disclosure processes have been followed through, I’d like to tell everyone a story of my very bad week last week. Don’t worry, it has a happy ending.
Part 1: Exposition
On May 5th we got a support request from a user who observed confusing behaviour in one of our systems. Our support staff immediately escalated it to me and my team sprang into action for what ended up being a 48-hour rollercoaster ride that ended with us reporting a security bug upstream to Django.
The bug, in a nutshell, is that when the following conditions line up, a system could end up serving a request to one user that was meant for another:
- You are authenticating requests with cookies, OAuth or other authentication mechanisms
- The user is using any version of Internet Explorer or Chromeframe (to be more precise, anything with “MSIE” in the request user agent)
- You (or an ISP in the middle) are caching requests between Django and the internet (except Varnish’s default configuration, for reasons we’ll get to)
- You are serving the same URL with different content to different users
We rarely saw this combination of conditions because users of services provided by Canonical generally have a bias towards not using Internet Explorer, as you’d expect from a company that develops the world’s most used Linux distribution.
Part 2: Rising Action
Now, one may think that the bug is obvious, and wonder how it went unnoticed since 2008, but this really was one of those elusive “ninja-bugs” you hear about on the Internet and it took us quite a bit of effort to track it down.
In debugging situations such as this, the first step is generally to figure out how to reproduce the bug. In fact, figuring out how to reproduce it is often the lion’s share of the effort of fixing it. However, no matter how much we tried we could not reproduce it. No matter what we changed, we always got back the right request. This was good, because it ruled out a widespread problem in our systems, but did not get us closer to figuring out the problem.
Putting aside reproducing it for a while, we then moved on to combing very carefully through our code, trying to find any hints of what could be causing this. Several of us looked at it with fresh eyes so we wouldn’t be tainted by having developed or reviewed the code, but we all still came up empty each and every time. Our code seemed perfectly correct.
We then went on to a close examination of all related requests to get new clues to where the problem was hiding. But we had a big challenge with this. As developers we don’t get access to any production information that could identify people. This is good for user privacy, of course, but made it hard to produce useful logs. We invested some effort to work around this while maintaining user privacy by creating a way to anonymise the logs in a way that would still let us find patterns in them. This effort turned up the first real clue.
We use Squid to cache data for each user, so that when they re-request the same data, it’s queued up right in memory and can be quickly served to them without having to recreate the data from the databases and other services. In those anonymised Squid logs, we saw cookie-authenticated requests whose responses didn’t contain an HTTP Vary header at all, where we expected them to have at the very least “Vary: Cookie” to ensure Squid would only serve the correct content all the time. So we then knew what was happening, but not why. We immediately pulled Squid out of the middle to stop this from happening.
Why was Squid not logging Vary headers? There were many possible culprits for this, so we got a *lot* of people involved in searching for the problem. We combed through everything in our frontend stack (Apache, Haproxy and Squid) that could sometimes remove Vary headers.
This was made all the harder because we had not yet fully Juju charmed every service, so we could not easily access all configurations and test theories locally. Sometimes technical debt really gets expensive!
After this exhaustive search, we determined that nothing in our code removed headers. So we started following the code up to Django middlewares, and went as far as logging the exact headers Django was sending out at the last middleware layer. Still nothing.
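To see why a missing “Vary: Cookie” matters, here is a toy model of a shared cache. It is a deliberately simplified sketch, not how Squid is implemented: the `TinyCache` class and the `make_origin` helper are invented for illustration, and real caches also take Cache-Control, expiry and much more into account.

```python
class TinyCache:
    """Toy shared cache keyed on the URL plus the request headers named in Vary."""

    def __init__(self):
        self._vary = {}   # url -> header names listed in the stored response's Vary
        self._store = {}  # (url, values of those headers) -> cached body

    def _key(self, url, request_headers, vary_names):
        return (url, tuple(request_headers.get(name, "") for name in vary_names))

    def fetch(self, url, request_headers, origin):
        """Serve from cache when possible, otherwise ask the origin and store the reply."""
        vary_names = self._vary.get(url)
        if vary_names is not None:
            key = self._key(url, request_headers, vary_names)
            if key in self._store:
                return self._store[key]
        body, response_headers = origin(url, request_headers)
        vary = response_headers.get("Vary", "")
        vary_names = tuple(n.strip().lower() for n in vary.split(",") if n.strip())
        self._vary[url] = vary_names
        self._store[self._key(url, request_headers, vary_names)] = body
        return body


def make_origin(send_vary):
    """Build a fake backend that returns per-user content for the same URL."""
    def origin(url, request_headers):
        body = "files of " + request_headers["cookie"]
        headers = {"Vary": "Cookie"} if send_vary else {}
        return body, headers
    return origin
```

With `make_origin(True)`, two users with different cookies get separate cache entries for the same URL. With `make_origin(False)`, the cache has nothing but the URL to key on, so the second user is handed the first user's cached body.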
Part 3: The Climax
Until we got a break. Logs were still being generated, and eventually a pattern emerged. All the initial requests that had no Vary headers seemed for the most part to be from Internet Explorer. It didn’t make sense that a browser could remove headers that were returned from a server, but knowing this took us to the right place in the Django code, and because Django is open source, there was no friction in inspecting it deeply. That’s when we saw it.
In a function called fix_IE_for_vary, we saw the offending line of code.
We finally found the cause.
It turns out IE 6 and 7 didn’t have the HTTP Vary header implemented fully, so there’s a workaround in Django to remove it for any content that isn’t html or plain text. In hindsight, if Django had implemented this as a middleware instead, even one enabled by default, it would have been more likely to be revisited earlier. Hindsight is always 20/20 though, and it is easy to sit back and theorise on how things should have been done.
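The behaviour described above can be sketched in a few lines. This is an illustration written from the description in this post, not a copy of Django's fix_IE_for_vary; the function name `strip_vary_for_ie`, the use of plain dictionaries for headers and the exact list of safe types are assumptions.

```python
# Content types for which the Vary header is left alone (per the description above).
SAFE_MIME_TYPES = ("text/html", "text/plain")


def strip_vary_for_ie(user_agent: str, response_headers: dict) -> dict:
    """Return a copy of the headers, minus Vary when the IE workaround applies."""
    headers = dict(response_headers)
    if "MSIE" not in user_agent.upper():
        return headers  # not Internet Explorer: nothing to do
    # Only the MIME type matters; drop parameters such as the charset.
    mime_type = headers.get("Content-Type", "").partition(";")[0].strip()
    if mime_type not in SAFE_MIME_TYPES:
        # This is the step that leaves a downstream cache without "Vary: Cookie".
        headers.pop("Vary", None)
    return headers
```

For an IE user agent and an `application/json` response the Vary header disappears, while the same response sent to another browser keeps it. That matches the pattern in our logs: the first request to populate the cache without a Vary header tended to come from Internet Explorer.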
So if you’ve been serving any data that wasn’t html or plain text with a caching layer in the middle that implements Vary header management to-spec (Varnish doesn’t trust it by default, and checks the cookie in the request anyway), you may have served one user’s content to another.
Newer versions of Internet Explorer have since fixed this, but who knew in 2008 that IE 9 would come 3 years later?
Part 4: Falling Action
We immediately applied a temporary fix to all our running Django instances in Canonical and involved our security team to follow standard responsible disclosure processes. The Canonical security team was now in the driving seat and worked to assign a CVE number and email the Django security contact with details on the bug, how to reproduce it and links to the specific code in the Django tree.
The Django team immediately and professionally acknowledged the bug and began researching possible solutions as well as any other parts of the code where this scenario could occur. There was continuous communication among our teams for the next few days while we agreed on lead times for distributions to receive and prepare the security fix.
Part 5: Resolution
I can’t highlight enough how important it is to follow these well-established processes to make sure we keep the Internet at large a generally safe place.
To summarise, if you’re running Django, please update to the latest security release as quickly as possible, and disable any internal caching until then to minimise the chances of hitting this bug.
If you're running Squid and want to check if you could be affected, here's a small Python script we put together to run against your logs. You can use it as a base, though you may need to tweak it for your log format. Be sure to run it only against cookie-authenticated URLs, otherwise you will hit a lot of false positives.
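A minimal sketch of such a check follows; it is not the original script. It assumes you have configured Squid to write one tab-separated line per request containing the URL, the request's User-Agent and the response's Vary header ("-" when absent). That layout is an assumption, so adapt the parsing to whatever your logs actually contain.

```python
def find_missing_vary(lines):
    """Yield (url, user_agent) for log lines whose response had no Vary header."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that don't match the assumed layout
        url, user_agent, vary = fields
        if vary.strip() in ("", "-"):
            yield url, user_agent
```

You could feed it an open log file, for example `find_missing_vary(open("access.log"))`, and look for cookie-authenticated URLs in the output, paying particular attention to user agents containing "MSIE".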