Work in progress
Submitted by Alexis Wilke on Wed, 09/01/2010 - 22:42
I feel sorry for the company, but I can understand that, technically, it is quite a challenge to make such a network run smoothly 24/7.
This post is an attempt to show non-technical people what technical folks like me do to allow millions of people to use one single Internet service, such as Twitter, all at the same time. The concept is similar for all web-based systems: Google, Facebook, Snap! Websites, etc.
Large systems, such as Twitter, use three main parts in their network:

1. a Proxy server, which distributes incoming connections;
2. a set of workers (the real web servers), which handle the requests;
3. a database system, which holds the data.
Although each part can be a bottleneck, in most cases only #2 and #3 generate real problems. If you receive too many connection requests at once, the number of workers will be too small to handle further connections. In the case of Twitter, the database server can be a problem for large accounts (accounts with very many messages or connections) or when too many workers need to access it simultaneously.
As shown in the picture, each time you go to Twitter, your request goes through what we call a Proxy server. This is a fancy technical name for a web server that helps distribute the work among many other real web servers. Real in the sense that they will deal with the real data. Let's say that each worker can handle 3 connections at the same time; a 4th would slow the worker down too much, so we stop at 3. This means that once a worker has been given 3 connections, it tells the Proxy server: « Sorry! I'm busy, check on me later... » At that point, the Proxy server decides which other worker will take over the next few connections. A little later, that worker is also running at capacity and asks the Proxy server to stop sending requests. So again, the Proxy server selects another worker... and it keeps doing so as long as requests arrive and workers are available.
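The dispatching described above can be sketched in a few lines of code. This is a toy illustration, not Twitter's actual software; the names, the limit of 3 connections, and the two workers are just the example from the text.

```python
# A toy sketch of how a Proxy server hands each incoming connection
# to a worker that still has room. Not real Twitter code.
MAX_CONNECTIONS = 3  # each worker slows down too much past this point

class Worker:
    def __init__(self, name):
        self.name = name
        self.active = 0  # connections this worker is currently handling

    def is_busy(self):
        return self.active >= MAX_CONNECTIONS

def dispatch(workers):
    """Return a worker that can take one more connection, or None."""
    for worker in workers:
        if not worker.is_busy():
            worker.active += 1
            return worker
    return None  # every worker said "Sorry! I'm busy..."

workers = [Worker("w1"), Worker("w2")]
assignments = [dispatch(workers) for _ in range(7)]
# With 2 workers and 3 slots each, the first 6 connections get a
# worker; the 7th gets None (nobody is available).
```

Real load balancers use smarter strategies (least connections, weighted round robin...), but the principle is the same: keep handing work to whoever still has room.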
If the Proxy server cannot find a worker that's available, it returns the infamous error "Twitter is over capacity". Note that this means millions of others are still receiving their answers as expected... since the system is still running and fulfilling many requests. Just... not yours. When such errors occur, the Proxy server blocks new requests for a short time to make sure that enough workers get freed up and availability is back to normal before it starts serving you as usual. This is why, in most cases, when you try to reload the page right away, you get the same error message. It is nothing against you, but the system has to continue to work for the people who are currently being served.
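That short "come back later" window can be sketched too. The class and the 5-second cooldown below are hypothetical, just to show why reloading right away gives you the same error:

```python
import time

# A toy sketch (not Twitter's real logic) of the short cooldown:
# after an over-capacity error, the proxy rejects new requests for
# a few seconds so the busy workers can finish what they have.
COOLDOWN_SECONDS = 5.0

class OverCapacityGate:
    def __init__(self):
        self.rejecting_until = 0.0  # timestamp; 0.0 means "accepting"

    def on_over_capacity(self):
        # All workers were busy: start (or extend) the cooldown window.
        self.rejecting_until = time.monotonic() + COOLDOWN_SECONDS

    def accepts_requests(self):
        return time.monotonic() >= self.rejecting_until

gate = OverCapacityGate()
assert gate.accepts_requests()      # normal operation
gate.on_over_capacity()
assert not gate.accepts_requests()  # reloading right away fails too
```

Once the window expires, requests flow again; in HTTP terms this is the idea behind a 503 response with a Retry-After hint.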
The last part is the database system. When your network is properly set up, the Proxy server does not need any direct connection to that part of the system. Instead, it relies on the workers to know the global system status. This does not mean the database is not also a bottleneck; it just means it is handled in a completely different manner, and here are the basics:
When you learn about data handling on a computer, the first thing you are taught is: no duplication. I'm sure Twitter, Google, Facebook, LinkedIn, YouTube and many others do not follow that rule (Snap! Websites is not yet running millions of websites, soon though!, and it does not respect that rule either: it already has 4 to 5 levels of duplication). This kind of data duplication is called caching. Yet, caching or not, Twitter does have a bottleneck at that level too. You can always dream of adding many more workers to handle more connections and be fine with the load; the system will still break if the database cannot be duplicated instantaneously, and chances are it can't.
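A tiny sketch shows why that duplication helps. The keys and values below are made up; the point is that once a copy exists, repeated reads never touch the database again:

```python
# A minimal read-through cache sketch: a duplicated copy of the data
# absorbs repeated reads so the database is hit far less often.
database = {"tweet:1": "Hello world"}  # stand-in for the real database
cache = {}                             # the duplicated (cached) copy
db_reads = 0                           # how often we bothered the database

def fetch(key):
    global db_reads
    if key in cache:          # served from the duplicate, database untouched
        return cache[key]
    db_reads += 1             # only a cache miss reaches the database
    value = database[key]
    cache[key] = value        # keep a copy for next time
    return value

for _ in range(3):
    fetch("tweet:1")
# Three requests, but the database was queried only once.
```

The hard part, hinted at in the text, is keeping all those copies in sync when the data changes; that is where caching stops being simple.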
I can tell you that caching is the best known and most used mechanism, but not the only one available. In the case of Google, they use extreme data distribution by heavily categorizing what they index: each set of servers is assigned a specific category such as sports, science, adult content, hacker websites, Internet marketing, social media... On the front end, where you enter your search, a complex algorithm categorizes your query on the fly. Twitter surely uses both methods. If you are reading this blog, chances are you cannot read Japanese. The fact is that most Japanese users follow and are followed by other Japanese users... so right there you can already heavily distribute Twitter's content. There are many other criteria Twitter can use. For instance, well-known people such as Lady Gaga, Tom Cruise, Arnold Schwarzenegger, Madonna, or Barack Obama tend to get many more followers... and you can optimize for that specific situation.
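This distribution idea can be sketched as simple routing rules. The rules, shard names, and follower threshold below are entirely hypothetical; real systems use far richer signals:

```python
# A toy sketch of distributing users across server groups by category,
# in the spirit of the text: Japanese users mostly interact with other
# Japanese users, and celebrity accounts get special treatment.
shards = {"japanese": [], "celebrity": [], "default": []}

def pick_shard(user):
    # Hypothetical routing rules, for illustration only.
    if user.get("language") == "ja":
        return "japanese"
    if user.get("followers", 0) > 1_000_000:
        return "celebrity"  # the Lady Gaga / Barack Obama case
    return "default"

def store(user):
    shards[pick_shard(user)].append(user["name"])

store({"name": "kenji", "language": "ja"})
store({"name": "superstar", "followers": 40_000_000})
store({"name": "alexis"})
```

Each shard then only has to carry a fraction of the total load, which is exactly what makes the whole system scale.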
So... you may have been wondering why your own website simply goes down when it receives too many hits, while a web service such as Twitter is capable of sending you a page explaining the problem and asking you to try again later. Here you've got the answer. The Proxy server can handle well over 50,000 hits a minute (that's 3 million an hour, or 72 million a day). That's easy for the Proxy server, since it does nearly nothing itself. The workers, however, have to deal with a lot of data and generally cannot be that fast.
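As a quick sanity check on those figures (assuming the 50,000-hits-a-minute rate), the hourly and daily totals follow directly:

```python
# Multiplying out the Proxy server throughput figures.
hits_per_minute = 50_000
per_hour = hits_per_minute * 60   # 3,000,000 hits an hour
per_day = per_hour * 24           # 72,000,000 hits a day
```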
If you think I missed something, click on Add a comment below and ask me for details.
I got that Facebook error many times in the last few days... I have to say, with the Page Like, it's like I'm on Facebook all day!
Just got Twitter is over capacity tonight. It has been slow in the past few days... but I guess this is Saturday night?!