
Erlang, Yaws, and the deadly Tornado

Good things sometimes happen to the open source community. Since Facebook's acquisition of FriendFeed, a number of technologies have been released into the wild, including, most notably, the Tornado web server written in Python. Tornado is touted as «a scalable, non-blocking web server and web framework». See the Wikipedia article http://en.wikipedia.org/wiki/Tornado_HTTP_Server for some details on the performance of that server, as well as a comparison with other web servers.

Here's the chart, taken from Wikipedia:

Performance on AMD Opteron, 2.4GHz, 4 Cores

Server      Setup                        Requests per Second
Tornado     nginx, 4 frontends           8213
Tornado     1 single-threaded frontend   3353
Django      Apache/mod_wsgi              2223
web.py      Apache/mod_wsgi              2066
CherryPy    standalone                   785

The numbers looked interesting, so I decided to benchmark Tornado myself to check out how it fares against some Erlang tools. Keep in mind that the Erlang runtime itself is not the fastest beast in the woods. It is generally considered slower than many other interpreted languages (including Python), especially on file operations (due to the complexities of the io library doing most of the heavy lifting). However, network I/O, message passing and [green] process spawning are quite fast, so people use Erlang quite extensively (comparatively speaking) as a web backend. Facebook itself uses Erlang for the Facebook Chat application:
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space "processes", share-nothing message-passing semantics, built-in distribution, and a "crash and recover" philosophy proven by two decades of deployment on large soft-realtime production systems.

There are a few web servers for Erlang VM, notably Yaws and Mochiweb. Yaws is positioned as the most general purpose (and most mature) web server, resembling Apache of imperative world. Mochiweb, in turn, is mostly a special purpose embedded web server (though Yaws can be embedded too).

Here's a nice comparison of Yaws, Mochiweb and Nginx: http://www.joeandmotorboat.com/2009/01/03/nginx-vs-yaws-vs-mochiweb-web-server-performance-deathmatch-part-2/

Since I know Yaws performance very well (several thousand requests per second on modern hardware, generally a very competitive piece of software), I was interested in comparing it to Tornado using some sort of a stress test.

But soon I realized that I also wanted to measure some baseline Erlang performance. Yaws does a bit of heavy lifting under the hood, which is not always valuable, especially in an embedded environment. We can do better. So, I sat down today at the Specialty's and implemented a small web server from scratch, using Erlang's newly documented http packet filter. Its name is Yucan (it does not mean anything).

So, meet Yucan. Here's the front of the web server: a central TCP acceptor loop. See how easy it is to spawn a process per connection:

tcpAcceptor(Srv, ListeningSocket) ->
        case gen_tcp:accept(ListeningSocket) of
                {ok, Sock} ->
                        % A fresh process per connection. It waits for
                        % permission before touching the socket, so the
                        % acceptor can transfer socket ownership first.
                        Pid = spawn(fun () ->
                                receive permission ->
                                        inet:setopts(Sock, [
                                                {packet, http_bin},
                                                {active, true}
                                        ])
                                after 60000 -> timeout
                                end,
                                collectHttpHeaders(Srv, Sock,
                                        tstamp() + ?HTTP_HDR_RCV_TMO, [])
                        end),
                        gen_tcp:controlling_process(Sock, Pid),
                        Pid ! permission,
                        tcpAcceptor(Srv, ListeningSocket);
                {error, econnaborted} ->
                        tcpAcceptor(Srv, ListeningSocket);
                {error, closed} -> finished;
                Msg ->
                        error_logger:error_msg("Acceptor died: ~p~n", [Msg])
        end.

Here's Yucan's request header assembler, using the convenient http packet filter provided by Erlang:

collectHttpHeaders(Srv, Sock, UntilTS, Headers) ->
  Timeout = UntilTS - tstamp(),
  receive
    % Add this next header into the pile of already received headers
    {http, Sock, {http_header, _Length, Key, undefined, Value}} ->
        collectHttpHeaders(Srv, Sock, UntilTS,
                [{header, {Key, Value}} | Headers]);

    {http, Sock, {http_request, Method, Path, HTTPVersion}} ->
        collectHttpHeaders(Srv, Sock, UntilTS,
                [{http_request, decode_method(Method), Path, HTTPVersion}
                        | Headers]);

    % End of headers: disable the http packet filter and dispatch
    {http, Sock, http_eoh} ->
        inet:setopts(Sock, [{active, false}, {packet, 0}]),
        reply(Sock, lists:reverse(Headers),
                fun(Hdrs) -> dispatch_http_request(Srv, Hdrs) end);

    {tcp_closed, Sock} -> nevermind;

    Msg -> io:format("Invalid message received: ~p~nAfter: ~p~n",
                [Msg, lists:reverse(Headers)])
  after Timeout ->
        reply(Sock, Headers,
                fun(_) -> [{status, 408, "Request Timeout"},
                        {header, {<<"Content-Type">>, <<"text/html">>}},
                        {html, "<html><title>Request timeout</title>"
                                "<body><h1>Request timeout</h1></body></html>"}]
                end)
  end.

I also wanted to get a feeling for the effect of the TCP listening backlog on that web server, so I ran a number of tests for different backlogs: 1, 5, 128, 256. And, for the sake of completeness, I also intended to run the stress tests against single-threaded and SMP-enabled Erlang VM configurations.
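For reference, the backlog is dialed in at listen time; this is a sketch only, with the port number and the other socket options being illustrative rather than Yucan's actual listener code:

```erlang
%% Sketch: the backlog is an option of gen_tcp:listen/2.
%% Port and other options here are illustrative.
{ok, ListeningSocket} = gen_tcp:listen(8080, [
        binary,
        {backlog, 128},     % gen_tcp's default is 5; 1, 5, 128, 256 tested below
        {reuseaddr, true},
        {active, false}
]).
```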

For a testing engine, I drafted a Perl wrapper around the old httperf utility, which throws 1000, 2000, …, 10000 requests per second at a web site a number of times, averages the data, captures error rates, and saves the result into a CSV for colorful graphing. There's nothing fancy about this Perl wrapper; here it is.
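The wrapper essentially loops over httperf invocations along these lines (the host name and exact flag values are illustrative, not copied from the script):

```shell
# One step of the sweep; the wrapper varies --rate from 1000 to 10000,
# repeats each rate several times, and averages the results.
httperf --server testhost --port 8080 --uri / \
        --rate 3000 --num-conns 30000 --timeout 5
```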

The test bed was a 4-core 2.5GHz Xeon L5420 running the web server, and another such system as the source of requests. FreeBSD-7.2, Erlang R13B01. HiPE did not make a noticeable difference; see my email to erlang-questions.

Here are the graphs for the different TCP listening backlog and SMP/non-SMP variables. They show a backlog of 128 entries as the sweet spot irrespective of SMP mode. Incidentally, the Tornado web server also uses a backlog of 128 by default. Yaws uses 5, which is Erlang gen_tcp's default value.

[Graphs elided: Non-SMP Yucan (1 core) vs. SMP Yucan (4 cores), one pair per backlog value, with side notes.]

Backlog 1: about 3k requests per second for non-SMP, and a surprising 2k RPS for SMP. Not good, but understandable. The red stuff means errors, normalized; the red value should be as close to zero as possible (100 on this scale means 1% of requests never finished or finished badly).

Backlog 5: almost 3k requests per second for the non-SMP system, and a satisfactory almost 4k for the SMP system.

Backlog 128: the best backlog value! A tiny bit better than before on the non-SMP front, and better by a great margin in the SMP configuration: 8k RPS for sure, maybe even an honest 8500.

Backlog 256: clearly, more is not always better. A 256-entry TCP listening backlog hurts performance noticeably in both SMP and non-SMP systems. But we can state 3k/8k requests per second anyway.

Now, since we see that a TCP listening backlog of 128 is a sweet spot at least for Yucan, and is also the default setting for Tornado, let's fix the backlog at 128. First, let's compare Yucan and Yaws side by side:

Oh, my dear! What the hell is that? Whereas Yucan runs close to 8500 requests per second on 4 cores, Yaws does only 2k, maybe 2.5k per second on the same SMP system! This can be explained to a degree by the fact that I used the production configuration for Yaws, with a custom #arg rewriter which adds a bit to the running time. Also, Yaws itself is not the simplest piece of code, and perhaps has accumulated some inefficiencies over time which prevent it from scoring well against the 180 lines of Yucan.

But anyway, Yaws' 2k RPS is for a production configuration, not just a tiny benchmark.

Let's move on to the Tornado web server test, which is clearly a tiny benchmark (see http://www.tornadoweb.org/; I just copied those 15 lines of code off that page and used them). We switch Yucan to non-SMP mode to compare apples to apples.

The good part: in a single-threaded configuration (listed as 3.3k RPS on the AMD 2.4 GHz), Tornado showed 4k RPS on my 2.5 GHz Xeon. That is clearly faster than Yucan's 3.5k RPS in the same single-threaded configuration.

The bad part: Tornado is touted as a scalable thing, but it appears to require an nginx load balancer in front of a farm of independent Tornado processes (each will mostly end up running on its own core) to show its scalability. This has a clear disadvantage in communication: in order to exchange data between these independent processes, a Tornado application will have to use some form of IPC (Thrift, JSON, XML-RPC, etc). Yucan, being Erlang, proves to be much better in this respect: it scales up to 8k by just giving the Erlang VM the -smp enable flag. That's it: no complex setup, just a flag, and no changes to the application whatsoever. Yucan was written with at least two contention points: the TCP acceptor and a dispatcher lookup table process. And nevertheless it scaled well, because Erlang found opportunities for parallelization even in that code.
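For the record, the flag in question looks like this; the module and start function names are placeholders for illustration, not Yucan's actual entry points:

```shell
# Single-scheduler vs. SMP run of the same unmodified application
# (yucan/start are hypothetical names):
erl -smp disable -noshell -s yucan start
erl -smp enable  -noshell -s yucan start
```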

The deadly part: Tornado has funneled under load!

At some point while httperf was doing a 6000 requests per second test round, the Tornado web server died with the following diagnostics:
ERROR:root:Exception in I/O handler for fd 5
Traceback (most recent call last):
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 189, in start
    self._handlers[fd](fd, events)
  File "/home/vlm/tornado-0.2/tornado/httpserver.py", line 94, in _handle_events
    connection, address = self._socket.accept()
  File "/usr/local/lib/python2.6/socket.py", line 195, in accept
    sock, addr = self._sock.accept()
error: [Errno 53] Software caused connection abort
Traceback (most recent call last):
  File "./ws.py", line 18, in <module>
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 173, in start
    event_pairs = self._impl.poll(poll_timeout)
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 340, in poll
    self.read_fds, self.write_fds, self.error_fds, timeout)
ValueError: filedescriptor out of range in select()
[vlm@yucan ~/tornado-0.2]$

Neither Yucan nor Yaws allowed themselves such a liberty. Yes, even in Erlang certain things (in isolated processes) can go wrong, but Erlang is specifically designed to be resilient to programming failures by adopting share-nothing semantics, message passing, process linking and supervision, and other nice concepts. Taken together, these things greatly simplify the programmer's life, while the Erlang VM produces more than acceptable out-of-the-box performance on real-life tasks.

So, here we are. The data are open to further interpretation.

Update: Roberto Ostinelli has contacted me asking to perform the same set of tests against the trunk version of misultin. Misultin (pronounced mee-sul-teen) is an Erlang library for building fast, lightweight HTTP servers. Since the same design criteria were used for misultin (e.g., embeddability and lean code), I presumed it would match Yucan's performance very closely. However, please note that the code uses a TCP backlog of 30 by default for some reason, which proved to be a bit less than optimal in my Yucan tests (I ran a Yucan test with 64 backlog entries and it was a tiny bit worse than the one with 128 entries).

Anyway, here's the data (misultin-smp-30.csv):

Rate,Received reply rate,Normalized error rate (1/100%),"Generated request rate (also, expected reply rate)",Error rate,Attempt 1, Attempt 2, Attempt 3, Error 1, Error 2, Error 3
1000 rps,1000,0,1000,0,1000,1000,1000,0,0,0
2000 rps,1999,0,2000,0,2000,2000,1999,0,0,0
3000 rps,2999,0,3000,0,2999,2999,3000,0,0,0
4000 rps,3997,0,4000,0,3997,3997,3997,0,0,0
5000 rps,4998,0,5000,0,4998,4998,4998,0,0,0
6000 rps,5999,0,6000,0,5999,5999,5999,0,0,0
7000 rps,7001,0,7000,0,7001,7001,7001,0,0,0
8000 rps,5725,1600,8000,16,5985,4049,7141,29,9,10
9000 rps,7364,1700,9000,17,7187,7044,7862,21,13,17
10000 rps,6070,2733,10000,27,5759,6333,6119,29,27,26

Looking at these numbers, it is clear that misultin and Yucan are very similar in performance and load handling. Yucan starts to turn up its nose at 9k RPS (5% errors); misultin does so a bit earlier, at 8k (16% errors). I can only applaud Roberto Ostinelli for developing this server, and recommend it to others, especially since it is incomparably more mature than my today's experiment with Yucan.

Warning: epoll/select: If you think you have discovered a potential problem with my test, and this problem is the lack of epoll use in Tornado, you are right. However, while using epoll (I will have to find Linux somewhere, which is not a trivial task due to the relative scarcity of such systems) will almost certainly fix the Tornado crash problem, this is only part of the story. The other part is the baseline performance of Tornado as compared to other web servers, and here is where it gets interesting. My assessment is that enabling epoll will not help its baseline performance. Why? Read my replies to several commenters below, wrt. the number of open sockets during the tests. If you want to repeat my test on a comparable Linux system, you are encouraged to do so, since webserver-benchmark.pl is available. I'll gladly publish the results here.

Translation: if you think epoll is better than select on a tiny number of hyper-active file descriptors, you are poised to do some reading. See


( 55 comments — Leave a comment )
Sep. 19th, 2009 09:20 am (UTC)
Have you tried running the test on Linux?
As far as I know, Tornado uses epoll on Linux, which should bring better performance.
Sep. 19th, 2009 09:37 am (UTC)
Re: epoll
No, I haven't tried it on Linux with epoll. You are right that it uses epoll when available and falls back to select() where not.

However, I do not think that epoll/select() is going to affect that particular test noticeably. For a single-threaded process, sustained load over a fast switched GigE LAN is normally handled using a small number of file descriptors (just enough to mask LAN latency). The typical number of parallel connections at the 3k load was 10-30, for example. This is easy to understand: if many more connections were needed for a sustained-load benchmark, it would mean that the Python code is lagging behind and is not capable of coping with incoming requests over the long term. In that case, the lag will clearly show up on the graphs at some point, by affecting the number of erroneous connections and decreasing throughput.

See the typical output of httperf:
Connection rate: 2426.9 conn/s (0.4 ms/conn, <=20 concurrent connections)
Connection time [ms]: min 1.2 avg 3.5 max 3000.9 median 2.5 stddev 60.0
Connection time [ms]: connect 1.5
Connection length [replies/conn]: 1.000

Request rate: 2426.9 req/s (0.4 ms/req)
Request size [B]: 72.0
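The "small number of descriptors" claim can be sanity-checked with Little's law (my arithmetic, using the httperf figures above):

```erlang
%% Little's law: concurrent connections ≈ arrival rate × mean connection time.
%% With 2426.9 conn/s and an average connection time of 3.5 ms:
Concurrency = 2426.9 * 3.5 / 1000.0.   % ≈ 8.5 concurrent connections,
                                       % well within "<=20 concurrent"
```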

The epoll/select (epoll/poll) distinction is important for the simulated "web crowd" load, where there are lots of idle connections (in keep-alive state) and a relatively small number of bursty transactions with short turnaround. In this case, epoll would help by keeping a pile of idle connections off the userland awareness. I was not doing that kind of test in the above setup.

In any case, if you want to confirm or disprove my point, you may want to try it on Linux yourself, since the benchmark utility is provided.
Sep. 19th, 2009 10:28 am (UTC)
I think that using epoll instead of select, plus some tuning of the number of open fds available in the system, will solve the problem with Tornado.
But anyway, I wonder why the Tornado developers are using their own Python-coded event loop instead of the battle-tested libevent or libev C-coded loops.
Sep. 19th, 2009 10:37 am (UTC)
braintrace, I don't think using epoll or additional tuning will help with performance; see above for why. Though I, too, believe it is going to fix the catastrophic failure. Yet again, as you can see, the blow-up is not the whole point of my post. The performance is, largely.
Sep. 19th, 2009 12:03 pm (UTC)
What's interesting about Tornado is that it is a framework as well. But frameworks are CPU-bound. And thousands of req/sec is too synthetic.

From http://www.tornadoweb.org/documentation#performance:

"We ran a few remedial load tests on a simple "Hello, world" application in each of the most popular Python web frameworks (Django, web.py, and CherryPy) to get the baseline performance of each relative to Tornado"

If you take a simple box with nginx + php-fpm serving "Hello, world!", it usually gives ~10K req/sec!
I just wanted to note that if you want to test the "framework" part, test it in a CPU-bound environment. If you want to test the "server" part... Isn't that nginx here? ;)
Sep. 19th, 2009 12:31 pm (UTC)
Looks interesting.
Tornado has some interesting ideas to steal, but I still prefer twisted.web as the more mature solution.
Sep. 19th, 2009 01:44 pm (UTC)
Yes, even in Erlang certain things (in isolated processes) can go wrong

I've no comments. ;)
Sep. 19th, 2009 03:40 pm (UTC)
Reqs per second are easy to measure but not that relevant in a production context. I'd be interested to see what happens when you saturate the number of concurrent requests on both servers. Real load == lots of clients.
Sep. 19th, 2009 08:31 pm (UTC)
Re: Concurrency
It depends on the load, admittedly, but depending on the COMET implementation, you either have extremely long polling or an extremely fast turnaround of transactions. The first is going to favorably highlight epoll; the second will not.
Sep. 19th, 2009 06:52 pm (UTC)
The epoll module was not properly compiled?
As much as I hate to perpetuate these "Tornado vs X" blog posts, I wanted to point out that I don't think you have configured Tornado properly. The error you printed is a stack trace from a call to select(). In a properly configured setup, Tornado should use epoll(), not select(). The two have significantly different performance characteristics under high load / with lots of file descriptors. Did you run this on Linux, and did you run setup.py build in the package?

Let me know if you have any trouble (I am one of the co-creators of Tornado).

Bret Taylor
Sep. 19th, 2009 08:29 pm (UTC)
Re: The epoll module was not properly compiled?

Please read the earlier explanation, which shows that epoll() would not be advantageous for the kind of setup I was running. I admit that configuring the Tornado server to use epoll() would most certainly eliminate the crash itself, but I insist that it would not change the performance characteristic during the non-overloaded part of the performance curve. The main observation is that there were no "lots of file descriptors" during the non-overloaded phase of operation. Select was doing just fine for a dozen open connections. As for how epoll helps when the system is overloaded: it does not. It certainly helps when there are lots of inactive sockets, but this test was not designed to exercise that pattern. See my earlier messages for an explanation.
Sep. 19th, 2009 10:42 pm (UTC)
Big difference between select and epoll
I have tested both epoll and select on Linux with Python. Once you get into thousands of requests a second, there is a big difference. Try it and see.
Sep. 19th, 2009 10:44 pm (UTC)
Re: Big difference between select and epoll
If you have tested it relatively recently, would you use my script (webserver-benchmark.pl) and share the results?

Otherwise, we are risking comparing different metrics.
Sep. 20th, 2009 06:17 am (UTC)
Implementation weakness
Hi Lev,

The problems shown in your benchmark are caused by the well-known inet:setopts(Sock, [{active, true} ...]).

Basically, if the load is high, then in the time between accepting the socket and switching active to false, you can already have the process message queue overloaded. An elegant implementation was presented here: http://fullof.bs/2008/12/31/ and here: http://trapexit.org/Building_a_Non-blocking_TCP_server_using_OTP_principles.
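The flow-control idiom those links describe can be sketched roughly like this (a sketch only; handle/1 is a placeholder for whatever processes each message):

```erlang
%% Sketch of the {active, once} idiom from the linked articles:
%% the socket is re-armed only after each message is consumed,
%% so the process mailbox cannot be flooded under load.
loop(Sock) ->
    inet:setopts(Sock, [{active, once}]),
    receive
        {http, Sock, http_eoh} -> done;
        {http, Sock, Msg} -> handle(Msg), loop(Sock);
        {tcp_closed, Sock} -> closed
    end.
```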

Daniel Kwiecinski
Sep. 20th, 2009 11:24 am (UTC)
Re: Implementation weakness

The problems shown in your benchmark are caused by well known inet:setopts(Sock, [{active, true} ...])

What problems, by the way? The fact that Yucan seems to be very fast? I don't think this is a problem.

The misultin server I benchmarked at the end of the post used exactly the approach presented at trapexit.org. It showed slower performance, albeit very close to Yucan's.

You also have to keep in mind that:

1. The {packet, http} mode stops sending messages to the queue after the http_eoh message, so no queue overload is going to happen after all the headers are parsed. The process has to disable that filter by issuing {packet, 0} ({packet, raw}) in order to continue receiving messages caused by subsequent (pipelined) HTTP requests or long request bodies (i.e., POST). In my case, there were no POSTs and no pipelining. The Connection: close mode was employed.

2. The misultin server practically mirrors the approach presented at trapexit.org, and even contains trapexit.org's copyright in the source code. It runs a bit slower, which can be attributed to the slightly heavier overhead of a more mature code base.

Overall, your assessment is not correct. There are no problems in my benchmark caused by {active, true}.
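Point 1 can be illustrated with a minimal sketch (the function name is made up for illustration):

```erlang
%% After http_eoh, the http packet filter delivers nothing further;
%% reading a request body or a pipelined request would require
%% explicitly switching the socket back to raw mode.
after_headers(Sock) ->
    receive
        {http, Sock, http_eoh} ->
            inet:setopts(Sock, [{packet, raw}, {active, false}]),
            ok
    end.
```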
Sep. 21st, 2009 09:46 am (UTC)
Erlang Web Framework???
Thanks for your interesting post and sharing your results!
My question: could you point us to a nice Erlang Web Framework that has features similar to tornadoweb or any of the well-known frameworks? I do not know anything about Erlang, but that will definitely change this week :)
Sep. 25th, 2009 04:38 pm (UTC)
Build Your Next Web Application with Erlang

The growing interest in applying Erlang/OTP to Web services and Web applications is driving the development of several interesting open source projects. In this column, we’ll look at some of the more popular Erlang Web frameworks and Web servers.

Sep. 24th, 2009 10:58 am (UTC)
Lev, sorry for being a nuisance, but I would be very grateful for the following information:
what are the good practices for deploying Erlang in production?

For example, if I deploy a self-written C daemon to production, I build a Debian package, put the configuration into /etc, put a monit rule watching the instance into /etc/monit.d, and it writes its logs into /var/log/NNN/.

If I need to roll out a plugin for Ruby on Rails, I create a repository with a known structure, likewise add what's needed to monit, and everything goes as a submodule into the main branch, which is deployed to production with capistrano (a deployment utility).

In the case of Erlang, it is not very clear to me what to do and how. There are some application packaging tools, but I don't really understand which of them are obsolete and which, on the contrary, are maintained. And you don't have to look far: even rabbitmq does not write its pid into a file for monit =(

If it's not too much trouble, could you tell me which keywords to google?

And a second question: we found the culprit in our crashing RabbitMQ. It turned out to be a buggy error_logger, which had eaten 20 gigabytes. Do you somehow keep track of processes that balloon in memory?
Sep. 24th, 2009 05:11 pm (UTC)
1. We use `svn up` as the deployment mechanism. Roughly speaking, `svn up && make upgrade`, where the Makefile spells out the procedure for compiling (into .beam) and loading the modules into a running Erlang VM instance. We do not use the standard Erlang release management methods.
2. We have our own tools for monitoring, restarting processes, and watching for runaway memory.

Keep in mind that the current structure grew "organically", and I cannot advise anyone to copy it. If I were building the service anew, I would use more standard and/or widespread deployment tools, such as capistrano and Erlang's standard OTP upgrade system.
Oct. 1st, 2009 05:49 am (UTC)
What is this even about?
Oct. 2nd, 2009 05:12 pm (UTC)
offtopic: oh! katap has come to visit you.
Oct. 16th, 2009 12:49 pm (UTC)
The advantage is not only in speed, but also in whether or not you have to retrain your staff. Although an employee who knows both Erlang and Python is better than one who just knows Python.

It's a pity Tornado didn't exist when we were launching a chat for 10k visitors (not hits per second, just visitors); Twisted was dying at around 8k.
Nov. 3rd, 2009 08:40 pm (UTC)
having meditated on the graphs
1. Python has an expensive accept procedure, which is why the backlog helps.
2. Something strange is going on with Tornado; maybe it opens sockets without reuse or something like that? Or keeps them around for pipelining?
Nov. 3rd, 2009 09:01 pm (UTC)
Re: having meditated on the graphs
1. It simply has a bug. The question is not that it crashes (there is a bug in the code that allows more than N file descriptors to be created), but what performance, compared to Erlang, it delivers before it falls over.

2. A certain amount of backlog helps everyone, both Erlang and C.