Erlang, Yaws, and the deadly Tornado

Good things sometimes happen to the open source community. Since Facebook's acquisition of FriendFeed, a bunch of technologies have been released into the wild, most notably Tornado, a web server written in Python. Tornado is touted as «a scalable, non-blocking web server and web framework». See the Wikipedia article http://en.wikipedia.org/wiki/Tornado_HTTP_Server for some details on the performance of that server, as well as a comparison with other web servers.

Here's the chart, taken from Wikipedia:

Performance on AMD Opteron, 2.4GHz, 4 Cores

Server    Setup                        Requests per Second
Tornado   nginx, 4 frontends           8213
Tornado   1 single threaded frontend   3353
Django    Apache/mod_wsgi              2223
web.py    Apache/mod_wsgi              2066
CherryPy  standalone                   785

The numbers looked interesting, so I decided to benchmark Tornado myself to see how it fares against some Erlang tools. Keep in mind that the Erlang runtime itself is not the fastest beast in the woods. It is generally considered slower than many other interpreted languages (including Python), especially on file operations (due to the complexities of the io library, which does most of the heavy lifting). However, network I/O, message passing and [green] process spawning are quite fast, so people use Erlang fairly extensively as a web backend. Facebook itself uses Erlang for the Facebook Chat application:
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space "processes", share-nothing message-passing semantics, built-in distribution, and a "crash and recover" philosophy proven by two decades of deployment on large soft-realtime production systems.

There are a few web servers for the Erlang VM, notably Yaws and Mochiweb. Yaws is positioned as the most general-purpose (and most mature) web server, resembling the Apache of the imperative world. Mochiweb, in turn, is mostly a special-purpose embedded web server (though Yaws can be embedded too).

Here's a nice comparison of Yaws, Mochiweb and Nginx: http://www.joeandmotorboat.com/2009/01/03/nginx-vs-yaws-vs-mochiweb-web-server-performance-deathmatch-part-2/

Since I know Yaws performance very well (several thousand requests per second on modern hardware, generally a very competitive piece of software), I was interested in comparing it to Tornado using some sort of a stress test.

But soon I realized that I also wanted to measure some baseline Erlang performance. Yaws does a fair amount of heavy lifting under the hood, which is not always valuable, especially in an embedded environment. We can do better. So today I sat down at Specialty's and implemented a small web server from scratch, using Erlang's newly documented http packet filter. Its name is Yucan (it does not mean anything).

So, meet Yucan. Here's the front of the web server: a central TCP acceptor loop. See how easy it is to spawn a process per connection:

tcpAcceptor(Srv, ListeningSocket) ->
        case gen_tcp:accept(ListeningSocket) of
                {ok, Sock} ->
                        % One process per connection. It waits until it owns
                        % the socket before switching on the http packet filter.
                        Pid = spawn(fun () ->
                                receive permission ->
                                        inet:setopts(Sock, [
                                                {packet, http_bin},
                                                {active, true}
                                        ])
                                after 60000 -> timeout
                                end,
                                collectHttpHeaders(Srv, Sock,
                                        tstamp()+?HTTP_HDR_RCV_TMO, [])
                        end),
                        % Hand the socket over, then let the process proceed
                        gen_tcp:controlling_process(Sock, Pid),
                        Pid ! permission,
                        tcpAcceptor(Srv, ListeningSocket);
                {error, econnaborted} ->
                        tcpAcceptor(Srv, ListeningSocket);
                {error, closed} -> finished;
                Msg ->
                        error_logger:error_msg("Acceptor died: ~p~n", [Msg])
        end.

Here's Yucan's request header assembler, using the convenient http packet filter provided by Erlang:

collectHttpHeaders(Srv, Sock, UntilTS, Headers) ->
  Timeout = (UntilTS - tstamp()),
  receive
    % Add this next header into the pile of already received headers
    {http, Sock, {http_header, _Length, Key, undefined, Value}} ->
        collectHttpHeaders(Srv, Sock, UntilTS,
                [{header, {Key,Value}}|Headers]);

    % The request line itself
    {http, Sock, {http_request, Method, Path, HTTPVersion}} ->
        collectHttpHeaders(Srv, Sock, UntilTS,
                [{http_request, decode_method(Method), Path, HTTPVersion}
                        | Headers]);

    % End of headers: switch back to raw passive mode and dispatch
    {http, Sock, http_eoh} ->
        inet:setopts(Sock, [{active, false}, {packet, 0}]),
        reply(Sock, lists:reverse(Headers),
                fun(Hdrs) -> dispatch_http_request(Srv, Hdrs) end);

    {tcp_closed, Sock} -> nevermind;

    Msg -> io:format("Invalid message received: ~p~nAfter: ~p~n",
                [Msg, lists:reverse(Headers)])
  after Timeout ->
        reply(Sock, Headers,
                fun(_) -> [{status, 408, "Request Timeout"},
                        {header, {<<"Content-Type: ">>, <<"text/html">>}},
                        {html, "<html><title>Request timeout</title>"
                                "<body><h1>Request timeout</h1></body></html>"}]
                end)
  end.

I also wanted to get a feeling for the effect of the TCP listening backlog on that web server, so I ran a number of tests with different backlogs: 1, 5, 128, 256. And, for the sake of completeness, I also ran the stress tests against both single-threaded and SMP-enabled Erlang VM configurations.

For a testing engine, I drafted a Perl wrapper around the old httperf routine. It throws 1000, 2000, …, 10000 requests per second at a web site a number of times, averages the data, captures error rates, and saves the result into a CSV for colorful graphing. There's nothing fancy about this Perl wrapper, here it is.
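The averaging step of such a wrapper might look roughly like the sketch below. It is written in Python rather than Perl and is not the actual script; the function name `aggregate` and the choice to truncate averages to integers are assumptions, picked because they reproduce the rows of the misultin CSV shown later in this post.

```python
def aggregate(rate, replies, errors):
    """Average several httperf attempts made at one offered request rate.

    replies: reply rate (per second) observed in each attempt
    errors:  error percentage observed in each attempt
    Truncating instead of rounding is an assumption that matches the CSV.
    """
    mean_reply = sum(replies) / len(replies)
    mean_error = sum(errors) / len(errors)
    return {
        "rate": rate,
        "reply_rate": int(mean_reply),
        # 100 on the normalized scale corresponds to 1% of failed requests
        "normalized_error": int(mean_error * 100),
        "error_rate": int(mean_error),
    }


# The three attempts at 10000 rps from the misultin data set
row = aggregate(10000, [5759, 6333, 6119], [29, 27, 26])
print(row)
```

The normalized error column exists only so that errors and reply rates can share one axis on a graph.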

The test bed was a 4-core 2.5GHz Xeon L5420 running the web server, with another such system as the source of requests. FreeBSD-7.2. Erlang R13B01. HiPE did not make a noticeable difference, see my email to erlang-questions.

Here are the graphs for the different TCP listening backlog and SMP/Non-SMP variables. They show a backlog of 128 entries to be the sweet spot irrespective of SMP mode. Incidentally, the Tornado web server also uses a backlog of 128 by default. Yaws uses 5, which is the default value of Erlang's gen_tcp.
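For readers unfamiliar with the knob being turned here: the backlog is the argument to listen(), and it bounds the queue of connections the kernel has completed but the server has not yet accepted. A minimal Python sketch follows (the helper names `make_listener` and `demo` are made up for illustration; in Erlang the equivalent is the `{backlog, N}` option of gen_tcp:listen):

```python
import socket


def make_listener(backlog=128, host="127.0.0.1", port=0):
    """Create a listening TCP socket with an explicit accept backlog."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))   # port 0 lets the OS pick a free port
    srv.listen(backlog)      # the value varied in the tests: 1, 5, 128, 256
    return srv


def demo():
    """Connect before accept() is called; the connection waits in the backlog."""
    srv = make_listener(128)
    client = socket.create_connection(srv.getsockname())
    conn, peer = srv.accept()   # picks the queued connection off the backlog
    conn.close()
    client.close()
    srv.close()
    return peer


print(demo())
```

How the kernel interprets the number (and whether it caps it) differs between operating systems, which is one reason the sweet spot found here may not carry over to other platforms.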

[Graphs omitted: for each backlog value, one chart for Non-SMP Yucan (1 core) and one for SMP Yucan (4 cores). The side notes follow.]

Backlog 1: About 3k requests per second for Non-SMP, and a surprising 2k RPS for SMP. Not good. Understandable. The red stuff is errors, normalized; the red value should be as close to zero as possible. 100 on this scale means 1% of requests never finished or finished badly.

Backlog 5: Here, with a backlog of 5, we see almost 3k requests per second for the non-SMP system and a satisfactory almost 4k for the SMP system.

Backlog 128: Best backlog value! A tiny bit better than before on the Non-SMP front, and better by a great margin in the SMP configuration. 8k RPS for sure, maybe even an honest 8500.

Backlog 256: Clearly, more is not always better. A 256-entry TCP backlog hurts performance noticeably on both SMP and Non-SMP systems. But we can state 3k/8k requests per second anyway.

Now, since we see that TCP listening backlog of 128 is a sweet spot for at least Yucan, and also is a default setting for Tornado, let's fix that backlog setting at 128. First, let's compare Yucan and Yaws side by side:

Oh, my dear! What the hell is that? Whereas Yucan runs close to 8500 requests per second on 4 cores, Yaws manages only 2k, maybe 2.5k per second on the same SMP system! This can be explained to a degree by the fact that I used the production configuration for Yaws, with a custom #arg rewriter which adds a bit to the running time. Also, Yaws itself is not the simplest piece of code, and has perhaps accumulated some inefficiencies over time which prevent it from scoring well against the 180 lines of Yucan.

But anyway, Yaws' 2k RPS is for a production configuration, not just a tiny benchmark.

Let's move on to the Tornado web server test, which is clearly a tiny benchmark (see http://www.tornadoweb.org/, I just copied these 15 lines of code off that page and used them). We switch Yucan to the Non-SMP mode to compare apples with apples.

The good part: in a single thread configuration (listed as 3.3k RPS on AMD 2.4 GHz) it showed 4k RPS on my 2.5 GHz Xeon, which is clearly faster than Yucan's 3.5k RPS in the same single thread configuration.

The bad part: Tornado is touted as a scalable thing, but it appears to require an nginx load balancer in front of a farm of independent Tornado processes (each of which will mostly end up running on its own core) to show its scalability. This has a clear disadvantage in communication: in order to exchange data between these independent processes, a Tornado application will have to use some form of IPC (Thrift, JSON, XMLRPC, etc). Erlang Yucan proves to be much better in this respect: it scales up to 8k simply by passing the -smp enable flag to the Erlang VM. That's it: no complex setup, just a flag, and no changes to the application whatsoever. Yucan was written with at least two contention points: the TCP acceptor and a dispatcher lookup table process. Nevertheless, it scaled well, because Erlang found opportunities for parallelization even in that code.

The deadly part: Tornado has funneled under load!

At some point while httperf was doing a 6000 requests per second test round, the Tornado web server died with the following diagnostics:
ERROR:root:Exception in I/O handler for fd 5
Traceback (most recent call last):
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 189, in start
    self._handlers[fd](fd, events)
  File "/home/vlm/tornado-0.2/tornado/httpserver.py", line 94, in _handle_events
    connection, address = self._socket.accept()
  File "/usr/local/lib/python2.6/socket.py", line 195, in accept
    sock, addr = self._sock.accept()
error: [Errno 53] Software caused connection abort
Traceback (most recent call last):
  File "./ws.py", line 18, in <module>
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 173, in start
    event_pairs = self._impl.poll(poll_timeout)
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 340, in poll
    self.read_fds, self.write_fds, self.error_fds, timeout)
ValueError: filedescriptor out of range in select()
[vlm@yucan ~/tornado-0.2]$

Neither Yucan nor Yaws allowed themselves such a liberty. Yes, even in Erlang certain things (in isolated processes) can go wrong, but Erlang is specifically designed to be resilient to programming failures by adopting share-nothing semantics, message passing, process linking and supervision, and other nice concepts. Taken together, these things greatly simplify a programmer's life, while the Erlang VM delivers more than acceptable out-of-the-box performance on real-life tasks.

So, here we are. The data are open to further interpretation.

Update: Roberto Ostinelli has contacted me asking to perform the same set of tests against the trunk version of misultin. Misultin (pronounced mee-sul-teen) is an Erlang library for building fast lightweight HTTP servers. Because the same design criteria were used for misultin (e.g., embeddability and lean code), I presumed it would very closely match Yucan's performance. However, please note that the code uses a TCP backlog of 30 by default for some reason, which proved to be slightly less than optimal in my Yucan tests (I ran a Yucan test with 64 backlog entries and it was a tiny bit worse than the one with 128 entries).

Anyway, here's the data (misultin-smp-30.csv):

Rate,Received reply rate,Normalized error rate (1/100%),"Generated request rate (also, expected reply rate)",Error rate,Attempt 1, Attempt 2, Attempt 3, Error 1, Error 2, Error 3
1000 rps,1000,0,1000,0,1000,1000,1000,0,0,0
2000 rps,1999,0,2000,0,2000,2000,1999,0,0,0
3000 rps,2999,0,3000,0,2999,2999,3000,0,0,0
4000 rps,3997,0,4000,0,3997,3997,3997,0,0,0
5000 rps,4998,0,5000,0,4998,4998,4998,0,0,0
6000 rps,5999,0,6000,0,5999,5999,5999,0,0,0
7000 rps,7001,0,7000,0,7001,7001,7001,0,0,0
8000 rps,5725,1600,8000,16,5985,4049,7141,29,9,10
9000 rps,7364,1700,9000,17,7187,7044,7862,21,13,17
10000 rps,6070,2733,10000,27,5759,6333,6119,29,27,26

Looking at these numbers, it is clear that misultin and Yucan are very similar in performance and load handling. Yucan starts to turn up its nose at 9k RPS (5% errors); misultin does so a bit earlier, at 8k (16% errors). I can only applaud Roberto Ostinelli for developing this server, and recommend it to others, especially since it is incomparably more mature than today's experiment of mine with Yucan.
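The saturation point can also be read off the CSV mechanically. The sketch below embeds the misultin-smp-30.csv data exactly as listed above and reports the first offered rate with a non-zero error rate, plus the best averaged reply rate; the function names are illustrative only.

```python
import csv
import io

MISULTIN_CSV = '''\
Rate,Received reply rate,Normalized error rate (1/100%),"Generated request rate (also, expected reply rate)",Error rate,Attempt 1, Attempt 2, Attempt 3, Error 1, Error 2, Error 3
1000 rps,1000,0,1000,0,1000,1000,1000,0,0,0
2000 rps,1999,0,2000,0,2000,2000,1999,0,0,0
3000 rps,2999,0,3000,0,2999,2999,3000,0,0,0
4000 rps,3997,0,4000,0,3997,3997,3997,0,0,0
5000 rps,4998,0,5000,0,4998,4998,4998,0,0,0
6000 rps,5999,0,6000,0,5999,5999,5999,0,0,0
7000 rps,7001,0,7000,0,7001,7001,7001,0,0,0
8000 rps,5725,1600,8000,16,5985,4049,7141,29,9,10
9000 rps,7364,1700,9000,17,7187,7044,7862,21,13,17
10000 rps,6070,2733,10000,27,5759,6333,6119,29,27,26
'''


def load_rows(text):
    """Return (offered_rate, reply_rate, error_percent) for each data row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    # Column 1: averaged reply rate, 3: offered rate, 4: error rate in percent
    return [(int(r[3]), int(r[1]), int(r[4])) for r in reader]


def first_failing_rate(rows):
    """Lowest offered rate at which any errors were recorded, or None."""
    failing = [offered for offered, _, err in rows if err > 0]
    return min(failing) if failing else None


rows = load_rows(MISULTIN_CSV)
print(first_failing_rate(rows))            # offered rate where errors begin
print(max(reply for _, reply, _ in rows))  # best averaged reply rate
```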

Warning: epoll/select: If you think you have discovered a potential problem with my test, and that problem is the lack of epoll use in Tornado, you are right. However, while using epoll (I will have to find a Linux box somewhere, which is not a trivial task due to the relative scarcity of such systems here) will almost certainly fix the Tornado crash, this is only part of the story. The other part is the baseline performance of Tornado as compared to other web servers, and this is where it gets interesting. My assessment is that enabling epoll will not help with its baseline performance. Why? Read my replies to several commenters below regarding the number of open sockets during the tests. If you want to repeat my test on a comparable Linux system, you are encouraged to do so, since webserver-benchmark.pl is available. I'll gladly publish the results here.
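The crash itself comes from a hard limit in select(): descriptor numbers at or above the compile-time FD_SETSIZE (commonly 1024) cannot be put into an fd_set at all, and CPython reports that as the ValueError seen in the traceback. The sketch below reproduces the error on a POSIX system without opening a thousand sockets, and then shows the standard library's selectors module, which picks epoll or kqueue where available. This is modern Python 3 for illustration, not Tornado 0.2 code, and the function names are made up.

```python
import select
import selectors
import socket


def select_rejects(fd):
    """True if select() refuses this descriptor number before any syscall."""
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        return True    # "filedescriptor out of range in select()"
    except OSError:
        return False   # number is in range, the descriptor is just not open
    return False


def ready_with_default_selector():
    """Wait for one readable socket using the platform's best mechanism."""
    a, b = socket.socketpair()
    sel = selectors.DefaultSelector()  # epoll, kqueue, poll or select
    try:
        sel.register(a, selectors.EVENT_READ)
        b.send(b"x")                   # make `a` readable
        events = sel.select(timeout=1)
        return [key.fileobj is a for key, _ in events]
    finally:
        sel.close()
        a.close()
        b.close()


# A descriptor number far above any usual FD_SETSIZE (POSIX builds only)
print(select_rejects(1_000_000))
print(ready_with_default_selector())
```

This only addresses the crash. Whether a different readiness mechanism changes throughput when just a handful of sockets are open at any moment is a separate question, which is the point of the warning above.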

Translation: if you think epoll is better than select on a tiny number of hyper-active file descriptors, you are poised to do some reading. See


Nov. 3rd, 2009 09:48 pm (UTC)
Re: after meditating on the graphs
2. There were no such tests for Tornado. I wanted to run them, but as soon as I saw that Tornado died, and saw why, I stopped right away. It is a lot of work, and the results would be hard to interpret anyway. We can measure the backlog, but Python's reliance on select is not going anywhere. So that is why.

2.1. Yes, I see that you mentioned accept_filter. There was no accept_filter for Yucan.

2.2. Theorizing is interesting, but it all makes little sense while select is in use.

I recently got a Linux farm of comparable configuration again; I will retest on it at some point.
Nov. 3rd, 2009 09:57 pm (UTC)
Re: after meditating on the graphs
2. To estimate the cost of accept in Python. It may need 512.

2.1 Is accept_filter hard to do for Yucan? It would help to understand how it affects the load (now) and how it relates to the backlog. It seems to be just a couple of lines to add, setting options on the socket.

2.2 It makes a lot of sense even with select, because network stacks differ between operating systems and comparing them is useful. And is Python really select-only? That is, is data reception also done through select, without handing the fd off to a thread/process/fork? If so, was it taken into account that the standard select is designed for, I think, 1K fds, and that for more you have to call it in a special way and do some extra voodoo?
Nov. 3rd, 2009 10:09 pm (UTC)
Re: after meditating on the graphs
2. It will die on select anyway, or, if it does not manage to die, the test is not equivalent, because of select. I will test it properly on Linux.

2.1. Yes, it can be done. Is it needed? Then we would be measuring not Erlang against Python, but erlang/freebsd/kqueue/accf vs. python/linux/qpoll, and on different, non-identical machines at that. From the point of view of choosing a system that is good, but there are too many moving variables. Do you think anything meaningful could be extracted from such a comparison?

2.2. Data reception: it looks that way, there are no threads or processes in Tornado. 1k fds: in FreeBSD the size of the select pool is controlled in userland, not in the kernel. However, this most likely does not affect Python, because it is just a #define at compile time of the C code; of python.exe itself, basically.
Nov. 3rd, 2009 10:26 pm (UTC)
Re: after meditating on the graphs
2. It is still useful for understanding and for setting reference points.
2.1. First of all, Erlang would be measured against itself with and without accf. The result would give an estimate of how effective accf is. And why can't FreeBSD be installed on one machine (from the cluster)?

2.2 So how was this particular Python built? For a select pool of what size exactly?
Nov. 3rd, 2009 10:31 pm (UTC)
Re: after meditating on the graphs
2. ack
2.1. Because it is Amazon EC2, and there is no FreeBSD on it.

2.2. The default build. In the absence of other data, we assume 1k.
Nov. 3rd, 2009 10:51 pm (UTC)
Re: after meditating on the graphs
2.1 Hm. Hm. Perhaps that has an effect too? Or was it a different platform the previous times?
What about getting a PC at Walmart for a week and then returning it?

2.2 Then it is no surprise that it dies like that. And what did you expect of it, what load was it supposed to hold? The number of cores will not help there either. I mean, maybe it really is scalable and all that, but Python probably needs to be built properly too?

and http://people.freebsd.org/~dwhite/PyKQueue/ and http://code.google.com/p/python-kqueue/

and http://docs.python.org/library/select.html seems to have standard wrappers for poll/kqueue alongside select?
Nov. 3rd, 2009 11:01 pm (UTC)
Re: after meditating on the graphs
2.1. The previous times it was my own Xeon in colocation. It is written above.
About Walmart: ...and set it up myself, yes. Swap operating systems. No thanks, I will use what is ready, whatever it is: I do not have that much time.

2.2. I suspect that if someone sustains a load of X units while on average having one and a half sockets busy (managing to serve everything quickly), then it does not matter to them whether it is select or kqueue/epoll. kqueue/epoll matter mainly when you need to switch among many sockets, many of which are idle. I have said this repeatedly in the comments above.

About the "and": I measured the offering as it was offered. I was given Tornado, so I measure Tornado. If Tornado needs to be optimized before use, then by that logic one could end up rewriting everything in C. Both versions. And comparing them. There is no point in that. So the fact that Tornado does not use kqueue, only epoll, is a matter for the creators of Tornado.

But, again, I repeat that on fast, short requests that never wait for anything, which create a load of a handful of open descriptors at any given moment, there should be no difference between kqueue and select.
Nov. 3rd, 2009 11:19 pm (UTC)
Re: after meditating on the graphs
FreeBSD installs in 10 minutes, yeah. As for configuring it: it can do without, I guess.

2.2. Then one more measured parameter is missing: the execution time of a single request.
Nov. 3rd, 2009 11:27 pm (UTC)
Re: after meditating on the graphs
FreeBSD installs in 10 minutes, yeah. As for configuring it: it can do without, I guess.

In exactly the same way, one can take this test script (it is linked in the post) and run the experiment oneself. Also ten minutes, sort of.

I think it is clear that none of this is really a ten-minute job.

2.2. Yes.
Nov. 3rd, 2009 11:30 pm (UTC)
Re: after meditating on the graphs
I do not have a Walmart nearby, and two months ago I did not come across your post, and right then I happened to have two freshly installed, unloaded, identical servers.