docs: add blog about benchmark test

pull/90/head
moonrailgun 2 years ago
parent 72aeccafac
commit 938a7acdd4

---
title: The Tailchat benchmark report is freshly released, broadcasting a message to 10k online users takes just 1.2 seconds
authors: moonrailgun
image: /img/logo.svg
slug: benchmark-report
keywords:
- tailchat
- benchmark
tags: [Report]
---
As an IM application, `Tailchat` naturally needs to handle a large number of concurrent online users.
To measure `Tailchat`'s ability to handle large numbers of users, and to give our customers enough confidence, we decided to take the time to test its actual performance in a real production environment.
To meet the scale requirements of users of different sizes and needs, `Tailchat` is built on a distributed architecture, which means we can carry workloads of different scales through horizontal scaling.
The drawback of a distributed system, however, is that it spends more resources on data communication and forwarding, so at small scale it is not as fast as a traditional centralized architecture.
This looks like a case of not being able to have your cake and eat it too, but to gain the advantages of both, `Tailchat` applies a special optimization to single-instance deployments: the shortest-path principle. If, in a call between microservices, the only endpoint that can consume the request is the instance itself, the forwarding stage is skipped and the request is delivered directly to itself.
As a result, as long as a single machine has sufficient capacity, it outperforms a cluster. When a single machine can no longer support the workload, you can switch to a cluster deployment and use multiple instances together to carry larger-scale business.
![](/img/architecture/transport.excalidraw.png)
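The shortest-path principle described above can be sketched roughly as follows. This is a minimal illustration under assumed names (`Broker`, `call`, the action IDs), not Tailchat's actual implementation:

```typescript
// Sketch of the shortest-path principle: if the only endpoint that can
// consume an action is registered on the local node, skip the transporter
// and invoke the handler directly, avoiding serialization and a network hop.
type Handler = (payload: unknown) => unknown;

class Broker {
  constructor(
    private localActions: Map<string, Handler>,
    private remoteSend: (action: string, payload: unknown) => unknown,
  ) {}

  call(action: string, payload: unknown): unknown {
    const local = this.localActions.get(action);
    if (local !== undefined) {
      // Shortest path: no forwarding stage, the request goes to ourselves.
      return local(payload);
    }
    // Otherwise forward through the transporter to another instance.
    return this.remoteSend(action, payload);
  }
}

const broker = new Broker(
  new Map<string, Handler>([["chat.message.send", (p) => `local:${String(p)}`]]),
  (action) => `remote:${action}`,
);
```

With this dispatch rule, a single-instance deployment never pays the forwarding cost, while a cluster transparently falls back to the transporter.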
## Benchmark test method
To measure the performance of Tailchat with many users online, we chose message sending and receiving as the measurement standard.
That is: **when several members of a group are online at the same time, the time taken by the complete link from a message being sent to every online user receiving the forwarded message.**
> For an IM project, the traditional 90th/99th-percentile figures are meaningless; the most basic requirement is that every user receives the broadcast message without loss. Therefore, for Tailchat we only look at the 100th-percentile (worst-case) message dissemination time.
At the same time, we test the **growth** of **resident** CPU and memory usage as more users come online.
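In code, "only look at the 100th percentile" reduces to taking the slowest receiver's latency, and treating any missing receipt as a failed run. A minimal sketch (the helper name and signature are hypothetical):

```typescript
// Hypothetical helper: the dissemination time of one broadcast is the
// 100th percentile of per-client latency, i.e. the slowest receiver.
// If any client missed the message, the run is invalid (message loss).
function disseminationTimeMs(
  sentAtMs: number,
  receivedAtMs: number[],
  expectedClients: number,
): number {
  if (receivedAtMs.length !== expectedClients) {
    throw new Error("message loss: not every client received the broadcast");
  }
  return Math.max(...receivedAtMs) - sentAtMs;
}
```

Note that `Math.max` over all receive timestamps is exactly the 100th percentile; no sorting or interpolation is needed.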
> This test is supported by the cluster service provided by [sealos](https://sealos.io/). sealos is really convenient!
The stress test consists of three main steps:
1. Register users in batches, and record the token returned by the system after each registration to a local file.
1. Load the tokens stored in the previous step, log in with each token, and establish a persistent connection. After all users have logged in, record the growth in resident resource usage.
1. After all users have logged in, pick one connection to send a message to a designated channel in a designated group, recording the start time. When every connection has received this message, record the end time and compute the elapsed time. Repeat this several times to collect multiple samples and eliminate error.
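Step 3 above can be sketched as follows. The `Socket` interface is a stand-in for the real persistent-connection client, not Tailchat's actual API:

```typescript
// Sketch of the timing loop: one connection sends, all connections wait,
// and we record the elapsed time once every socket has seen the message.
interface Socket {
  onMessage(cb: (messageId: string) => void): void;
  send(channelId: string, text: string): void;
}

async function measureBroadcast(sockets: Socket[], channelId: string): Promise<number> {
  const start = Date.now();
  let pending = sockets.length;
  const done = new Promise<void>((resolve) => {
    for (const s of sockets) {
      s.onMessage(() => {
        // Resolve only when the last connection has received the message.
        if (--pending === 0) resolve();
      });
    }
  });
  sockets[0].send(channelId, "benchmark ping");
  await done;
  return Date.now() - start; // elapsed ms until the slowest receiver
}
```

Because the promise only resolves when the counter reaches zero, a single lost message makes the run hang, which is the desired behavior: message loss must be visible, not averaged away.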
## Benchmark test overview
This stress test measured performance with 100, 500, 1000, 2000, 5000, and 10000 users online at the same time in the same group.
To squeeze as much performance out of `Tailchat` as possible, I chose to run as many services as possible under a minimal resource cap. With 3 instances, problems appeared at a peak of 800 users; after scaling to 5 instances we successfully supported 1,000 users, but another bottleneck appeared when the number of simultaneous online users rose to about 1,300. At that point I guessed it was caused by the Linux system's default `ulimit`, since no targeted tuning had been done beforehand. I decided the cluster test could pause there for the time being and moved to the Windows platform.
Sure enough, it lived up to my expectations: with no such limit on the Windows platform, `Tailchat`'s online user count successfully broke through 2k, 5k, and even 10k. At this point I consider our stress-testing work to have met our initial expectations; after all, ten thousand users in a single group already reaches the upper limit seen among comparable products in the industry. Of course, I believe the real ceiling is far higher, because there is still plenty of room for optimization.
In the largest use case, 10,000 users, we measured the elapsed time from message sending to all users receiving it 5 times. In the end, the answer `Tailchat` gave was 1.2 seconds; that is, a message reaches every online user within 1.2 seconds. I find this figure quite satisfactory; after all, a group with 10,000 members online at the same time often has well over 100,000 members in total. Of course, the large-user-count case will be optimized further, so this number is only a starting point, not an end point.
## Full Report
The specific benchmark test report can be found in: [Benchmark report](/docs/benchmark/report)

