Tech Share
When a Microsecond Is an Eternity: High Performance Trading Systems in C++
Oct 9, 2023
Sep 30, 2024
Low latency in an auto-trading system
Measuring performance

Some tips for electronic market making

  1. Cutting losses
  2. Cutting losses
  3. Cutting losses
    1. Then you may have a chance
Success means being faster than the competition
A very good minimum wire-to-wire time for a software-based trading system is around 2.5 µs

Safety

But safety first!
A lot can happen in a few seconds in an uncontrolled system.
The best approach is to automate the detection of failures.
Jitter is unacceptable: it means a bad trade may happen.

Hotpath

The “hotpath” is not executed frequently; it is only exercised about 0.01% of the time. The rest of the time, the system is idle or doing administrative work.
But when it does run, we want it to be as fast as possible, which is at odds with operating systems, networks, and hardware that are focused on throughput and fairness. For example, the network stack wants to buffer bytes until there are enough for a packet to be fully sent, and to make sure no single sender saturates the network. The same goes for OS multitasking.

Some things about C++ code

There are many factors that affect the performance of C++ code:
  • Compiler (and version)
  • Machine architecture
  • Third-party libraries
  • Build and link flags
So we need to check what the C++ actually does in terms of machine instructions:
  1. Less machine code may be faster
  2. Some instructions, such as call and jump (branch), are expensive

Low latency programming techniques

Cache

Hyper-threading: with, e.g., two threads on one core, each thread's share of the cache can drop to 50% or even to zero, so things get slower.

Branch

The compiler will optimize the “hotpath”:
  • less pressure on the caches (both instruction cache and data cache)
  • fewer branches for the hardware branch predictor to deal with
The error-handling functions on the left may be trampling your caches!
Just testing an integer is extremely fast.

Use templates instead of virtual functions

If you don't know which class is going to be instantiated at compile time, virtual functions are fine.
When we call MainLoop on an instance of OrderManager, it will use the SendOrder method from the specified implementation (OrderSenderA or OrderSenderB) to send orders.
 
If you know the complete set of things that may be instantiated, then you don't need virtual functions; just use templates.

Lambda

It just populates the message; it doesn't allocate new memory.
The assignment is a direct write; nothing is read first.
Memory allocation is expensive, and so is the destruction that goes with it.

Fewer branches, more templates

This branch appears in every function that takes a side parameter; for example, CalcPrice needs to know whether the side is buy or sell before deciding how to compute the execution price.
At compile time we already know whether the action is buy or sell, so remove as many branches as possible from the hotpath.
Avoid multithreading as much as possible.
Reduce repeated calls (spend memory to save time): read everything in one go.

The indexing problem

use CSDL?
You absolutely do not want to keep jumping through a linked list to read data.
Collisions are unavoidable, but overall it is still good.
Google Benchmark
Optiver has also implemented many hash tables internally.
The index can all be placed directly in cache (in groups of 8+8).
The dataset size has little impact, instructions are processed faster, and there are fewer cache misses.

Inline

The keyword “inline” does not mean “please inline this”; it means “there may be multiple definitions, and that's okay, linker, just resolve it”.
If you actually want to control inlining, use a compiler attribute to force inlining or prevent it.
Inlining may be faster or slower; you need to measure it and look at the disassembly.
Here, because the logging doesn't take many arguments, inlining is acceptable, but it pollutes the hotpath's instruction cache.

Most Important

Because the hotpath is rarely executed:
Keep the cache hot → pretend that everything we do triggers the hotpath.
 
The order doesn't really have to be sent to the exchange; it can be cut off at the end.
Push the data to the DPU but don't send it: it warms the card, and DPUs support this.
Not only does this keep the instruction and data caches hot; by repeating it, say, 10,000 times, it also trains the branch predictor to pick the right branch faster.
 
Profile-guided optimization (I'm not sure how to use it) has a somewhat similar effect, but it works at compile time; it's not going to help with the hardware at runtime.
You run a simulation, or even production, to collect the profile.
The compiler can use the information written to the profile file for branch-prediction hints and reordering.
But make sure that what's in the file matches what the system is actually going to do; otherwise the result deviates badly from what you wanted.
Eight cores share a 50 MB L3 cache; here we turn off the other seven cores and let a single core have it exclusively.
The compiler was just following the specification and added an extra check, but this goes against the C++ philosophy: we shouldn't pay for what we don't use.
If you pass a pointer, it's on you to make sure it isn't null.
Newer compilers have removed this null-pointer check; passing in nullptr is UB.
One tiny check (a few extra instructions) may sit right at the boundary between inlining and not inlining, and that can cause a whole pile of optimizations to be given up (GCC is especially sensitive to this).
The old version: no COW, and no optimizations for null or small strings.
 
In a multithreaded environment, a static variable is also initialized only once.
If an exception destroys the thread so that initialization never completes, that is handled too (?).
But every use of it has to check the guard variable, which has extra cost.
It may or may not allocate, depending on the implementation; here it allocates.
It captures 24 bytes, does nothing with them, and then frees them (not shown on the slides).
Clang will optimize this away; GCC will not.
 
A non-allocating version:
It allocates only when used, on the stack, with compile-time static asserts? It also does runtime checks if need be.
glibc can be evil if you want low latency.
Try to avoid syscalls; don't drop into the kernel, just let your C++ code run.
We don't even want interrupts.
Ideally the hotpath just runs quietly, all by itself.

Measurement

“Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.”
notion image
 
  • Profiling: tells you that most of the time is spent in malloc
  • Benchmarking: just start, and stop
After changing the code, measure again, and keep measuring. Don't guess!!!
 
The system does nothing for a long while, then suddenly does one thing for 300-400 ns, then goes quiet again; a sampling profiler can easily miss this.
It runs in a VM that simulates the CPU and doesn't even consider I/O characteristics; it's too intrusive, and the answer isn't accurate.
What a benchmark measures is not quite the production situation; it's better suited to measuring data structures like a vector.
 
Simulate the production environment:
Measure at the switch, writing timestamps into the packet headers.
The precision is about 5 ns.
After simulating a full day, analyze the captured packets.
However, the environment is hard to set up:
  • turn off particular interrupts
  • set CPU affinity

 
Watch out for the cache and the influence of other processes.
Use static_assert, constexpr, and templates.
Simple code runs faster.
Sometimes you don't need absolute precision and can compute faster; for example, a price may only need two decimal places rather than a dozen.

Q&A

Using turbo boost? To lock the cache?
 
Verify a hash value: keep that hash value and its corresponding array in cache, then you don't need pointers.
There is no absolute right or wrong in hash design, only what fits best.
 
Don't disturb a fast process on its core when you need to process data; it depends on the data volume:
  • have a component filter the data before handing it to the hotpath, so it doesn't take up too much while still responding quickly
  • or run a separate thread to handle the bulk of the data without affecting the fast process, which reads from it periodically
 
At 58 minutes, which I didn't fully follow: exhausting one size class in tcmalloc? A deallocation taking a global mutex that blocks all threads?
You can allocate on a separate thread and pass the pointer back.
 
Mark branches with expect; for example, in a fairly bad, random training environment, you can use this to steer the branch predictor.

Afterword

Much optimization could be done in theory, but in practice it often isn't. So you should verify what the compiler actually does for you.
 
