通过TSC观察CPU

自Pentium开始x86 CPU均引入TSC了,可提供指令级执行时间度量的64位时间戳计数寄存器,随着CPU时钟自动增加。

CPU指令:

rdtsc: Real Time-Stamp Counter rdtscp: Real Time-Stamp Counter and Processor ID

调用:

Microsoft Visual C++:

unsigned __int64 __rdtsc();
unsigned __int64 __rdtscp( unsigned int * AUX );

Linux & gcc :

extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void) {
  return __builtin_ia32_rdtsc ();
}
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtscp (unsigned int *__A)
{
  return __builtin_ia32_rdtscp (__A);
}

示例:

1: L1 cache及内存的延迟测量:

代码:

{
    ......
    /* flush cache line */
    _mm_clflush(&data[0]);

    /* measure cache miss latency */
    ts = rdtscp(&ui);
    m |= data[0];
    te = rdtscp(&ui);
    CALC_MIN(csci[0], ts, te);

    /* measure cache hit latency */
    ts = rdtscp(&ui);
    m &= data[0];
    te = rdtscp(&ui);
      /* flush cache line */
    _mm_clflush(&data[0]);

    /* measure cache miss latency */
    ts = rdtscp(&ui);
    m |= data[0];
    te = rdtscp(&ui);
    CALC_MIN(csci[0], ts, te);

    /* measure cache hit latency */
    ts = rdtscp(&ui);
    m &= data[0];
    te = rdtscp(&ui);
    CALC_MIN(csci[1], ts, te);
    CALC_MIN(csci[1], ts, te);
}

结果:

rdtscp指令自身耗时:X86: 31 X64: 33

架构 X86 X64
长度 BYTE WORD DWORD QWORD BYTE WORD DWORD QWORD
冷:内存 244 241 246 250 254 254 260 261
热:L1 31 31 31 31 35 35 35 35

问题:读取1个字节与8个字节的所用的时间是一样的,为什么?

2: 常见整型运算及多条指令执行周期:

代码:

{
    ......
    /* measure mul latency */
    ts = rdtscp(&ui);
    m *= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci[2], ts, te);

    /* measure div latnecy */
    ts = rdtscp(&ui);
    m /= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci[3], ts, te);

    /* measure 2*mul latnecy */
    ts = rdtscp(&ui);
    m *= *((U32 *)&data[0]);
    m *= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci2[0], ts, te);

    /* double div */
    ts = rdtscp(&ui);
    m /= *((U32 *)&data[0]);
    m /= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci2[1], ts, te);

    /* mul + div */
    ts = rdtscp(&ui);
    m *= *((U32 *)&data[0]);
    m /= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci2[2], ts, te);

    /* measure float mul latency */
    ts = rdtscp(&ui);
    f = f * m;
    te = rdtscp(&ui);
    CALC_MIN(csci[4], ts, te);

    /* measure float div latency */
    while (!m)
        m = rand();
    ts = rdtscp(&ui);
    f = f / m;
    te = rdtscp(&ui);
    CALC_MIN(csci[5], ts, te);
}

结果:

指令周期
数据类型 整型 浮点数
指令组合 m* m/ m, m m, n m/, m/ m/, n/ m*, m/ f* f/
指行时间 2 20 4 4 48 26 24 17 26

问题:m及n的除法运算的耗时只比m的除法多了一点,但却明显少于m的两次除法,为什么?

注意事项:

  1. 考虑到CPU乱序执行的问题,rdtsc需要配合cpuid或lfence指令,以保证计这一刻流水线已排空,即rdtsc要测量的指令已执行完。后来的CPU提供了rdtscp指令,相当于cpuid + rdtsc,但cpuid指令本身的执行周期有波动,而rdtscp指令的执行更稳定。不过rdtscp不是所有的CPU都支持,使用前要通过cpuid指令查询是不是支持: 即CPUID.80000001H:EDX.RDTSCP[bit 27]是不是为1
  2. 多核系统:新的CPU支持了Invariant TSC特性,可以保证在默认情况下各核心看到的TSC是一致的,否则测量代码执行时不能调度至其它核心上。另外TSC是可以通过MSR来修改的,这种情况下也要注意: Invariant TSC: Software can modify the value of the time-stamp counter (TSC) of a logical processor by using the WRMSR instruction to write to the IA32_TIME_STAMP_COUNTER MSR
  3. CPU降频问题:第一代TSC的实现是Varient TSC,没有考虑到降频的问题,故在低功耗TSC计数会变慢,甚至停止;后来又有了Constant TSC,解决了降频的问题,但在DEEP-C状态下依然会发生停止计数的情况,所以又有了最新的Invariant TSC的特性: The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
  4. 指令本身的时间开销 Pentinum Gold G5500T: 31 cycles Core i7-7820HQ: 25 cycles
  5. 权限问题(此指令可用于时序攻击,如Meltdown及Spectre): CR4.TSD: Time Stamp Disable (bit 2 of CR4) — Restricts the execution of the RDTSC instruction to procedures running at privilege level 0 when set; allows RDTSC instruction to be executed at any privilege level when clear. This bit also applies to the RDTSCP instruction if supported (if CPUID.80000001H:EDX[27] = 1).
  6. 计数器溢出可能:计算器本身是64位的,即使是主频4G的CPU,也要100多年才会溢出,对于我们的测量来说可以不用考虑
  7. 时序测量容易被干扰(线程调度、抢占、系统中断、虚拟化等),要求测量的指令序列尽量短,并且需要进行多次测量