SKL_EVENTS(3CPC) CPU Performance Counters Library Functions SKL_EVENTS(3CPC)

NAME


skl_events - processor model specific performance counter events

DESCRIPTION


This manual page describes events specific to the following Intel CPU
models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.

CPU models described by this document:

+o Family 0x6, Model 0xa6

+o Family 0x6, Model 0xa5

+o Family 0x6, Model 0x9e

+o Family 0x6, Model 0x8e

+o Family 0x6, Model 0x5e

+o Family 0x6, Model 0x4e
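
Any of the events below can be bound and sampled with the libcpc
interfaces described in cpc(3CPC). The following fragment is an
illustrative sketch only: the chosen event, the CPC_COUNT_USER
flag, and the abbreviated error handling are examples, and the
event must be supported by the running CPU (one of the models
listed above).

#include <stdio.h>
#include <libcpc.h>

int
main(void)
{
        cpc_t *cpc;
        cpc_set_t *set;
        cpc_buf_t *buf;
        uint64_t val;
        int idx;

        if ((cpc = cpc_open(CPC_VER_CURRENT)) == NULL)
                return (1);
        set = cpc_set_create(cpc);

        /* Request user-mode counts of one event from this page. */
        idx = cpc_set_add_request(cpc, set, "inst_retired.any_p",
            0, CPC_COUNT_USER, 0, NULL);
        buf = cpc_buf_create(cpc, set);
        (void) cpc_bind_curlwp(cpc, set, 0);

        /* ... run the code to be measured here ... */

        (void) cpc_set_sample(cpc, set, buf);
        (void) cpc_buf_get(cpc, buf, idx, &val);
        (void) printf("inst_retired.any_p: %llu\n",
            (unsigned long long)val);

        (void) cpc_unbind(cpc, set);
        (void) cpc_close(cpc);
        return (0);
}

Such a program is typically compiled with -lcpc. Error checking is
abbreviated; each libcpc call returns NULL or -1 on failure.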

The following events are supported:

ld_blocks.store_forward
Counts the number of times store forwarding was prevented for a
load operation. The most common case is a load blocked because its
memory address (partially) overlaps with a preceding uncompleted
store. Note: see the table of unsupported store forwards in the
Optimization Guide.

ld_blocks.no_sr
The number of times that split load operations are temporarily
blocked because all resources for handling the split accesses are
in use.

ld_blocks_partial.address_alias
Counts false dependencies in the Memory Order Buffer (MOB) when a
partial comparison during the loose net check matches and the
dependency is resolved by the Enhanced Loose Net mechanism. This
may not result in high performance penalties. Loose net checks can
fail when loads and stores are 4K aliased.

dtlb_load_misses.miss_causes_a_walk
Counts demand data loads that caused a page walk of any page size
(4K/2M/4M/1G). This implies it missed in all TLB levels, but the
walk need not have completed.

dtlb_load_misses.walk_completed_4k
Counts completed page walks (4K sizes) caused by demand data
loads. This implies address translations missed in the DTLB and
further levels of TLB. The page walk can end with or without a
fault.

dtlb_load_misses.walk_completed_2m_4m
Counts completed page walks (2M/4M sizes) caused by demand data
loads. This implies address translations missed in the DTLB and
further levels of TLB. The page walk can end with or without a
fault.

dtlb_load_misses.walk_completed_1g
Counts completed page walks (1G sizes) caused by demand data
loads. This implies address translations missed in the DTLB and
further levels of TLB. The page walk can end with or without a
fault.

dtlb_load_misses.walk_completed
Counts completed page walks (all page sizes) caused by demand data
loads. This implies it missed in the DTLB and further levels of
TLB. The page walk can end with or without a fault.

dtlb_load_misses.walk_pending
Counts 1 per cycle for each PMH that is busy with a page walk for a
load. EPT page walk durations are excluded on the Skylake
microarchitecture.

dtlb_load_misses.walk_active
Counts cycles when at least one PMH (Page Miss Handler) is busy
with a page walk for a load.

dtlb_load_misses.stlb_hit
Counts loads that miss the DTLB (Data TLB) and hit the STLB (Second
level TLB).
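
As a rough derived metric (an approximation, not an event of its
own), the average duration of a completed demand-load page walk,
in core cycles, can be estimated as:

    dtlb_load_misses.walk_pending / dtlb_load_misses.walk_completed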

memory_disambiguation.history_reset
No description is available for this event.

int_misc.recovery_cycles
Core cycles the Resource allocator was stalled due to recovery from
an earlier branch misprediction or machine clear event.

int_misc.recovery_cycles_any
Core cycles the allocator was stalled due to recovery from an
earlier clear event for any thread running on the physical core
(e.g., misprediction or memory nuke).

int_misc.clear_resteer_cycles
Cycles the issue-stage is waiting for front-end to fetch from
resteered path following branch misprediction or machine clear
events.

uops_issued.any
Counts the number of uops that the Resource Allocation Table (RAT)
issues to the Reservation Station (RS).

uops_issued.stall_cycles
Counts cycles during which the Resource Allocation Table (RAT) does
not issue any Uops to the reservation station (RS) for the current
thread.

uops_issued.vector_width_mismatch
Counts the number of Blend Uops issued by the Resource Allocation
Table (RAT) to the reservation station (RS) in order to preserve
upper bits of vector registers. Starting with the Skylake
microarchitecture, these Blend uops are needed since every Intel
SSE instruction executed in Dirty Upper State needs to preserve
bits 128-255 of the destination register. For more information,
refer to the Mixing Intel AVX and Intel SSE Code section of the
Optimization Guide.

uops_issued.slow_lea
Number of slow LEA uops being allocated. A uop is generally
considered a SlowLea if it has 3 sources (e.g., 2 sources +
immediate), regardless of whether it results from an LEA
instruction or not.

arith.divider_active
Cycles when divide unit is busy executing divide or square root
operations. Accounts for integer and floating-point operations.

l2_rqsts.demand_data_rd_miss
Counts the number of demand Data Read requests that miss L2 cache.
Only non-rejected loads are counted.

l2_rqsts.rfo_miss
Counts the RFO (Read-for-Ownership) requests that miss L2 cache.

l2_rqsts.code_rd_miss
Counts L2 cache misses when fetching instructions.

l2_rqsts.all_demand_miss
Demand requests that miss L2 cache.

l2_rqsts.pf_miss
Counts requests from the L1/L2/L3 hardware prefetchers or Load
software prefetches that miss L2 cache.

l2_rqsts.miss
All requests that miss L2 cache.

l2_rqsts.demand_data_rd_hit
Counts the number of demand Data Read requests, initiated by load
instructions, that hit the L2 cache.

l2_rqsts.rfo_hit
Counts the RFO (Read-for-Ownership) requests that hit L2 cache.

l2_rqsts.code_rd_hit
Counts L2 cache hits when fetching instructions, code reads.

l2_rqsts.pf_hit
Counts requests from the L1/L2/L3 hardware prefetchers or Load
software prefetches that hit L2 cache.

l2_rqsts.all_demand_data_rd
Counts the number of demand Data Read requests (including requests
from L1D hardware prefetchers). These loads may hit or miss L2
cache. Only non-rejected loads are counted.

l2_rqsts.all_rfo
Counts the total number of RFO (read for ownership) requests to L2
cache. L2 RFO requests include both L1D demand RFO misses as well
as L1D RFO prefetches.

l2_rqsts.all_code_rd
Counts the total number of L2 code requests.

l2_rqsts.all_demand_references
Demand requests to L2 cache.

l2_rqsts.all_pf
Counts the total number of requests from the L2 hardware
prefetchers.

l2_rqsts.references
All L2 requests.
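
A simple derived metric, an approximation rather than an event, is
the overall L2 miss ratio:

    l2_rqsts.miss / l2_rqsts.references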

longest_lat_cache.miss
Counts core-originated cacheable requests that miss the L3 cache
(Longest Latency cache). Requests include data and code reads,
Reads-for-Ownership (RFOs), speculative accesses and hardware
prefetches from L1 and L2. It does not include all misses to the
L3.

The following errata may apply to this: SKL057

longest_lat_cache.reference
Counts core-originated cacheable requests to the L3 cache (Longest
Latency cache). Requests include data and code reads, Reads-for-
Ownership (RFOs), speculative accesses and hardware prefetches from
L1 and L2. It does not include all accesses to the L3.

The following errata may apply to this: SKL057
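
Similarly, an approximate L3 miss ratio for core-originated
cacheable requests can be derived as:

    longest_lat_cache.miss / longest_lat_cache.reference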

sw_prefetch_access.nta
Number of PREFETCHNTA instructions executed.

sw_prefetch_access.t0
Number of PREFETCHT0 instructions executed.

sw_prefetch_access.t1_t2
Number of PREFETCHT1 or PREFETCHT2 instructions executed.

sw_prefetch_access.prefetchw
Number of PREFETCHW instructions executed.

cpu_clk_unhalted.thread_p
This is an architectural event that counts the number of thread
cycles while the thread is not in a halt state. The thread enters
the halt state when it is running the HLT instruction. The core
frequency may change from time to time due to power or thermal
throttling. For this reason, this event may have a changing ratio
with regard to wall clock time.
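
Combined with inst_retired.any_p (described below), this event
yields the usual cycles-per-instruction approximation:

    CPI = cpu_clk_unhalted.thread_p / inst_retired.any_p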

cpu_clk_unhalted.thread_p_any
Core cycles when at least one thread on the physical core is not in
halt state.

cpu_clk_unhalted.ring0_trans
Counts when the Current Privilege Level (CPL) transitions from ring
1, 2 or 3 to ring 0 (Kernel).

cpu_clk_thread_unhalted.ref_xclk
Core crystal clock cycles when the thread is unhalted.

cpu_clk_thread_unhalted.ref_xclk_any
Core crystal clock cycles when at least one thread on the physical
core is unhalted.

cpu_clk_unhalted.ref_xclk
Core crystal clock cycles when the thread is unhalted.

cpu_clk_unhalted.ref_xclk_any
Core crystal clock cycles when at least one thread on the physical
core is unhalted.

cpu_clk_thread_unhalted.one_thread_active
Core crystal clock cycles when this thread is unhalted and the
other thread is halted.

cpu_clk_unhalted.one_thread_active
Core crystal clock cycles when this thread is unhalted and the
other thread is halted.

l1d_pend_miss.pending
Counts the duration of L1D misses outstanding; that is, for each
cycle, the number of Fill Buffers (FB) outstanding that are
required by demand reads. An FB is counted if it is held by a
demand load, or if it is held by a non-demand load that is hit at
least once by a demand load. The outstanding interval runs until
the FB is deallocated: from FB allocation if the FB was allocated
by a demand load, or from the first demand hit if it was allocated
by a hardware or software prefetch. Note: in the L1D, a demand
read includes cacheable and noncacheable demand loads, including
those causing cache-line splits and reads due to page walks
resulting from any request type.

l1d_pend_miss.pending_cycles
Counts duration of L1D miss outstanding in cycles.
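
A derived approximation of the average number of L1D demand misses
outstanding during miss cycles is:

    l1d_pend_miss.pending / l1d_pend_miss.pending_cycles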

l1d_pend_miss.pending_cycles_any
Cycles with L1D load Misses outstanding from any thread on physical
core.

l1d_pend_miss.fb_full
Number of times a request needed a FB (Fill Buffer) entry but there
was no entry available for it. A request includes
cacheable/uncacheable demands that are load, store or SW prefetch
instructions.

dtlb_store_misses.miss_causes_a_walk
Counts demand data stores that caused a page walk of any page size
(4K/2M/4M/1G). This implies it missed in all TLB levels, but the
walk need not have completed.

dtlb_store_misses.walk_completed_4k
Counts completed page walks (4K sizes) caused by demand data
stores. This implies address translations missed in the DTLB and
further levels of TLB. The page walk can end with or without a
fault.

dtlb_store_misses.walk_completed_2m_4m
Counts completed page walks (2M/4M sizes) caused by demand data
stores. This implies address translations missed in the DTLB and
further levels of TLB. The page walk can end with or without a
fault.

dtlb_store_misses.walk_completed_1g
Counts completed page walks (1G sizes) caused by demand data
stores. This implies address translations missed in the DTLB and
further levels of TLB. The page walk can end with or without a
fault.

dtlb_store_misses.walk_completed
Counts completed page walks (all page sizes) caused by demand data
stores. This implies it missed in the DTLB and further levels of
TLB. The page walk can end with or without a fault.

dtlb_store_misses.walk_pending
Counts 1 per cycle for each PMH that is busy with a page walk for a
store. EPT page walk durations are excluded on the Skylake
microarchitecture.

dtlb_store_misses.walk_active
Counts cycles when at least one PMH (Page Miss Handler) is busy
with a page walk for a store.

dtlb_store_misses.stlb_hit
Stores that miss the DTLB (Data TLB) and hit the STLB (2nd Level
TLB).

load_hit_pre.sw_pf
Counts all non-software-prefetch load dispatches that hit the fill
buffer (FB) allocated for a software prefetch. It can also be
incremented by some lock instructions, so it should only be used
with profiling, so that the locks can be excluded by inspection of
the nearby instructions in the assembly.

ept.walk_pending
Counts cycles for each PMH (Page Miss Handler) that is busy with an
EPT (Extended Page Table) walk for any request type.

l1d.replacement
Counts L1D data line replacements including opportunistic
replacements, and replacements that require stall-for-replace or
block-for-replace.

tx_mem.abort_conflict
Number of times a TSX line had a cache conflict.

tx_mem.abort_capacity
Number of times a transactional abort was signaled due to a data
capacity limitation for transactional reads or writes.

tx_mem.abort_hle_store_to_elided_lock
Number of times a TSX Abort was triggered due to a non-
release/commit store to lock.

tx_mem.abort_hle_elision_buffer_not_empty
Number of times a TSX Abort was triggered due to commit but Lock
Buffer not empty.

tx_mem.abort_hle_elision_buffer_mismatch
Number of times a TSX Abort was triggered due to release/commit but
data and address mismatch.

tx_mem.abort_hle_elision_buffer_unsupported_alignment
Number of times a TSX Abort was triggered due to attempting an
unsupported alignment from Lock Buffer.

tx_mem.hle_elision_buffer_full
Number of times the Lock Buffer could not be allocated.

partial_rat_stalls.scoreboard
This event counts cycles during which microcode scoreboard stalls
occur.

tx_exec.misc1
Counts the number of times a class of instructions that may cause a
transactional abort was executed. Since this is the count of
execution, it may not always cause a transactional abort.

tx_exec.misc2
Unfriendly TSX abort triggered by a vzeroupper instruction.

tx_exec.misc3
Unfriendly TSX abort triggered by a nest count that is too deep.

tx_exec.misc4
RTM region detected inside HLE.

tx_exec.misc5
Counts the number of times an HLE XACQUIRE instruction was executed
inside an RTM transactional region.

rs_events.empty_cycles
Counts cycles during which the reservation station (RS) is empty
for the thread. Note: in single-thread mode, the inactive thread
should count 0. An empty RS is usually caused by severely costly
branch mispredictions or by allocator/front-end issues.

rs_events.empty_end
Counts end of periods where the Reservation Station (RS) was empty.
This can be useful for precisely locating front-end latency-bound
issues.

offcore_requests_outstanding.demand_data_rd
Counts the number of offcore outstanding Demand Data Read
transactions in the super queue (SQ) every cycle. A transaction is
considered to be in the Offcore outstanding state between L2 miss
and transaction completion sent to the requestor. See the
corresponding Umask under OFFCORE_REQUESTS. Note: a prefetch
promoted to Demand is counted from the promotion point.

offcore_requests_outstanding.cycles_with_demand_data_rd
Counts cycles when offcore outstanding Demand Data Read
transactions are present in the super queue (SQ). A transaction is
considered to be in the Offcore outstanding state between L2 miss
and transaction completion sent to requestor (SQ de-allocation).

offcore_requests_outstanding.demand_data_rd_ge_6
Cycles with at least 6 offcore outstanding Demand Data Read
transactions in the uncore queue.

offcore_requests_outstanding.demand_code_rd
Counts the number of offcore outstanding Code Reads transactions in
the super queue every cycle. The 'Offcore outstanding' state of the
transaction lasts from the L2 miss until the sending transaction
completion to requestor (SQ deallocation). See the corresponding
Umask under OFFCORE_REQUESTS.

offcore_requests_outstanding.cycles_with_demand_code_rd
Counts cycles when offcore outstanding Code Read transactions are
present in the super queue (SQ). A transaction is considered to be
in the Offcore outstanding state between L2 miss and transaction
completion sent to the requestor (SQ deallocation). See the
corresponding Umask under OFFCORE_REQUESTS.

offcore_requests_outstanding.demand_rfo
Counts the number of offcore outstanding RFO (store) transactions
in the super queue (SQ) every cycle. A transaction is considered to
be in the Offcore outstanding state between L2 miss and transaction
completion sent to requestor (SQ de-allocation). See corresponding
Umask under OFFCORE_REQUESTS.

offcore_requests_outstanding.cycles_with_demand_rfo
Counts cycles when offcore outstanding demand RFO transactions are
present in the super queue (SQ). A transaction is considered to be
in the Offcore outstanding state between L2 miss and transaction
completion sent to the requestor (SQ deallocation). See the
corresponding Umask under OFFCORE_REQUESTS.

offcore_requests_outstanding.all_data_rd
Counts the number of offcore outstanding cacheable Core Data Read
transactions in the super queue every cycle. A transaction is
considered to be in the Offcore outstanding state between L2 miss
and transaction completion sent to requestor (SQ de-allocation).
See corresponding Umask under OFFCORE_REQUESTS.

offcore_requests_outstanding.cycles_with_data_rd
Counts cycles when offcore outstanding cacheable Core Data Read
transactions are present in the super queue. A transaction is
considered to be in the Offcore outstanding state between L2 miss
and transaction completion sent to requestor (SQ de-allocation).
See corresponding Umask under OFFCORE_REQUESTS.

offcore_requests_outstanding.l3_miss_demand_data_rd
Counts the number of Offcore outstanding Demand Data Read requests
that miss the L3 cache in the superQ every cycle.

offcore_requests_outstanding.cycles_with_l3_miss_demand_data_rd
Cycles with at least 1 Demand Data Read request that misses the L3
cache in the superQ.

offcore_requests_outstanding.l3_miss_demand_data_rd_ge_6
Cycles with at least 6 Demand Data Read requests that miss L3 cache
in the superQ.

idq.mite_uops
Counts the number of uops delivered to Instruction Decode Queue
(IDQ) from the MITE path. Counting includes uops that may 'bypass'
the IDQ. This also means that uops are not being delivered from the
Decode Stream Buffer (DSB).

idq.mite_cycles
Counts cycles during which uops are being delivered to Instruction
Decode Queue (IDQ) from the MITE path. Counting includes uops that
may 'bypass' the IDQ.

idq.dsb_uops
Counts the number of uops delivered to Instruction Decode Queue
(IDQ) from the Decode Stream Buffer (DSB) path. Counting includes
uops that may 'bypass' the IDQ.

idq.dsb_cycles
Counts cycles during which uops are being delivered to Instruction
Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.
Counting includes uops that may 'bypass' the IDQ.

idq.ms_dsb_cycles
Counts cycles during which uops initiated by Decode Stream Buffer
(DSB) are being delivered to Instruction Decode Queue (IDQ) while
the Microcode Sequencer (MS) is busy. Counting includes uops that
may 'bypass' the IDQ.

idq.all_dsb_cycles_4_uops
Counts the number of cycles 4 uops were delivered to Instruction
Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path. Count
includes uops that may 'bypass' the IDQ.

idq.all_dsb_cycles_any_uops
Counts the number of cycles uops were delivered to Instruction
Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path. Count
includes uops that may 'bypass' the IDQ.

idq.ms_mite_uops
Counts the number of uops initiated by MITE and delivered to
Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS)
is busy. Counting includes uops that may 'bypass' the IDQ.

idq.all_mite_cycles_4_uops
Counts the number of cycles 4 uops were delivered to the
Instruction Decode Queue (IDQ) from the MITE (legacy decode
pipeline) path. Counting includes uops that may 'bypass' the IDQ.
During these cycles uops are not being delivered from the Decode
Stream Buffer (DSB).

idq.all_mite_cycles_any_uops
Counts the number of cycles uops were delivered to the Instruction
Decode Queue (IDQ) from the MITE (legacy decode pipeline) path.
Counting includes uops that may 'bypass' the IDQ. During these
cycles uops are not being delivered from the Decode Stream Buffer
(DSB).

idq.ms_cycles
Counts cycles during which uops are being delivered to Instruction
Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy.
Counting includes uops that may 'bypass' the IDQ. Uops may be
initiated by the Decode Stream Buffer (DSB) or MITE.

idq.ms_switches
Number of switches from DSB (Decode Stream Buffer) or MITE (legacy
decode pipeline) to the Microcode Sequencer.

idq.ms_uops
Counts the total number of uops delivered by the Microcode
Sequencer (MS). Any instruction over 4 uops will be delivered by
the MS. Some instructions such as transcendentals may additionally
generate uops from the MS.

icache_16b.ifdata_stall
Cycles where a code line fetch is stalled due to an L1 instruction
cache miss. The legacy decode pipeline works at a 16 Byte
granularity.

icache_64b.iftag_hit
Instruction fetch tag lookups that hit in the instruction cache
(L1I). Counts at 64-byte cache-line granularity.

icache_64b.iftag_miss
Instruction fetch tag lookups that miss in the instruction cache
(L1I). Counts at 64-byte cache-line granularity.

icache_64b.iftag_stall
Cycles where a code fetch is stalled due to L1 instruction cache
tag miss.

itlb_misses.miss_causes_a_walk
Counts page walks of any page size (4K/2M/4M/1G) caused by a code
fetch. This implies it missed in the ITLB and further levels of
TLB, but the walk need not have completed.

itlb_misses.walk_completed_4k
Counts completed page walks (4K page sizes) caused by a code fetch.
This implies it missed in the ITLB (Instruction TLB) and further
levels of TLB. The page walk can end with or without a fault.

itlb_misses.walk_completed_2m_4m
Counts completed page walks (2M/4M page sizes) caused by a code
fetch. This implies it missed in the ITLB (Instruction TLB) and
further levels of TLB. The page walk can end with or without a
fault.

itlb_misses.walk_completed_1g
Counts completed page walks (1G page sizes) caused by a code fetch.
This implies it missed in the ITLB (Instruction TLB) and further
levels of TLB. The page walk can end with or without a fault.

itlb_misses.walk_completed
Counts completed page walks (all page sizes) caused by a code
fetch. This implies it missed in the ITLB (Instruction TLB) and
further levels of TLB. The page walk can end with or without a
fault.

itlb_misses.walk_pending
Counts 1 per cycle for each PMH (Page Miss Handler) that is busy
with a page walk for an instruction fetch request. EPT page walk
durations are excluded on the Skylake microarchitecture.

itlb_misses.stlb_hit
Instruction fetch requests that miss the ITLB and hit the STLB.

ild_stall.lcp
Counts cycles during which Instruction Length Decoder (ILD) stalls
occurred due to a dynamically changing prefix length of the
decoded instruction (caused by the operand-size prefix 0x66, the
address-size prefix 0x67, or REX.W for Intel 64). The count is
proportional to the number of prefixes in a 16-byte line. This may
result in a three-cycle penalty for each LCP (length-changing
prefix) in a 16-byte chunk.

idq_uops_not_delivered.core
Counts the number of uops not delivered to the Resource Allocation
Table (RAT) per thread, adding 4 - x for each cycle in which the
RAT is not stalled and the Instruction Decode Queue (IDQ) delivers
x uops to the RAT (where x belongs to {0,1,2,3}). Counting does
not cover cases when: a. the IDQ-RAT pipe serves the other thread;
b. the RAT is stalled for the thread (including uop drops and
clear BE conditions); c. the IDQ delivers four uops.
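
This event underlies the common top-down front-end bound
approximation, which assumes 4 issue slots per cycle:

    idq_uops_not_delivered.core / (4 * cpu_clk_unhalted.thread_p)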

idq_uops_not_delivered.cycles_0_uops_deliv.core
Counts, on the per-thread basis, cycles when no uops are delivered
to Resource Allocation Table (RAT): IDQ_Uops_Not_Delivered.core = 4.

idq_uops_not_delivered.cycles_le_1_uop_deliv.core
Counts, on the per-thread basis, cycles when less than 1 uop is
delivered to Resource Allocation Table (RAT).
IDQ_Uops_Not_Delivered.core >= 3.

idq_uops_not_delivered.cycles_le_2_uop_deliv.core
Cycles with less than 2 uops delivered by the front-end.

idq_uops_not_delivered.cycles_le_3_uop_deliv.core
Cycles with less than 3 uops delivered by the front-end.

idq_uops_not_delivered.cycles_fe_was_ok
Counts cycles during which the front-end delivered 4 uops or the
Resource Allocation Table (RAT) was stalling the front-end.

uops_dispatched_port.port_0
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 0.

uops_dispatched_port.port_1
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 1.

uops_dispatched_port.port_2
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 2.

uops_dispatched_port.port_3
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 3.

uops_dispatched_port.port_4
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 4.

uops_dispatched_port.port_5
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 5.

uops_dispatched_port.port_6
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 6.

uops_dispatched_port.port_7
Counts, on the per-thread basis, cycles during which at least one
uop is dispatched from the Reservation Station (RS) to port 7.

resource_stalls.any
Counts resource-related stall cycles.

resource_stalls.sb
Counts allocation stall cycles caused by the store buffer (SB)
being full. This counts cycles that the pipeline back-end blocked
uop delivery from the front-end.

cycle_activity.cycles_l2_miss
Cycles while L2 cache miss demand load is outstanding.

cycle_activity.cycles_l3_miss
Cycles while L3 cache miss demand load is outstanding.

cycle_activity.stalls_total
Total execution stalls.

cycle_activity.stalls_l2_miss
Execution stalls while L2 cache miss demand load is outstanding.

cycle_activity.stalls_l3_miss
Execution stalls while L3 cache miss demand load is outstanding.

cycle_activity.cycles_l1d_miss
Cycles while L1 cache miss demand load is outstanding.

cycle_activity.stalls_l1d_miss
Execution stalls while L1 cache miss demand load is outstanding.

cycle_activity.cycles_mem_any
Cycles while memory subsystem has an outstanding load.

cycle_activity.stalls_mem_any
Execution stalls while memory subsystem has an outstanding load.

exe_activity.exe_bound_0_ports
Counts cycles during which no uops were executed on any port while
the Reservation Station (RS) was not empty.

exe_activity.1_ports_util
Counts cycles during which a total of 1 uop was executed across
all ports while the Reservation Station (RS) was not empty.

exe_activity.2_ports_util
Counts cycles during which a total of 2 uops were executed across
all ports while the Reservation Station (RS) was not empty.

exe_activity.3_ports_util
Counts cycles during which a total of 3 uops were executed across
all ports while the Reservation Station (RS) was not empty.

exe_activity.4_ports_util
Counts cycles during which a total of 4 uops were executed across
all ports while the Reservation Station (RS) was not empty.

exe_activity.bound_on_stores
Cycles where the Store Buffer was full and no load was outstanding.

lsd.uops
Number of uops delivered to the back-end by the LSD (Loop Stream
Detector).

lsd.cycles_active
Counts the cycles when at least one uop is delivered by the LSD
(Loop-stream detector).

dsb2mite_switches.count
This event counts the number of Decode Stream Buffer (DSB)-to-MITE
switches, including all misses because of a missing Decode Stream
Buffer (DSB) cache entry and u-arch forced misses. Note: invoking
MITE requires a two or three cycle delay.

dsb2mite_switches.penalty_cycles
Counts Decode Stream Buffer (DSB)-to-MITE switch true penalty
cycles. These cycles do not include uops routed through because of
the switch itself, for example, when Instruction Decode Queue (IDQ)
pre-allocation is unavailable, or the Instruction Decode Queue
(IDQ) is full. DSB-to-MITE switch true penalty cycles happen after
the merge mux (MM) receives the Decode Stream Buffer (DSB)
Sync-indication until receiving the first MITE uop. The MM is
placed before the Instruction Decode Queue (IDQ) to merge uops
being fed from the MITE and Decode Stream Buffer (DSB) paths. The
Decode Stream Buffer (DSB) inserts the Sync-indication whenever a
Decode Stream Buffer (DSB)-to-MITE switch occurs. Penalty: a
Decode Stream Buffer (DSB) hit followed by a Decode Stream Buffer
(DSB) miss can cost up to six cycles in which no uops are
delivered to the IDQ. Most often, such switches from the Decode
Stream Buffer (DSB) to the legacy pipeline cost 0-2 cycles.

itlb.itlb_flush
Counts the number of flushes of the big or small ITLB pages.
Counting includes both TLB Flush (covering all sets) and TLB Set
Clear (set-specific).

offcore_requests.demand_data_rd
Counts the Demand Data Read requests sent to the uncore. Use it in
conjunction with OFFCORE_REQUESTS_OUTSTANDING to determine average
latency in the uncore.
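
For example, the average occupancy of demand data reads in the
uncore, an approximation of their average latency in core cycles,
can be estimated as:

    offcore_requests_outstanding.demand_data_rd /
        offcore_requests.demand_data_rd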

offcore_requests.demand_code_rd
Counts both cacheable and non-cacheable code read requests.

offcore_requests.demand_rfo
Counts the demand RFO (read for ownership) requests including
regular RFOs, locks, ItoM.

offcore_requests.all_data_rd
Counts the demand and prefetch data reads. All Core Data Reads
include cacheable 'Demands' and L2 prefetchers (not L3
prefetchers). Counting also covers reads due to page walks
resulting from any request type.

offcore_requests.l3_miss_demand_data_rd
Demand Data Read requests that miss the L3 cache.

offcore_requests.all_requests
Counts memory transactions that reached the super queue, including
requests initiated by the core, all L3 prefetches, page walks,
etc.

uops_executed.thread
Number of uops to be executed per-thread each cycle.

uops_executed.stall_cycles
Counts cycles during which no uops were dispatched from the
Reservation Station (RS) per thread.

uops_executed.cycles_ge_1_uop_exec
Cycles where at least 1 uop was executed per-thread.

uops_executed.cycles_ge_2_uops_exec
Cycles where at least 2 uops were executed per-thread.

uops_executed.cycles_ge_3_uops_exec
Cycles where at least 3 uops were executed per-thread.

uops_executed.cycles_ge_4_uops_exec
Cycles where at least 4 uops were executed per-thread.

uops_executed.core
Number of uops executed from any thread.

uops_executed.core_cycles_ge_1
Cycles during which at least 1 micro-op is executed from any
thread on the physical core.

uops_executed.core_cycles_ge_2
Cycles during which at least 2 micro-ops are executed from any
thread on the physical core.

uops_executed.core_cycles_ge_3
Cycles during which at least 3 micro-ops are executed from any
thread on the physical core.

uops_executed.core_cycles_ge_4
Cycles during which at least 4 micro-ops are executed from any
thread on the physical core.

uops_executed.core_cycles_none
Cycles with no micro-ops executed from any thread on physical core.

uops_executed.x87
Counts the number of x87 uops executed.

offcore_requests_buffer.sq_full
Counts the number of cases when the offcore requests buffer cannot
take more entries for the core. This can happen when the
superqueue does not contain eligible entries, or when the L1D
writeback-pending FIFO is full. Note: the writeback-pending FIFO
has six entries.

tlb_flush.dtlb_thread
Counts the number of DTLB flush attempts of the thread-specific
entries.

tlb_flush.stlb_any
Counts the number of any STLB flush attempts (such as entire, VPID,
PCID, InvPage, CR3 write, etc.).

inst_retired.any_p
Counts the number of instructions (EOMs) retired. Counting covers
macro-fused instructions individually (that is, increments by two).

The following errata may apply to this: SKL091, SKL044

inst_retired.prec_dist
A version of INST_RETIRED that allows for a more unbiased
distribution of samples across instructions retired. It utilizes
the Precise Distribution of Instructions Retired (PDIR) feature to
mitigate some bias in how retired instructions get sampled.

The following errata may apply to this: SKL091, SKL044

inst_retired.total_cycles_ps
Number of cycles, using an always-true condition (inst_ret < 16)
applied to the PEBS instructions-retired event.

The following errata may apply to this: SKL091, SKL044

other_assists.any
Number of times a microcode assist is invoked by HW other than FP-
assist. Examples include AD (page Access Dirty) and AVX* related
assists.

uops_retired.retire_slots
Counts the retirement slots used.

uops_retired.stall_cycles
This event counts cycles during which no uops were actually
retired.

uops_retired.total_cycles
Number of cycles, using an always-true condition (uops_ret < 16)
applied to the non-PEBS uops-retired event.

uops_retired.macro_fused
Counts the number of macro-fused uops retired (non-precise).

machine_clears.count
Number of machine clears (nukes) of any type.

machine_clears.memory_ordering
Counts the number of memory ordering Machine Clears detected.
Memory Ordering Machine Clears can result from one of the
following: a. memory disambiguation; b. an external snoop; or c. a
cross SMT-HW-thread snoop (stores) hitting the load buffer.

The following errata may apply to this: SKL089

machine_clears.smc
Counts self-modifying code (SMC) detected, which causes a machine
clear.

br_inst_retired.all_branches
Counts all (macro) branch instructions retired.

The following errata may apply to this: SKL091

br_inst_retired.conditional
This event counts conditional branch instructions retired.

The following errata may apply to this: SKL091

br_inst_retired.near_call
This event counts both direct and indirect near call instructions
retired.

The following errata may apply to this: SKL091

br_inst_retired.all_branches_pebs
This is a precise version of BR_INST_RETIRED.ALL_BRANCHES that
counts all (macro) branch instructions retired.

The following errata may apply to this: SKL091

br_inst_retired.near_return
This event counts return instructions retired.

The following errata may apply to this: SKL091

br_inst_retired.not_taken
This event counts not taken branch instructions retired.

The following errata may apply to this: SKL091

br_inst_retired.cond_ntaken
This event counts not taken conditional branch instructions
retired.

The following errata may apply to this: SKL091

br_inst_retired.near_taken
This event counts taken branch instructions retired.

The following errata may apply to this: SKL091

br_inst_retired.far_branch
This event counts far branch instructions retired.

The following errata may apply to this: SKL091

br_misp_retired.all_branches
Counts all the retired branch instructions that were mispredicted
by the processor. A branch misprediction occurs when the processor
incorrectly predicts the destination of the branch. When the
misprediction is discovered at execution, all the instructions
executed in the wrong (speculative) path must be discarded, and the
processor must start fetching from the correct path.
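
A common derived metric, an approximation rather than an event, is
the overall branch misprediction ratio:

    br_misp_retired.all_branches / br_inst_retired.all_branches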

br_misp_retired.conditional
This event counts mispredicted conditional branch instructions
retired.

br_misp_retired.near_call
Counts both taken and not taken retired mispredicted direct and
indirect near calls, including both register and memory indirect.

br_misp_retired.all_branches_pebs
This is a precise version of BR_MISP_RETIRED.ALL_BRANCHES that
counts all mispredicted macro branch instructions retired.

br_misp_retired.near_taken
Number of near branch instructions retired that were mispredicted
and taken.

fp_arith_inst_retired.scalar_double
Number of SSE/AVX computational scalar double precision floating-
point instructions retired; some instructions will count twice as
noted below. Each count represents 1 computational operation.
Applies to SSE* and AVX* scalar double precision floating-point
instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB.
FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.

fp_arith_inst_retired.scalar_single
Number of SSE/AVX computational scalar single precision floating-
point instructions retired; some instructions will count twice as
noted below. Each count represents 1 computational operation.
Applies to SSE* and AVX* scalar single precision floating-point
instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB.
FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.

fp_arith_inst_retired.128b_packed_double
Number of SSE/AVX computational 128-bit packed double precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 2 computation
operations, one for each element. Applies to SSE* and AVX* packed
double precision floating-point instructions: ADD SUB HADD HSUB
SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB
instructions count twice as they perform 2 calculations per
element.

fp_arith_inst_retired.128b_packed_single
Number of SSE/AVX computational 128-bit packed single precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 4 computation
operations, one for each element. Applies to SSE* and AVX* packed
single precision floating-point instructions: ADD SUB HADD HSUB
SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and
FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.

fp_arith_inst_retired.256b_packed_double
Number of SSE/AVX computational 256-bit packed double precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 4 computation
operations, one for each element. Applies to SSE* and AVX* packed
double precision floating-point instructions: ADD SUB HADD HSUB
SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB
instructions count twice as they perform 2 calculations per
element.

fp_arith_inst_retired.256b_packed_single
Number of SSE/AVX computational 256-bit packed single precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 8 computation
operations, one for each element. Applies to SSE* and AVX* packed
single precision floating-point instructions: ADD SUB HADD HSUB
SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and
FM(N)ADD/SUB instructions count twice as they perform 2
calculations per element.
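
Taken together, and weighting each event by the number of elements
it represents (FM(N)ADD/SUB operations are already counted twice
by the hardware), these six events give an approximate count of
retired floating-point operations (event names abbreviated to
their umask suffixes):

    FLOPs = scalar_single + scalar_double
          + 4 * 128b_packed_single + 2 * 128b_packed_double
          + 8 * 256b_packed_single + 4 * 256b_packed_double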

hle_retired.start
Number of times we entered an HLE region. Does not count nested
transactions.

hle_retired.commit
Number of times HLE commit succeeded.

hle_retired.aborted
Number of times HLE abort was triggered.

hle_retired.aborted_mem
Number of times an HLE execution aborted due to various memory
events (e.g., read/write capacity and conflicts).

hle_retired.aborted_timer
Number of times an HLE execution aborted due to hardware timer
expiration.

hle_retired.aborted_unfriendly
Number of times an HLE execution aborted due to HLE-unfriendly
instructions and certain unfriendly events (such as AD assists
etc.).

hle_retired.aborted_memtype
Number of times an HLE execution aborted due to incompatible memory
type.

hle_retired.aborted_events
Number of times an HLE execution aborted due to unfriendly events
(such as interrupts).

rtm_retired.start
Number of times we entered an RTM region. Does not count nested
transactions.

rtm_retired.commit
Number of times RTM commit succeeded.

rtm_retired.aborted
Number of times RTM abort was triggered.
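
A useful derived ratio, an approximation rather than an event, is
the transactional abort rate:

    rtm_retired.aborted / rtm_retired.start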

rtm_retired.aborted_mem
Number of times an RTM execution aborted due to various memory
events (e.g. read/write capacity and conflicts).

rtm_retired.aborted_timer
Number of times an RTM execution aborted due to uncommon
conditions.

rtm_retired.aborted_unfriendly
Number of times an RTM execution aborted due to HLE-unfriendly
instructions.

rtm_retired.aborted_memtype
Number of times an RTM execution aborted due to incompatible memory
type.

rtm_retired.aborted_events
Number of times an RTM execution aborted due to none of the
previous 4 categories (e.g. interrupt).

fp_assist.any
Counts cycles with any input and output SSE or x87 FP assist. If an
input and output assist are detected on the same cycle the event
increments by 1.

hw_interrupts.received
Counts the number of hardware interrupts received by the
processor.

rob_misc_events.lbr_inserts
Increments when an entry is added to the Last Branch Record (LBR)
array (or removed from the array in case of RETURNs in call stack
mode). The event requires LBR enable via IA32_DEBUGCTL MSR and
branch type selection via MSR_LBR_SELECT.

rob_misc_events.pause_inst
Number of retired PAUSE instructions (that do not end up with a
VMExit to the VMM; TSX aborted Instructions may be counted). This
event is not supported on first SKL and KBL products.

mem_inst_retired.stlb_miss_loads
Retired load instructions that miss the STLB.

mem_inst_retired.stlb_miss_stores
Retired store instructions that miss the STLB.

mem_inst_retired.lock_loads
Retired load instructions with locked access.

mem_inst_retired.split_loads
Counts retired load instructions that split across a cacheline
boundary.

mem_inst_retired.split_stores
Counts retired store instructions that split across a cacheline
boundary.

mem_inst_retired.all_loads
All retired load instructions.

mem_inst_retired.all_stores
All retired store instructions.

mem_load_retired.l1_hit
Counts retired load instructions with at least one uop that hit in
the L1 data cache. This event includes all SW prefetches and lock
instructions regardless of the data source.

mem_load_retired.l2_hit
Retired load instructions with L2 cache hits as data sources.

mem_load_retired.l3_hit
Counts retired load instructions with at least one uop that hit in
the L3 cache.

mem_load_retired.l1_miss
Counts retired load instructions with at least one uop that missed
in the L1 cache.

mem_load_retired.l2_miss
Retired load instructions with L2 cache misses as data sources.

mem_load_retired.l3_miss
Counts retired load instructions with at least one uop that missed
in the L3 cache.

mem_load_retired.fb_hit
Counts retired load instructions with at least one uop that missed
in the L1 data cache but hit an FB (Fill Buffer), due to a
preceding miss to the same cache line with the data not yet ready.

mem_load_l3_hit_retired.xsnp_miss
Retired load instructions whose data source was an L3 hit for
which the cross-core snoop missed in an on-package core cache.

mem_load_l3_hit_retired.xsnp_hit
Retired load instructions whose data source was an L3 hit for
which the cross-core snoop hit in an on-package core cache.

mem_load_l3_hit_retired.xsnp_hitm
Retired load instructions whose data source was a HitM response
from the shared L3.

mem_load_l3_hit_retired.xsnp_none
Retired load instructions whose data source was a hit in the L3
with no snoop required.

mem_load_misc_retired.uc
Retired instructions with at least 1 uncacheable load or lock.

baclears.any
Counts the number of times the front-end is resteered when it finds
a branch instruction in a fetch line. This occurs for the first
time a branch instruction is fetched or when the branch is not
tracked by the BPU (Branch Prediction Unit) anymore.

l2_trans.l2_wb
Counts L2 writebacks that access L2 cache.

l2_lines_in.all
Counts the number of L2 cache lines filling the L2. Counting does
not cover rejects.

l2_lines_out.silent
Counts the number of lines that are silently dropped by L2 cache
when triggered by an L2 cache fill. These lines are typically in
Shared or Exclusive state. A non-threaded event.

l2_lines_out.non_silent
Counts the number of lines that are evicted by L2 cache when
triggered by an L2 cache fill. Those lines are in Modified state.
Modified lines are written back to the L3.

l2_lines_out.useless_pref
This event is deprecated. Refer to the new event
L2_LINES_OUT.USELESS_HWPF.

l2_lines_out.useless_hwpf
Counts the number of lines that have been hardware prefetched but
not used and are now evicted by the L2 cache.

sq_misc.split_lock
Counts the number of cache line split locks sent to the uncore.

SEE ALSO


cpc(3CPC)

https://download.01.org/perfmon/index/

illumos June 18, 2018 illumos