GLP_EVENTS(3CPC) CPU Performance Counters Library Functions GLP_EVENTS(3CPC)

NAME


glp_events - processor model specific performance counter events

DESCRIPTION


This manual page describes events specific to the following Intel CPU
models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.

CPU models described by this document:

+o Family 0x6, Model 0x7a

The following events are supported:
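
Each of these event names can be passed to the cpc(3CPC) family of
interfaces. The following is a minimal sketch (with abbreviated error
handling) that counts one event from this list for the calling LWP
using libcpc; the chosen event and the workload are placeholders:

      #include <stdio.h>
      #include <inttypes.h>
      #include <libcpc.h>

      int
      main(void)
      {
              cpc_t *cpc;
              cpc_set_t *set;
              cpc_buf_t *buf;
              int ind;
              uint64_t val;

              if ((cpc = cpc_open(CPC_VER_CURRENT)) == NULL ||
                  (set = cpc_set_create(cpc)) == NULL)
                      return (1);

              /* Request user-mode counts of a single event. */
              if ((ind = cpc_set_add_request(cpc, set,
                  "ld_blocks.store_forward", 0, CPC_COUNT_USER,
                  0, NULL)) == -1)
                      return (1);

              if ((buf = cpc_buf_create(cpc, set)) == NULL ||
                  cpc_bind_curlwp(cpc, set, 0) != 0)
                      return (1);

              /* ... run the workload to be measured here ... */

              if (cpc_set_sample(cpc, set, buf) != 0)
                      return (1);
              cpc_buf_get(cpc, buf, ind, &val);
              (void) printf("ld_blocks.store_forward: %" PRIu64 "\n",
                  val);

              (void) cpc_unbind(cpc, set);
              (void) cpc_close(cpc);
              return (0);
      }

Compile and link with -lcpc. Any event below may be substituted,
subject to the counter constraints described in cpc(3CPC).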

ld_blocks.data_unknown
Counts loads blocked from using a store forward because the store
data was not available at the right time. The forward might occur
subsequently when the data is available.

ld_blocks.store_forward
Counts loads blocked from using a store forward because of an
address/size mismatch. Only one of the loads blocked by each store
will be counted.

ld_blocks.4k_alias
Counts loads that block because their address modulo 4K matches a
pending store.

ld_blocks.utlb_miss
Counts loads blocked because they are unable to find their physical
address in the micro TLB (UTLB).

ld_blocks.all_block
Counts any time a load that retires was blocked for any reason.

dtlb_load_misses.walk_completed_4k
Counts page walks completed due to demand data loads (including SW
prefetches) whose address translations missed in all TLB levels and
were mapped to 4K pages. The page walks can end with or without a
page fault.

dtlb_load_misses.walk_completed_2m_4m
Counts page walks completed due to demand data loads (including SW
prefetches) whose address translations missed in all TLB levels and
were mapped to 2M or 4M pages. The page walks can end with or
without a page fault.

dtlb_load_misses.walk_completed_1gb
Counts page walks completed due to demand data loads (including SW
prefetches) whose address translations missed in all TLB levels and
were mapped to 1GB pages. The page walks can end with or without a
page fault.

dtlb_load_misses.walk_pending
Counts once per cycle for each page walk occurring due to a load
(demand data loads or SW prefetches). Includes cycles spent
traversing the Extended Page Table (EPT). Average cycles per walk
can be calculated by dividing by the number of walks.
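
For example, pairing this event with the walk_completed_* events in
the same set allows the average walk latency to be derived. A small
sketch, assuming the counts were sampled as in the example above and
that the walk count is the sum of the three walk_completed_* events
(the helper name is illustrative):

      #include <stdint.h>

      /* Average cycles per completed page walk. */
      static double
      avg_cycles_per_walk(uint64_t pending_cycles, uint64_t walks)
      {
              return (walks == 0 ? 0.0 :
                  (double)pending_cycles / (double)walks);
      }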

uops_issued.any
Counts uops issued by the front end and allocated into the back end
of the machine. This event counts uops that retire as well as uops
that were speculatively executed but didn't retire. The sort of
speculative uops that might be counted includes, but is not limited
to, those uops issued in the shadow of a mispredicted branch, those
uops that are inserted during an assist (such as for a denormal
floating point result), and (previously allocated) uops that might
be canceled during a machine clear.

misalign_mem_ref.load_page_split
Counts retired memory load uops that span a page boundary (a
split).

misalign_mem_ref.store_page_split
Counts retired memory store uops that span a page boundary (a
split).

longest_lat_cache.miss
Counts memory requests originating from the core that miss in the
L2 cache.

longest_lat_cache.reference
Counts memory requests originating from the core that reference a
cache line in the L2 cache.

l2_reject_xq.all
Counts the number of demand and prefetch transactions that the L2
XQ rejects due to a full or near-full condition, which likely
indicates back pressure from the intra-die interconnect (IDI)
fabric. The XQ may reject transactions from the L2Q (non-cacheable
requests), L2 misses, and L2 write-back victims.

core_reject_l2q.all
Counts the number of demand and L1 prefetcher requests rejected by
the L2Q due to a full or nearly full condition, which likely
indicates back pressure from the L2Q. It also counts requests that
would have gone directly to the XQ, but are rejected due to a full
or nearly full condition, indicating back pressure from the IDI
link. The L2Q may also reject transactions from a core to ensure
fairness between cores, or to delay a core's dirty eviction when
the address conflicts with incoming external snoops.

cpu_clk_unhalted.core_p
Core cycles when the core is not halted. This event uses a
(_P)rogrammable general purpose performance counter.

cpu_clk_unhalted.ref
Reference cycles when the core is not halted. This event uses a
(_P)rogrammable general purpose performance counter.

dtlb_store_misses.walk_completed_4k
Counts page walks completed due to demand data stores whose address
translations missed in the TLB and were mapped to 4K pages. The
page walks can end with or without a page fault.

dtlb_store_misses.walk_completed_2m_4m
Counts page walks completed due to demand data stores whose address
translations missed in the TLB and were mapped to 2M or 4M pages.
The page walks can end with or without a page fault.

dtlb_store_misses.walk_completed_1gb
Counts page walks completed due to demand data stores whose address
translations missed in the TLB and were mapped to 1GB pages. The
page walks can end with or without a page fault.

dtlb_store_misses.walk_pending
Counts once per cycle for each page walk occurring due to a demand
data store. Includes cycles spent traversing the Extended Page
Table (EPT). Average cycles per walk can be calculated by dividing
by the number of walks.

ept.walk_pending
Counts once per cycle for each page walk only while traversing the
Extended Page Table (EPT), and does not count during the rest of
the translation. The EPT is used for translating Guest-Physical
Addresses to Physical Addresses for Virtual Machine Monitors
(VMMs). Average cycles per walk can be calculated by dividing the
count by the number of walks.

dl1.replacement
Counts when a modified (dirty) cache line is evicted from the data
L1 cache and needs to be written back to memory. No count will
occur if the evicted line is clean, and hence does not require a
writeback.

icache.hit
Counts requests to the Instruction Cache (ICache) for one or more
bytes in an ICache Line and that cache line is in the ICache (hit).
The event strives to count on a cache line basis, so that multiple
accesses which hit in a single cache line count as one ICACHE.HIT.
Specifically, the event counts when straight line code crosses the
cache line boundary, or when a branch target is to a new line, and
that cache line is in the ICache. This event counts differently
than on Intel processors based on the Silvermont microarchitecture.

icache.misses
Counts requests to the Instruction Cache (ICache) for one or more
bytes in an ICache Line and that cache line is not in the ICache
(miss). The event strives to count on a cache line basis, so that
multiple accesses which miss in a single cache line count as one
ICACHE.MISS. Specifically, the event counts when straight line
code crosses the cache line boundary, or when a branch target is to
a new line, and that cache line is not in the ICache. This event
counts differently than on Intel processors based on the Silvermont
microarchitecture.

icache.accesses
Counts requests to the Instruction Cache (ICache) for one or more
bytes in an ICache Line. The event strives to count on a cache
line basis, so that multiple fetches to a single cache line count
as one ICACHE.ACCESS. Specifically, the event counts when accesses
from straight line code cross the cache line boundary, or when a
branch target is to a new line. This event counts differently than
on Intel processors based on the Silvermont microarchitecture.
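
Because ICACHE.MISSES and ICACHE.ACCESSES count on the same cache
line basis, their ratio yields an ICache miss rate. A sketch
extending the example above; the request indices and variable names
are illustrative:

      int miss_ind, acc_ind;
      uint64_t misses, accesses;

      miss_ind = cpc_set_add_request(cpc, set, "icache.misses",
          0, CPC_COUNT_USER, 0, NULL);
      acc_ind = cpc_set_add_request(cpc, set, "icache.accesses",
          0, CPC_COUNT_USER, 0, NULL);
      /* ... bind, run the workload, and sample as before ... */
      cpc_buf_get(cpc, buf, miss_ind, &misses);
      cpc_buf_get(cpc, buf, acc_ind, &accesses);
      if (accesses != 0)
              (void) printf("icache miss rate: %.1f%%\n",
                  100.0 * (double)misses / (double)accesses);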

itlb.miss
Counts the number of times the machine was unable to find a
translation in the Instruction Translation Lookaside Buffer (ITLB)
for a linear address of an instruction fetch. It counts when new
translations are filled into the ITLB. The event is speculative in
nature, but will not count translations (page walks) that are begun
and not finished, or translations that are finished but not filled
into the ITLB.

itlb_misses.walk_completed_4k
Counts page walks completed due to instruction fetches whose
address translations missed in the TLB and were mapped to 4K pages.
The page walks can end with or without a page fault.

itlb_misses.walk_completed_2m_4m
Counts page walks completed due to instruction fetches whose
address translations missed in the TLB and were mapped to 2M or 4M
pages. The page walks can end with or without a page fault.

itlb_misses.walk_completed_1gb
Counts page walks completed due to instruction fetches whose
address translations missed in the TLB and were mapped to 1GB
pages. The page walks can end with or without a page fault.

itlb_misses.walk_pending
Counts once per cycle for each page walk occurring due to an
instruction fetch. Includes cycles spent traversing the Extended
Page Table (EPT). Average cycles per walk can be calculated by
dividing by the number of walks.

fetch_stall.all
Counts cycles that fetch is stalled for any reason. That is, the
decoder queue is able to accept bytes, but the fetch unit is unable
to provide bytes. This will include cycles due to an ITLB miss, an
ICache miss, and other events.

fetch_stall.itlb_fill_pending_cycles
Counts cycles that fetch is stalled due to an outstanding ITLB
miss. That is, the decoder queue is able to accept bytes, but the
fetch unit is unable to provide bytes due to an ITLB miss. Note:
this event is not the same as page walk cycles to retrieve an
instruction translation.

fetch_stall.icache_fill_pending_cycles
Counts cycles that fetch is stalled due to an outstanding ICache
miss. That is, the decoder queue is able to accept bytes, but the
fetch unit is unable to provide bytes due to an ICache miss. Note:
this event is not the same as the total number of cycles spent
retrieving instruction cache lines from the memory hierarchy.

uops_not_delivered.any
This event is used to measure front-end inefficiencies, i.e., when
the front-end of the machine is not delivering uops to the back-end
and the back-end is not stalled. This event can be used to identify
if the machine is truly front-end bound. When this event occurs, it
is an indication that the front-end of the machine is operating at
less than its theoretical peak performance. Background: We can
think of the processor pipeline as being divided into two broad
parts: the front-end and the back-end. The front-end is
responsible for fetching instructions, decoding them into uops in a
machine-understandable format, and putting them into a uop queue to
be consumed by the back-end. The back-end then takes these uops and
allocates the required resources. When all resources are ready,
uops are executed. If the back-end is not ready to accept uops from
the front-end, then we do not want to count these as front-end
bottlenecks. However, whenever we have bottlenecks in the back-end,
we will have allocation unit stalls that eventually force the
front-end to wait until the back-end is ready to receive more uops.
This event counts only when the back-end is requesting more uops
and the front-end is not able to provide them. When 3 uops are
requested and no uops are delivered, the event counts 3. When 3 are
requested and only 1 is delivered, the event counts 2. When only 2
are delivered, the event counts 1. Alternatively stated, the event
will not count if 3 uops are delivered, or if the back-end is
stalled and not requesting any uops at all. Counts indicate missed
opportunities for the front-end to deliver a uop to the back-end.
Some examples of conditions that cause front-end inefficiencies
are: ICache misses, ITLB misses, and decoder restrictions that
limit the front-end bandwidth. Known Issues: Some uops require
multiple allocation slots. These uops will not be charged as a
front-end 'not delivered' opportunity, and will be regarded as a
back-end problem. For example, the INC instruction has one uop that
requires 2 issue slots. A stream of INC instructions will not count
as UOPS_NOT_DELIVERED, even though only one instruction can be
issued per clock. The low uop issue rate for a stream of INC
instructions is considered to be a back-end issue.
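
Given the three issue slots per cycle implied by the description
above, a front-end bound fraction can be estimated by comparing this
event against the total issue slots in an interval. A minimal
sketch, assuming uops_not_delivered.any and cpu_clk_unhalted.core_p
were sampled from the same bound set (the helper name is
illustrative):

      #include <stdint.h>

      /*
       * Fraction of issue slots (3 per core cycle, per the
       * description above) on which the back-end wanted uops but
       * the front-end delivered none.
       */
      static double
      frontend_bound(uint64_t not_delivered, uint64_t core_cycles)
      {
              return (core_cycles == 0 ? 0.0 :
                  (double)not_delivered / (3.0 * (double)core_cycles));
      }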

tlb_flushes.stlb_any
Counts STLB flushes. The TLBs are flushed on instructions like
INVLPG and MOV to CR3.

inst_retired.any_p
Counts the number of instructions that retire execution. For
instructions that consist of multiple uops, this event counts the
retirement of the last uop of the instruction. The event continues
counting during hardware interrupts, traps, and inside interrupt
handlers. This is an architectural performance event. This event
uses a (_P)rogrammable general purpose performance counter. This
event is Precise Event capable: The EventingRIP field in the PEBS
record is precise to the address of the instruction which caused
the event. Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.
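
Combined with cpu_clk_unhalted.core_p from the same set, this event
yields cycles per instruction (CPI). A minimal sketch; the helper
name is illustrative:

      #include <stdint.h>

      /* Cycles per retired instruction. */
      static double
      cpi(uint64_t core_cycles, uint64_t insts_retired)
      {
              return (insts_retired == 0 ? 0.0 :
                  (double)core_cycles / (double)insts_retired);
      }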

inst_retired.prec_dist
Counts INST_RETIRED.ANY using the Reduced Skid PEBS feature that
reduces the shadow in which events aren't counted, allowing for a
more unbiased distribution of samples across instructions retired.

uops_retired.any
Counts uops which retired.

uops_retired.ms
Counts uops retired that are from the complex flows issued by the
micro-sequencer (MS). Counts both the uops from a micro-coded
instruction, and the uops that might be generated from a micro-
coded assist.

uops_retired.fpdiv
Counts the number of floating point divide uops retired.

uops_retired.idiv
Counts the number of integer divide uops retired.

machine_clears.all
Counts machine clears for any reason.

machine_clears.smc
Counts the number of times that the processor detects that a
program is writing to a code section and has to perform a machine
clear because of that modification. Self-modifying code (SMC)
causes a severe penalty in all Intel(R) architecture processors.

machine_clears.memory_ordering
Counts machine clears due to memory ordering issues. This occurs
when a snoop request happens and the machine is uncertain if memory
ordering will be preserved, as another core is in the process of
modifying the data.

machine_clears.fp_assist
Counts machine clears due to floating point (FP) operations needing
assists. For instance, if the result was a floating point
denormal, the hardware clears the pipeline and reissues uops to
produce the correct IEEE compliant denormal result.

machine_clears.disambiguation
Counts machine clears due to memory disambiguation. Memory
disambiguation happens when a load which has been issued conflicts
with a previous unretired store in the pipeline whose address was
not known at issue time, but is later resolved to be the same as
the load address.

machine_clears.page_fault
Counts the number of times that the machine clears due to a page
fault. Covers both I-side and D-side (loads/stores) page faults.
A page fault occurs when either the page is not present or an
access violation occurs.

br_inst_retired.all_branches
Counts branch instructions retired for all branch types. This is
an architectural performance event.

br_inst_retired.jcc
Counts retired Jcc (Jump on Conditional Code/Jump if Condition is
Met) branch instructions, including both when the branch was taken
and when it was not taken.

br_inst_retired.all_taken_branches
Counts the number of taken branch instructions retired.

br_inst_retired.far_branch
Counts far branch instructions retired. This includes far jump,
far call and return, and Interrupt call and return.

br_inst_retired.non_return_ind
Counts near indirect call or near indirect jmp branch instructions
retired.

br_inst_retired.return
Counts near return branch instructions retired.

br_inst_retired.call
Counts near CALL branch instructions retired.

br_inst_retired.ind_call
Counts near indirect CALL branch instructions retired.

br_inst_retired.rel_call
Counts near relative CALL branch instructions retired.

br_inst_retired.taken_jcc
Counts Jcc (Jump on Conditional Code/Jump if Condition is Met)
branch instructions retired that were taken; it does not count Jcc
branch instructions that were not taken.

br_misp_retired.all_branches
Counts mispredicted branch instructions retired, including all
branch types.
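
A common derived metric pairs this event with inst_retired.any_p to
compute mispredictions per thousand instructions (MPKI). A minimal
sketch, assuming both counts were sampled from the same set; the
helper name is illustrative:

      #include <stdint.h>

      /* Branch mispredictions per 1000 retired instructions. */
      static double
      branch_mpki(uint64_t mispredicts, uint64_t insts_retired)
      {
              return (insts_retired == 0 ? 0.0 :
                  1000.0 * (double)mispredicts /
                  (double)insts_retired);
      }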

br_misp_retired.jcc
Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if
Condition is Met) branch instructions, including both when the
branch was supposed to be taken and when it was not supposed to be
taken (the processor predicted the opposite condition).

br_misp_retired.non_return_ind
Counts mispredicted branch instructions retired that were near
indirect call or near indirect jmp, where the target address taken
was not what the processor predicted.

br_misp_retired.return
Counts mispredicted near RET branch instructions retired, where the
return address taken was not what the processor predicted.

br_misp_retired.ind_call
Counts mispredicted near indirect CALL branch instructions retired,
where the target address taken was not what the processor
predicted.

br_misp_retired.taken_jcc
Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if
Condition is Met) branch instructions that were supposed to be
taken but that the processor predicted would not be taken.

issue_slots_not_consumed.any
Counts the number of issue slots per core cycle that were not
consumed by the backend due to either a full resource in the
backend (RESOURCE_FULL) or due to the processor recovering from
some event (RECOVERY).

issue_slots_not_consumed.resource_full
Counts the number of issue slots per core cycle that were not
consumed because of a full resource in the backend. This includes,
but is not limited to, resources such as the Re-order Buffer (ROB),
reservation stations (RS), load/store buffers, physical registers,
or any other needed machine resource that is currently unavailable.
Note that uops must be available for consumption in order for this
event to fire. If a uop is not available (Instruction Queue is
empty), this event will not count.

issue_slots_not_consumed.recovery
Counts the number of issue slots per core cycle that were not
consumed by the backend because allocation is stalled waiting for a
mispredicted jump to retire or other branch-like conditions (e.g.
the event is relevant during certain microcode flows). Counts all
issue slots blocked while within this window including slots where
uops were not available in the Instruction Queue.

hw_interrupts.received
Counts hardware interrupts received by the processor.

hw_interrupts.masked
Counts the number of core cycles during which interrupts are masked
(disabled). Increments by 1 each core cycle that EFLAGS.IF is 0,
regardless of whether interrupts are pending or not.

hw_interrupts.pending_and_masked
Counts core cycles during which there are pending interrupts, but
interrupts are masked (EFLAGS.IF = 0).

cycles_div_busy.all
Counts core cycles if either divide unit is busy.

cycles_div_busy.idiv
Counts core cycles the integer divide unit is busy.

cycles_div_busy.fpdiv
Counts core cycles the floating point divide unit is busy.

mem_uops_retired.dtlb_miss_loads
Counts load uops retired that caused a DTLB miss.

mem_uops_retired.dtlb_miss_stores
Counts store uops retired that caused a DTLB miss.

mem_uops_retired.dtlb_miss
Counts uops retired that had a DTLB miss on load, store or either.
Note that when two distinct memory operations to the same page miss
the DTLB, only one of them will be recorded as a DTLB miss.

mem_uops_retired.lock_loads
Counts locked memory uops retired. This includes regular locks and
bus locks. (To specifically count bus locks only, see the Offcore
response event.) A locked access is one with a lock prefix, or an
exchange to memory. See the SDM for a complete description of
which memory load accesses are locks.

mem_uops_retired.split_loads
Counts load uops retired where the data requested spans a 64 byte
cache line boundary.

mem_uops_retired.split_stores
Counts store uops retired where the data requested spans a 64 byte
cache line boundary.

mem_uops_retired.split
Counts memory uops retired where the data requested spans a 64 byte
cache line boundary.

mem_uops_retired.all_loads
Counts the number of load uops retired.

mem_uops_retired.all_stores
Counts the number of store uops retired.

mem_uops_retired.all
Counts the number of memory uops retired that are either a load or
a store, or both.

mem_load_uops_retired.l1_hit
Counts load uops retired that hit the L1 data cache.

mem_load_uops_retired.l2_hit
Counts load uops retired that hit in the L2 cache.

mem_load_uops_retired.l1_miss
Counts load uops retired that miss the L1 data cache.

mem_load_uops_retired.l2_miss
Counts load uops retired that miss in the L2 cache.

mem_load_uops_retired.hitm
Counts load uops retired where the cache line containing the data
was in the modified state of another core's or module's cache
(HITM). More specifically, this means that when the load address
was checked by other caching agents (typically another processor)
in the system, one of those caching agents indicated that they had
a dirty copy of the data. Loads that obtain a HITM response incur
greater latency than is typical for a load. In addition, since
HITM indicates that some other processor had this data in its
cache, it implies that the data was shared between processors, or
potentially was a lock or semaphore value. This event is useful
for locating sharing, false sharing, and contended locks.

mem_load_uops_retired.wcb_hit
Counts memory load uops retired where the data is retrieved from
the WCB (or fill buffer), indicating that the load found its data
while that data was in the process of being brought into the L1
cache. Typically a load will receive this indication when some
other load or prefetch missed the L1 cache and was in the process
of retrieving the cache line containing the data, but that process
had not yet finished (and written the data back to the cache). For
example, consider loads X and Y, both referencing the same cache
line that is not in the L1 cache. If load X misses cache first, it
obtains a WCB (or fill buffer) and begins the process of requesting
the data. When load Y requests the data, it will either hit the
WCB or the L1 cache, depending on exactly when load Y's request
occurs.

mem_load_uops_retired.dram_hit
Counts memory load uops retired where the data is retrieved from
DRAM. The event is counted at retirement, so speculative loads are
ignored. A memory load can hit (or miss) the L1 cache, hit (or
miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM
response.

baclears.all
Counts the number of times a BACLEAR is signaled for any reason,
including, but not limited to, indirect branch/call, Jcc (Jump on
Conditional Code/Jump if Condition is Met) branch, unconditional
branch/call, and returns.

baclears.return
Counts BACLEARS on return instructions.

baclears.cond
Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Condition
is Met) branches.

ms_decoded.ms_entry
Counts the number of times the Microcode Sequencer (MS) starts a
flow of uops from the MSROM. It does not count every time a uop is
read from the MSROM. The most common case that this counts is when
a micro-coded instruction is encountered by the front end of the
machine. Other cases include when an instruction encounters a
fault, trap, or microcode assist of any sort that initiates a flow
of uops. The event will count MS startups for uops that are
speculative and subsequently cleared by a branch mispredict or a
machine clear.

decode_restriction.predecode_wrong
Counts the number of times the prediction (from the predecode
cache) for instruction length is incorrect.

SEE ALSO


cpc(3CPC)

https://download.01.org/perfmon/index/

illumos June 18, 2018 illumos