BSD-3-Clause licensed by Tony Day, Marco Zocca
Maintained by [email protected]
This version can be pinned in stack with:perf-0.13.0.0@sha256:286fd315955a4da84cdb05809d5ab8d495d51b36a14d7b87bee2dc795bcb34fb,2877

Table of Contents

  1. Introduction
  2. Setup
  3. System.Clock
    1. resolution
  4. Time
    1. What is a tick?
    2. tick_
    3. multiple ticks
    4. tickIO
    5. sum example
  5. PerfT
  6. perf-explore
    1. noops
    2. measurement context
      1. short list
      2. long list
    3. sums
    4. lengths
    5. Space
  7. Perf.BigO
  8. References
    1. Core
    2. Profiling
      1. setup
      2. Space usage output (-s)
      3. Cost center profile (-p)
      4. heap analysis (-hc -l)
    3. Cache speed

img img

Introduction

perf provides some ideas, code and a library for low-level performance measurement for Haskell hacking. The library:

  • provides a monad transformer, PerfT, as a light-weight wrapper for use on existing code. PerfT modifications can be included in code bases, as opposed to performance being separated code and process, with any effects able to be erased at compile time with evalPerfT.

  • focuses on fast and accurate measurement.

  • is polymorphic to what, exactly, is being measured, so that concepts such as counters, debug checks, time and space performance can share treatment.

  • can measure big O for algorithms that can be defined in terms of input size growth.

Setup

Note that running perf.org is very slow compared with an external process which accesses the compiled version of the library.

:r
:set -Wno-type-defaults
:set -Wno-unused-do-bind
:set -Wno-name-shadowing
:set -XOverloadedStrings
:set -XOverloadedLabels
import Perf
import Data.FormatN
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Map.Strict as Map
import Control.Monad
import Data.Bifunctor
import System.Clock
putStrLn "ok"

[ 1 of 10] Compiling Perf.Stats       ( src/Perf/Stats.hs, interpreted ) [Source file changed]
[ 3 of 10] Compiling Perf.Time        ( src/Perf/Time.hs, interpreted ) [Source file changed]
[ 6 of 10] Compiling Perf.Measure     ( src/Perf/Measure.hs, interpreted ) [Source file changed]
[ 7 of 10] Compiling Perf.Report      ( src/Perf/Report.hs, interpreted ) [Source file changed]
[ 8 of 10] Compiling Perf.BigO        ( src/Perf/BigO.hs, interpreted ) [Perf.Stats changed]
[10 of 10] Compiling Perf             ( src/Perf.hs, interpreted ) [Perf.BigO changed]
Ok, 10 modules reloaded.
ok

System.Clock

The default clock is MonoticRaw for linux & macOS, and ThreadCPUTime for Windows.

resolution

getRes Monotonic
getRes Realtime
getRes ProcessCPUTime
getRes ThreadCPUTime
getRes MonotonicRaw

TimeSpec {sec = 0, nsec = 1000}
TimeSpec {sec = 0, nsec = 1000}
TimeSpec {sec = 0, nsec = 1000}
TimeSpec {sec = 0, nsec = 42}
TimeSpec {sec = 0, nsec = 42}

Time

What is a tick?

A fundamental operation of Perf.Time is tick, which sandwiches a (strict) function application between two readings of a clock, and returns time in nanoseconds, and the computation result. In this way, the `Perf` monad can be inserted into the midst of a computation in an attempt to measure performance in-situ as opposed to sitting off in a separate and decontextualized process.

:t tick

tick :: (a -> b) -> a -> IO (Nanos, b)

tick returns in the IO monad, because reading a cycle counter is an IO effect. A trivial but fundamental point is that performance measurement effects the computation being measured.

tick_

tick_ measures the nanoseconds between two immediate clock reads.

:t tick_

tick_ :: IO Nanos

replicateM 10 tick_

[1833,500,416,416,416,375,375,416,416,416]

multiple ticks

fmap (fmap (fst)) . replicateM 10 $ tick (const ()) ()

[7000,2333,2000,2208,1958,1959,1959,2000,2000,1959]

Here, const () () was evaluated and took 7 micro-seconds for the first effect, reducing down to 2 msecs after 10 effects.

tickIO

tickIO measures the evaluation of an IO value.

:t tickIO

tickIO :: IO a -> IO (Cycles, a)

fmap (fmap fst) . replicateM 10 $ tickIO (pure ())

[5541,1625,1458,1833,1375,1416,1375,1375,1375,1375]

sum example

fmap (expt (Just 2) . fromIntegral) . fst <$> ticks 10 sum ([1..10000] :: [Double])

["5.0e5","2.4e5","2.4e5","2.4e5","2.4e5","2.4e5","2.4e5","2.4e5","2.5e5","2.4e5"]

ts <- ticks 10000 sum ([1..1000] :: [Double])
print $ average (fmap fromIntegral $ fst ts)

10747.1975

PerfT

PerfT allows for multiple measurement points and is polymorphic in what is being measured. It returns a Map of results held in State.

Compare a lower-level usage of ticks, measuring the average of summing to one thousand over one thousand trials:

first (average . fmap fromIntegral) <$> ticks 1000 sum [1..1000]

(25947.635,500500)

… with PerfT usage

second (fmap (average . fmap fromIntegral)) <$> runPerfT (times 1000) (sum |$| [1..1000])

(500500,fromList [("",26217.098)])

Comparing performance of sum versus a list fusion approach:

fmap (average . fmap fromIntegral) <$> (execPerfT (times 1000) $ do; (fap "sum" sum [1..1000]); (fap "fusion" (\x -> sum [1..x]) 1000))

fromList [("fusion",32871.248),("sum",26924.128)]

An IO example

exampleIO' :: IO ()
exampleIO' = do
  txt <- Text.readFile "src/Perf.hs"
  let n = Text.length txt
  Text.putStrLn $ "length of file is: " <> Text.pack (show n)

exampleIO = execPerfT time (do
  txt <- fam "file_read" (Text.readFile "src/Perf.hs")
  n <- fap "length" Text.length txt
  fam "print_result" (Text.putStrLn $ "length of file is: " <> Text.pack (show n)))

perf-explore --exampleIO

length of file is: 1794
length of file is: 1794

label1          label2          label3          old result      new result      change

normal          file-read       time            2.31e5          1.28e5          improvement
normal          length          time            2.71e3          2.00e3          improvement
normal          print-result    time            3.75e4          1.32e4          improvement
outer           file-read       time            6.05e4          3.64e4          improvement
outer           length          time            9.59e2          6.25e2          improvement
outer           outer-total     time            7.39e4          4.02e4          improvement
outer           print-result    time            9.79e3          1.71e3          improvement

perf-explore

perf-explore contains some exploratory routines used to develop perf

perf-explore --help

examples of perf usage

Usage: perf-explore [-n|--runs ARG]
                    [--Monotonic | --Realtime | --ProcessCPUTime |
                      --ThreadCPUTime | --MonotonicRaw]
                    [--best | --median | --average]
                    [--time | --space | --spacetime | --allocation | --count]
                    [-g|--golden ARG] [--nocheck] [-r|--record]
                    [--header | --noheader] [--error ARG] [--warning ARG]
                    [--improved ARG]
                    [--sums | --lengths | --nub | --clocks | --examples |
                      --example | --exampleIO | --noops | --ticks]
                    [-l|--length ARG]
                    [--sumFuse | --sum | --lengthF | --constFuse | --mapInc |
                      --noOp]

  perf exploration

Available options:
  -n,--runs ARG            number of runs to perform
  --best                   report upper decile
  --median                 report median
  --average                report average
  --time                   measure time performance
  --space                  measure space performance
  --spacetime              measure both space and time performance
  --allocation             measure bytes allocated
  --count                  measure count
  -g,--golden ARG          golden file name
  --nocheck                do not check versus the golden file
  -r,--record              record the result to the golden file
  --header                 include headers
  --noheader               dont include headers
  --error ARG              error level
  --warning ARG            warning level
  --improved ARG           improved level
  --sums                   run on sum algorithms
  --lengths                run on length algorithms
  --nub                    nub test
  --clocks                 clock test
  --examples               run on example algorithms
  --example                run on the example algorithm
  --exampleIO              exampleIO test
  --noops                  noops test
  --ticks                  tick test
  -l,--length ARG          length of list
  --sumFuse                fused sum pipeline
  --sum                    sum
  --lengthF                foldr id length
  --constFuse              fused const pipeline
  --mapInc                 fmap (+1)
  --noOp                   const ()
  -h,--help                Show this help text

fmap averageI <$> execPerfT (times 10000) (sum |$| [1..1000])

fromList [("",136055.5594)]

The equivalent to the above code is:

perf-explore -n 10000 -l 1000 --sum --nocheck

label1          label2          results

sum             time            6.32e3

noops

This no-op experiment is useful to understand the pure time performance of the machinery around measurement. It can be (re)run with:

perf-explore --noops

label1          label2          label3          old result      new result      change

const           1st             time            1.72e4          8.79e3          improvement
const           2nd             time            2.09e2          1.25e2          improvement
const           3rd             time            1.66e2          1.25e2          improvement
const           4th             time            2.08e2          8.30e1          improvement
const           average         time            2.08e2          1.10e2          improvement
const           best            time            1.31e2          6.31e1          improvement
const           median          time            1.60e2          7.76e1          improvement
pure            1st             time            1.00e3          1.25e2          improvement
pure            2nd             time            1.67e2          8.30e1          improvement
pure            3rd             time            1.66e2          8.30e1          improvement
pure            4th             time            1.25e2          4.20e1          improvement
pure            average         time            1.85e2          8.29e1          improvement
pure            best            time            1.31e2          6.37e1          improvement
pure            median          time            1.63e2          7.79e1          improvement

measurement context

Exploration of how the code surrounding measurement effects performance.

perf-explore -n 1000 -l 1000 --ticks --nocheck

short list

perf-explore -n 10000 -l 10 --median --ticks

long list

perf-explore -n 100 -l 100000 --best --ticks

sums

perf-explore -n 1000 -l 1000 --sums

label1          label2          old result      new result      change

sumAux          time            5.53e3          5.21e3          improvement
sumCata         time            5.18e3          4.73e3          improvement
sumCo           time            6.50e3          6.40e3
sumCoCase       time            5.16e3          6.03e3          slightly-degraded
sumCoGo         time            6.11e3          5.88e3
sumF            time            5.44e3          4.16e3          improvement
sumFlip         time            7.23e3          7.07e3
sumFlipLazy     time            6.68e3          6.44e3
sumFoldr        time            5.23e3          5.00e3
sumFuse         time            6.85e2          6.81e2
sumFuseFoldl'   time            6.94e2          6.78e2
sumFuseFoldr    time            1.04e3          1.02e3
sumFusePoly     time            6.96e2          6.84e2
sumLambda       time            4.79e3          4.77e3
sumMono         time            4.82e3          4.84e3
sumPoly         time            4.77e3          4.81e3
sumSum          time            4.95e3          5.05e3
sumTail         time            7.32e3          7.10e3
sumTailLazy     time            6.75e3          6.52e3

lengths

perf-explore -n 1000 -l 1000 --lengths

label1          label2          old result      new result      change          

lengthAux       time            4.44e3          3.68e3          improvement     
lengthCo        time            4.91e3          4.45e3          improvement     
lengthCoCase    time            4.90e3          4.44e3          improvement
lengthF         time            3.38e3          3.21e3
lengthFMono     time            4.16e3          4.00e3
lengthFlip      time            5.49e3          4.90e3          improvement
lengthFlipLazy  time            5.32e3          4.77e3          improvement
lengthFoldr     time            4.23e3          3.90e3          improvement
lengthFoldrConsttime            3.98e3          3.74e3          improvement
lengthTail      time            6.47e3          5.30e3          improvement
lengthTailLazy  time            6.11e3          5.34e3          improvement

Space

 perf-explore -n 10 -l 100000 --space +RTS -T -RTS

 label1          label2          old result      new result      change

 sum             MaxMem          4.61e6          4.61e6
 sum             allocated       4.20e5          4.20e5
 sum             gcLiveBytes     2.15e5          2.17e5
 sum             gcollects       1.00e-1         1.00e-1
 sum             maxLiveBytes    0.00e0          0.00e0

Data is collected from GHCStats

  • allocated_bytes
  • gcs
  • gcdetails_live_bytes
  • max_live_bytes
  • max_mem_in_use_bytes

Perf.BigO

Perf.BigO represents functionality to determine the complexity order for a computation.

We could do a regression and minimise the error term, but we know that the largest run contains the most information; we would need to weight the simulations according to some heuristic.

Instead, we:

  • estimate the order factor for each possible Order, from N3 to N0, setting the highest n run constant factor to zero,

  • pick the order based on lowest absolute error result summed across all the runs,

    import qualified Prelude as P import Data.List (nub) estOrder (\x -> sum $ nub [1..x]) 10 [1,10,100,1000]

    BigOrder {bigOrder = N2, bigFactor = 4.05725, bigConstant = 0.0}

Cache speed

The average cycles per + operation can get down to about 0.7 cycles, and there are about 4 cache registers per cycle, so a sum pipeline uses 2.8 register instructions per +.

Changes

0.13

  • replaced rdtsc with clocks library.

0.12.0.0

  • added reportMain and support for executable development.
  • refactored app/perf-explore
  • created app/perf-bench and aded as a cabal bench thing.

0.11.0.0

  • added Perf.Count
  • GHC 9.6.2 support

0.10.0

  • GHC 9.2.1 support
  • inclusion of BigO

0.9.0

  • Removed deepseq dependency

0.8.0

  • GHC 9.0.1 support
  • internal fixes to remove numhask dependency