BSD-3-Clause licensed by Tony Day, Marco Zocca
Maintained by [email protected]
This version can be pinned in stack with:perf-0.12.0.1@sha256:8ae58c3061b6492866447814e806e2372908fd74e954cf8f964dcd0760bda9fd,4413
#+TITLE: perf
#+PROPERTY: header-args :exports both
#+PROPERTY: header-args :eval no-export

[[https://hackage.haskell.org/package/perf][file:https://img.shields.io/hackage/v/perf.svg]] [[https://github.com/tonyday567/perf/actions?query=workflow%3Ahaskell-ci][file:https://github.com/tonyday567/perf/workflows/haskell-ci/badge.svg]]

* Introduction

~perf~ provides some ideas, code and a library for low-level performance measurement for Haskell hacking. The library:

- provides a monad transformer, ~PerfT~, as a light-weight wrapper for use on existing code. ~PerfT~ modifications can be included in code bases, as opposed to performance being separated code and process, with any effects able to be erased at compile time with ~evalPerfT~.

- focuses on using the [[https://en.wikipedia.org/wiki/Time_Stamp_Counter][rdtsc]] instruction set that comes with the x86 chip to measure accurate time performance as number of clock cycles.

- abstracts the concept of what is being measured, so that concepts such as counters, debug checks, time and space performance can be treated equivalently.

* Setup

Note that running this readme.org is very slow compared with an external process which accesses the compiled version of the library.

#+begin_src haskell-ng :results output
:r
:set -Wno-type-defaults
:set -Wno-unused-do-bind
:set -Wno-name-shadowing
:set -XOverloadedStrings
:set -XOverloadedLabels
import Perf
import Data.FormatN
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Map.Strict as Map
import Control.Monad
import Data.Bifunctor
putStrLn "ok"
#+end_src

#+RESULTS:
#+begin_example
Build profile: -w ghc-9.8.1 -O1
In order, the following will be built (use -v for more details):
- perf-0.12.0.1 (lib) (first run)
Preprocessing library for perf-0.12.0.1..
GHCi, version 9.8.1: https://www.haskell.org/ghc/ :? for help
Loaded GHCi configuration from /Users/tonyday/haskell/perf/.ghci
[ 1 of 10] Compiling Perf.Stats ( src/Perf/Stats.hs, interpreted )
[ 2 of 10] Compiling Perf.Types ( src/Perf/Types.hs, interpreted )
[ 3 of 10] Compiling Perf.Time ( src/Perf/Time.hs, interpreted )
[ 4 of 10] Compiling Perf.Space ( src/Perf/Space.hs, interpreted )
[ 5 of 10] Compiling Perf.Count ( src/Perf/Count.hs, interpreted )
[ 6 of 10] Compiling Perf.Measure ( src/Perf/Measure.hs, interpreted )
[ 7 of 10] Compiling Perf.Report ( src/Perf/Report.hs, interpreted )
[ 8 of 10] Compiling Perf.BigO ( src/Perf/BigO.hs, interpreted )
[ 9 of 10] Compiling Perf.Algos ( src/Perf/Algos.hs, interpreted )
[10 of 10] Compiling Perf ( src/Perf.hs, interpreted )
Ok, 10 modules loaded.
Ok, 10 modules loaded.
ok
#+end_example

* Time
** What is a tick?

The fundamental operation of Perf.Time is tick, which sandwiches a (strict) function application between two readings of the rdstc chip.

#+begin_src haskell-ng :results output :exports both
:t tick
#+end_src

#+RESULTS:
: tick :: (a -> b) -> a -> IO (Cycles, b)

tick returns in the IO monad, because reading a cycle counter is an IO effect. The trivial but fundamental point is that performance measurement effects the computation being measured.

Unlike benchmarking-style libraries, such as criterion, the result of the computation is passed through along with the cycle count. In this way, the `Perf` monad can be inserted into the midst of a computation in an attempt to measure performance in-situ as opposed to sitting off in a separate and perhaps unrepresentative process.

** tick_

tick_ measures the number of cycles between two clock reads.

#+begin_src haskell-ng :results output :exports both
:t tick_
#+end_src

#+RESULTS:
: tick_ :: IO Cycles


#+begin_src haskell-ng :results output :exports both
fmap word <$> replicateM 10 tick_
#+end_src

#+RESULTS:
: [3063,1201,990,978,945,948,951,948,951,948]

** multiple ticks

#+begin_src haskell-ng :results output :exports both
fmap (fmap (word . fst)) . replicateM 10 $ tick (const ()) ()
#+end_src

#+RESULTS:
: [12104,5040,4868,4508,4188,4172,4800,4748,5760,4216]

Here, ~const () ()~ was evaluated and took 6.1k cycles for the first effect, reducing down to 2.6k after 10 effects. What it did in that time we will never know, really, at least from usage of ~perf~. 2.2k cycles is, on my 2.5GHz machine equal to 2200 / 2.5e9 = 8.8e-7 of a second or 880 nanoseconds.

** tickIO

~tickIO~ measures the evaluation of an IO value.

#+begin_src haskell-ng :results output :exports both
:t tickIO
#+end_src

#+RESULTS:
: tickIO :: IO a -> IO (Cycles, a)

#+begin_src haskell-ng :results output :exports both
fmap (fmap (word . fst)) . replicateM 10 $ tickIO (pure ())
#+end_src

#+RESULTS:
: [10084,3588,4104,2844,2916,3140,2800,2848,2848,2804]

** sum example

#+begin_src haskell-ng :exports both
fmap (expt (Just 2) . fromIntegral) . fst <$> ticks 10 sum ([1..10000] :: [Double])
#+end_src

#+RESULTS:
: ["1.7e6","6.1e5","6.2e5","5.4e5","4.3e5","4.1e5","4.1e5","4.3e5","4.1e5","4.1e5"]


#+begin_src haskell-ng :results output :exports both
ts <- ticks 10000 sum ([1..1000] :: [Double])
print $ average (fmap fromIntegral $ fst ts)
#+end_src

#+RESULTS:
: 30411.6957

* PerfT

~PerfT~ allows for multiple measurement points and is polymorphic in what is being measured. It returns a Map of results held in State.

Compare a lower-level usage of ticks, measuring the average of summing to one thousand over one thousand trials:

#+begin_src haskell-ng :results output :exports both
first (average . fmap fromIntegral) <$> ticks 1000 sum [1..1000]
#+end_src

#+RESULTS:
: (50914.804,500500)

... with PerfT usage

#+begin_src haskell-ng :results output :exports both
second (fmap (average . fmap fromIntegral)) <$> runPerfT (times 1000) (sum |$| [1..1000])
#+end_src

#+RESULTS:
: (500500,fromList [("",44362.311)])

Comparing performance of sum versus a list fusion approach:

#+begin_src haskell-ng :results output :exports both
fmap (average . fmap fromIntegral) <$> (execPerfT (times 1000) $ do; (fap "sum" sum [1..1000]); (fap "fusion" (\x -> sum [1..x]) 1000))
#+end_src

#+RESULTS:
: fromList [("fusion",97804.085),("sum",45053.771)]

An IO example

#+begin_src haskell-ng
exampleIO' :: IO ()
exampleIO' = do
txt <- Text.readFile "src/Perf.hs"
let n = Text.length txt
Text.putStrLn $ "length of file is: " <> Text.pack (show n)
#+end_src

#+begin_src haskell-ng :results output :exports both
exampleIO = execPerfT time (do
txt <- fam "file_read" (Text.readFile "src/Perf.hs")
n <- fap "length" Text.length txt
fam "print_result" (Text.putStrLn $ "length of file is: " <> Text.pack (show n)))
#+end_src

#+begin_src sh :results output :exports both
perf-explore --exampleIO
#+end_src

#+RESULTS:
#+begin_example
length of file is: 1794
length of file is: 1794

label1 label2 label3 old result new result change

normal file-read time 5.96e5 6.58e5 slightly-degraded
normal length time 1.10e4 7.12e3 improvement
normal print-result time 1.44e5 5.70e4 improvement
outer file-read time 1.36e5 1.39e5
outer length time 1.45e3 1.33e3 improvement
outer outer-total time 1.67e5 1.50e5 improvement
outer print-result time 2.70e4 6.26e3 improvement
#+end_example

* perf-explore

~perf-explore~ contains some exploratory routines used to develop =perf=

#+begin_src sh :results output :exports both
perf-explore --help
#+end_src

#+RESULTS:
#+begin_example
examples of perf usage

Usage: perf-explore [-n|--runs ARG]
[--best | --median | --average | --averagesecs]
[--time | --space | --spacetime | --allocation | --count]
[-g|--golden ARG] [--nocheck] [-r|--record]
[--header | --noheader] [--error ARG] [--warning ARG]
[--improved ARG]
[--sums | --lengths | --nub | --examples | --example |
--exampleIO | --noops | --ticks | --gauge]
[-l|--length ARG]
[--sumFuse | --sum | --lengthF | --constFuse | --mapInc |
--noOp]

perf exploration

Available options:
-n,--runs ARG number of runs to perform
--best report upper decile
--median report median
--average report average
--averagesecs report average in seconds
--time measure time performance
--space measure space performance
--spacetime measure both space and time performance
--allocation measure bytes allocated
--count measure count
-g,--golden ARG golden file name
--nocheck do not check versus the golden file
-r,--record record the result to the golden file
--header include headers
--noheader dont include headers
--error ARG error level
--warning ARG warning level
--improved ARG improved level
--sums run on sum algorithms
--lengths run on length algorithms
--nub nub test
--examples run on example algorithms
--example run on the example algorithm
--exampleIO exampleIO test
--noops noops test
--ticks tick test
--gauge gauge runs on exmaple for comparison
-l,--length ARG length of list
--sumFuse fused sum pipeline
--sum sum
--lengthF foldr id length
--constFuse fused const pipeline
--mapInc fmap (+1)
--noOp const ()
-h,--help Show this help text
#+end_example

#+begin_src haskell-ng :results output :exports both
fmap averageI <$> execPerfT (times 10000) (sum |$| [1..1000])
#+end_src

#+RESULTS:
: fromList [("",35729.9518)]

The equivalent to the above code is:

#+begin_src sh :results output :exports both
perf-explore -n 10000 -l 1000 --sum --nocheck
#+end_src

#+RESULTS:
: label1 label2 results
:
: sum time 1.47e4

** noops

This no-op experiment is useful to understand the pure time performance of the machinery around measurement. It can be (re)run with:

#+begin_src sh :results output :exports both
perf-explore --noops
#+end_src

#+RESULTS:
#+begin_example
label1 label2 label3 results

const 1st time 6.05e3
const 2nd time 2.80e1
const 3rd time 3.60e1
const 4th time 3.20e1
const average time 4.11e1
const best time 2.74e1
const median time 3.54e1
pure 1st time 6.46e3
pure 2nd time 7.60e1
pure 3rd time 4.00e1
pure 4th time 4.00e1
pure average time 4.28e1
pure best time 2.82e1
pure median time 3.64e1
No golden file found. To create one, run with '-r'
#+end_example

** measurement context

Exploration of how the code surrounding measurement effects performance.

#+begin_src sh :results drawer :exports both
perf-explore -n 1000 -l 1000 --ticks --nocheck
#+end_src

#+RESULTS:
:results:
| | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times |
| sumAux | 1.46e4 | 1.51e4 | 1.65e4 | 1.45e4 | 1.96e1 | 1.45e4 | 1.46e4 |
| sumCata | 1.45e4 | 1.82e4 | 1.45e4 | 1.45e4 | 1.99e1 | 1.46e4 | 1.48e4 |
| sumCo | 1.45e4 | 1.46e4 | 1.46e4 | 1.63e4 | 2.02e1 | 1.45e4 | 1.44e4 |
| sumCoCase | 1.44e4 | 1.51e4 | 1.61e4 | 1.46e4 | 2.78e1 | 2.10e4 | 1.49e4 |
| sumCoGo | 2.61e4 | 1.46e4 | 3.82e4 | 2.55e4 | 2.04e1 | 1.45e4 | 2.59e4 |
| sumF | 1.34e4 | 1.34e4 | 2.01e4 | 1.64e4 | 2.01e1 | 1.33e4 | 1.33e4 |
| sumFlip | 1.37e4 | 1.34e4 | 1.34e4 | 1.33e4 | 2.06e1 | 1.41e4 | 1.33e4 |
| sumFlipLazy | 1.33e4 | 1.34e4 | 1.33e4 | 1.34e4 | 2.01e1 | 1.37e4 | 1.34e4 |
| sumFoldr | 1.44e4 | 1.45e4 | 2.18e4 | 1.68e4 | 3.22e1 | 1.54e4 | 1.45e4 |
| sumFuse | 3.19e3 | 1.98e3 | 1.93e3 | 2.19e3 | 1.98e1 | 1.83e3 | 3.15e3 |
| sumFuseFoldl' | 1.88e3 | 2.46e3 | 2.02e3 | 2.27e3 | 2.39e1 | 2.38e3 | 2.15e3 |
| sumFuseFoldr | 5.79e3 | 6.01e3 | 3.70e3 | 3.70e3 | 2.00e1 | 3.70e3 | 6.11e3 |
| sumFusePoly | 2.67e3 | 2.68e3 | 2.65e3 | 2.29e3 | 2.73e1 | 2.44e3 | 2.60e3 |
| sumLambda | 1.59e4 | 1.35e4 | 1.47e4 | 1.99e4 | 1.99e1 | 1.36e4 | 1.34e4 |
| sumMono | 1.38e4 | 1.69e4 | 1.35e4 | 1.36e4 | 2.04e1 | 1.37e4 | 1.34e4 |
| sumPoly | 1.99e4 | 1.42e4 | 1.35e4 | 1.73e4 | 1.97e1 | 1.35e4 | 1.51e4 |
| sumSum | 1.34e4 | 1.36e4 | 1.34e4 | 1.35e4 | 1.99e1 | 1.34e4 | 2.01e4 |
| sumTail | 2.03e4 | 1.37e4 | 1.34e4 | 1.34e4 | 1.99e1 | 1.34e4 | 2.01e4 |
| sumTailLazy | 1.34e4 | 1.66e4 | 1.33e4 | 1.41e4 | 1.99e1 | 1.35e4 | 1.33e4 |
:end:

*** short list
#+begin_src sh :results drawer :exports both
perf-explore -n 10000 -l 10 --best --ticks
#+end_src

#+RESULTS:
:results:
| | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times |
| sumAux | 1.81e2 | 2.33e2 | 1.81e2 | 1.82e2 | 1.32e1 | 1.83e2 | 1.82e2 |
| sumCata | 1.81e2 | 2.33e2 | 1.81e2 | 1.82e2 | 1.33e1 | 2.33e2 | 1.81e2 |
| sumCo | 2.10e2 | 2.33e2 | 2.13e2 | 2.11e2 | 2.49e1 | 2.33e2 | 2.10e2 |
| sumCoCase | 1.81e2 | 2.47e2 | 1.82e2 | 1.82e2 | 1.83e1 | 2.47e2 | 1.81e2 |
| sumCoGo | 1.81e2 | 2.34e2 | 2.34e2 | 2.34e2 | 1.33e1 | 2.33e2 | 1.82e2 |
| sumF | 1.03e2 | 2.22e2 | 1.05e2 | 1.04e2 | 1.88e1 | 1.83e2 | 1.04e2 |
| sumFlip | 1.13e2 | 1.28e2 | 1.20e2 | 1.20e2 | 1.88e1 | 1.17e2 | 1.63e2 |
| sumFlipLazy | 1.13e2 | 2.07e2 | 1.76e2 | 1.20e2 | 2.49e1 | 2.04e2 | 1.01e2 |
| sumFoldr | 1.81e2 | 2.66e2 | 2.65e2 | 1.81e2 | 1.88e1 | 2.65e2 | 1.81e2 |
| sumFuse | 2.58e1 | 2.71e1 | 2.85e1 | 3.36e1 | 1.36e1 | 3.36e1 | 3.42e1 |
| sumFuseFoldl' | 5.04e1 | 2.71e1 | 5.11e1 | 5.08e1 | 1.41e1 | 2.65e1 | 3.96e1 |
| sumFuseFoldr | 8.16e1 | 8.61e1 | 6.26e1 | 7.34e1 | 1.25e1 | 8.47e1 | 8.54e1 |
| sumFusePoly | 4.19e1 | 5.82e1 | 4.26e1 | 5.81e1 | 1.30e1 | 5.91e1 | 4.30e1 |
| sumLambda | 1.39e2 | 1.30e2 | 1.60e2 | 1.63e2 | 1.33e1 | 1.29e2 | 1.43e2 |
| sumMono | 8.04e1 | 1.43e2 | 8.64e1 | 8.41e1 | 1.83e1 | 1.15e2 | 8.16e1 |
| sumPoly | 1.16e2 | 1.29e2 | 1.44e2 | 1.21e2 | 2.49e1 | 1.30e2 | 1.18e2 |
| sumSum | 1.02e2 | 1.03e2 | 1.04e2 | 1.05e2 | 1.33e1 | 1.02e2 | 1.04e2 |
| sumTail | 1.40e2 | 1.37e2 | 1.31e2 | 1.31e2 | 1.36e1 | 1.45e2 | 1.42e2 |
| sumTailLazy | 1.06e2 | 1.78e2 | 1.12e2 | 1.12e2 | 1.84e1 | 2.25e2 | 1.09e2 |
:end:


*** long list
#+begin_src sh :results drawer :exports both
perf-explore -n 100 -l 100000 --best --ticks
#+end_src

#+RESULTS:
:results:
| | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times |
| sumAux | 2.82e6 | 2.70e6 | 2.66e6 | 2.75e6 | 1.83e1 | 2.70e6 | 2.79e6 |
| sumCata | 2.71e6 | 2.69e6 | 2.63e6 | 2.64e6 | 1.28e1 | 2.68e6 | 2.75e6 |
| sumCo | 2.71e6 | 2.73e6 | 2.63e6 | 2.64e6 | 2.52e1 | 2.71e6 | 2.73e6 |
| sumCoCase | 2.69e6 | 2.70e6 | 2.70e6 | 2.65e6 | 1.85e1 | 2.76e6 | 2.67e6 |
| sumCoGo | 2.68e6 | 2.77e6 | 2.65e6 | 2.63e6 | 1.82e1 | 2.75e6 | 2.69e6 |
| sumF | 1.24e6 | 1.24e6 | 1.24e6 | 1.24e6 | 1.26e1 | 1.24e6 | 1.27e6 |
| sumFlip | 1.16e6 | 1.14e6 | 1.14e6 | 1.16e6 | 1.35e1 | 1.13e6 | 1.16e6 |
| sumFlipLazy | 1.03e6 | 1.03e6 | 1.03e6 | 1.02e6 | 1.37e1 | 1.03e6 | 1.03e6 |
| sumFoldr | 2.76e6 | 2.83e6 | 2.66e6 | 2.67e6 | 1.82e1 | 2.81e6 | 2.74e6 |
| sumFuse | 1.45e5 | 1.45e5 | 1.45e5 | 1.44e5 | 1.39e1 | 1.45e5 | 1.45e5 |
| sumFuseFoldl' | 1.45e5 | 1.45e5 | 1.45e5 | 1.46e5 | 1.35e1 | 1.45e5 | 1.45e5 |
| sumFuseFoldr | 1.76e6 | 1.77e6 | 1.75e6 | 1.75e6 | 1.90e1 | 1.76e6 | 1.75e6 |
| sumFusePoly | 1.45e5 | 2.00e5 | 1.45e5 | 1.45e5 | 1.32e1 | 1.68e5 | 1.45e5 |
| sumLambda | 9.35e5 | 9.52e5 | 9.35e5 | 9.38e5 | 2.48e1 | 9.46e5 | 9.44e5 |
| sumMono | 1.13e6 | 1.12e6 | 1.11e6 | 1.11e6 | 1.23e1 | 1.11e6 | 1.11e6 |
| sumPoly | 1.03e6 | 1.04e6 | 1.02e6 | 1.03e6 | 1.32e1 | 1.03e6 | 1.03e6 |
| sumSum | 1.20e6 | 1.22e6 | 1.20e6 | 1.19e6 | 1.32e1 | 1.22e6 | 1.20e6 |
| sumTail | 1.22e6 | 1.22e6 | 1.22e6 | 1.22e6 | 1.28e1 | 1.22e6 | 1.22e6 |
| sumTailLazy | 1.16e6 | 1.19e6 | 1.18e6 | 1.16e6 | 1.29e1 | 1.18e6 | 1.16e6 |
:end:

** sums

#+begin_src sh :output drawer :exports both
perf-explore -n 1000 -l 1000 --sums
#+end_src

#+RESULTS:
#+begin_example
label1 label2 old result new result change

sumAux time 1.51e4 1.46e4
sumCata time 1.55e4 1.46e4 improvement
sumCo time 1.55e4 2.01e4 degraded
sumCoCase time 1.57e4 1.47e4 improvement
sumCoGo time 1.54e4 1.69e4 slightly-degraded
sumF time 1.33e4 2.02e4 degraded
sumFlip time 1.41e4 1.36e4
sumFlipLazy time 1.47e4 1.93e4 degraded
sumFoldr time 1.56e4 1.75e4 slightly-degraded
sumFuse time 1.66e3 2.61e3 degraded
sumFuseFoldl' time 1.71e3 2.70e3 degraded
sumFuseFoldr time 5.83e3 6.72e3 slightly-degraded
sumFusePoly time 2.12e3 2.64e3 degraded
sumLambda time 1.47e4 1.39e4 improvement
sumMono time 1.73e4 1.40e4 improvement
sumPoly time 1.87e4 1.36e4 improvement
sumSum time 1.33e4 1.34e4
sumTail time 1.47e4 1.40e4 improvement
sumTailLazy time 1.44e4 1.34e4 improvement
#+end_example

** lengths

#+begin_src sh :exports both
perf-explore -n 1000 -l 1000 --lengths
#+end_src

#+RESULTS:
#+begin_example
label1 label2 old result new result change

lengthAux time 1.45e4 1.60e4 slightly-degraded
lengthCo time 1.64e4 1.39e4 improvement
lengthCoCase time 1.78e4 1.31e4 improvement
lengthF time 1.06e4 1.10e4
lengthFMono time 1.05e4 1.05e4
lengthFlip time 1.11e4 1.07e4
lengthFlipLazy time 1.13e4 1.07e4 improvement
lengthFoldr time 1.27e4 1.33e4 slightly-degraded
lengthFoldrConsttime 1.27e4 1.36e4 slightly-degraded
lengthTail time 1.08e4 1.10e4
lengthTailLazy time 1.15e4 1.03e4 improvement
#+end_example

** Gauge comparison

#+begin_src sh :results output :exports both
perf-explore -n 1000 -l 1000 --average --gauge
#+end_src

#+RESULTS:
#+begin_example
sumFuse
benchmarking function ... function time 849.4 ns

benchmarking function ... function time 848.6 ns

sum
benchmarking function ... function time 3.786 μs

benchmarking function ... function time 3.791 μs

lengthF
benchmarking function ... function time 1.863 μs

benchmarking function ... function time 1.872 μs

constFuse
benchmarking function ... function time 571.0 ns

benchmarking function ... function time 1.080 μs

mapInc
benchmarking function ... function time 9.203 ns

benchmarking function ... function time 15.99 μs

noop
benchmarking function ... function time 5.770 ns

benchmarking function ... function time 4.740 ns

#+end_example

** Space

Data is collected from GHCStats

- allocated_bytes
- gcs
- gcdetails_live_bytes
- max_live_bytes
- max_mem_in_use_bytes

#+begin_src sh :results ouput :exports both
perf-explore -n 10 -l 100000 --space +RTS -T -RTS
#+end_src

#+RESULTS:
: label1 label2 results
:
: sum MaxMem 4.61e6
: sum allocated 4.19e5
: sum gcLiveBytes 2.19e5
: sum gcollects 1.00e-1
: sum maxLiveBytes 0.00e0
: No golden file found. To create one, run with '-r'

* Perf.BigO

Perf.BigO represents functionality to determine the complexity order for a computation.

We could do a regression and minimise the error term, but we know that the largest run contains the most information; we would need to weight the simulations according to some heuristic.

Instead, we:

- estimate the order factor for each possible Order, from N3 to N0, setting the highest n run constant factor to zero,
- pick the order based on lowest absolute error result summed across all the runs,

#+begin_src haskell-ng :results output :exports both
import qualified Prelude as P
import Data.List (nub)
estOrder (\x -> sum $ nub [1..x]) 10 [1,10,100,1000]
#+end_src

#+RESULTS:
: BigOrder {bigOrder = N2, bigFactor = 13.375059, bigConstant = 0.0}

* References

[[https://github.com/haskell-perf/checklist][The Haskell performance checklist]]

[[https://github.com/ndmitchell/spaceleak][ndmitchell/spaceleak: Notes on space leaks]]

** Core

[[https://ghc.gitlab.haskell.org/ghc/doc/users_guide/debugging.html#options-debugging][5.13. Debugging the compiler]]

#+begin_src sh
ghc app/speed.hs -ddump-simpl -ddump-to-file -fforce-recomp -dlint -O
#+end_src

[[https://wiki.haskell.org/Performance/GHC#Looking_at_the_Core][haskell wiki: Looking at the Core]]

[[https://godbolt.org/][godbolt]]

[[https://gitlab.haskell.org/ghc/ghc/-/issues/15185][ghc issue 15185: Enum instance for IntX / WordX are inefficient]]

[[https://fixpt.de/blog/2017-12-04-strictness-analysis-part-1.html][fixpt - All About Strictness Analysis (part 1)]]

** Profiling
*** setup
[[https://ghc.gitlab.haskell.org/ghc/doc/users_guide/profiling.html#prof-heap][8. Profiling]]

A typical configuration step for profiling:

#+begin_src sh :results output
cabal configure --enable-library-profiling --enable-executable-profiling -fprof-auto -fprof -write-ghc-environment-files=always
#+end_src

A cabal.project.local with profiling enabled:

#+begin_quote
write-ghc-environment-files: always
ignore-project: False
flags: +prof +prof-auto
library-profiling: True
executable-profiling: True
#+end_quote

Examples from markup-parse R&D:

Executable compilation:

#+begin_src sh :results output
ghc -prof -fprof-auto -rtsopts app/speed0.hs -threaded -fforce-recomp
#+end_src

Executable run:

#+begin_src sh :results output
app/speed0 +RTS -s -p -hc -l -RTS
#+end_src

*** Space usage output (-s)

#+begin_example
885,263,472 bytes allocated in the heap
8,507,448 bytes copied during GC
163,200 bytes maximum residency (4 sample(s))
27,752 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 207 colls, 0 par 0.009s 0.010s 0.0001s 0.0002s
Gen 1 4 colls, 0 par 0.001s 0.001s 0.0004s 0.0005s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.006s ( 0.006s elapsed)
MUT time 0.367s ( 0.360s elapsed)
GC time 0.010s ( 0.011s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 0.384s ( 0.380s elapsed)
#+end_example

*** Cost center profile (-p)

Dumped to speed0.prof

#+begin_example
COST CENTRE MODULE SRC %time %alloc

token MarkupParse src/MarkupParse.hs:(259,1)-(260,20) 50.2 50.4
wrappedQ' MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(215,1)-(217,78) 20.8 23.1
ws_ MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(135,1)-(146,4) 14.3 5.5
eq MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:243:1-30 10.6 11.1
gather MarkupParse src/MarkupParse.hs:(420,1)-(428,100) 2.4 3.7
runParser FlatParse.Basic src/FlatParse/Basic.hs:(217,1)-(225,24) 1.0 6.0
#+end_example

*** heap analysis (-hc -l)

#+begin_src sh :results output
eventlog2html speed0.eventlog
#+end_src

Produces speed0.eventlog.html which contains heap charts.

** Cache speed

The average cycles per + operation can get down to about 0.7 cycles, and there are about 4 cache registers per cycle, so a sum pipeline uses 2.8 register instructions per +.

| Cache | Cycles |
|-------------------|----------------|
| register | 4 per cycle |
| L1 Cache access | 3-4 cycles |
| L2 Cache access | 11-12 cycles |
| L3 unified access | 30 - 40 |
| DRAM hit | 195 cycles |
| L1 miss | 40 cycles |
| L2 miss | >600 cycles |


Changes

0.12.0.0

  • added reportMain and support for executable development.
  • refactored app/perf-explore
  • created app/perf-bench and aded as a cabal bench thing.

0.11.0.0

  • added Perf.Count
  • GHC 9.6.2 support

0.10.0

  • GHC 9.2.1 support
  • inclusion of BigO

0.9.0

  • Removed deepseq dependency

0.8.0

  • GHC 9.0.1 support
  • internal fixes to remove numhask dependency