lens-regex-pcre
A lensy interface to regular expressions
https://github.com/ChrisPenner/lens-regex-pcre#readme
LTS Haskell 22.40: | 1.1.0.0 |
Stackage Nightly 2024-11-08: | 1.1.0.0 |
Latest on Hackage: | 1.1.0.0 |
lens-regex-pcre-1.1.0.0@sha256:a6053fefae59f6b53b4741e5a75c5fae350af80dbf62e8673b565e1c3f34f8b9,2209
Module documentation for 1.1.0.0
- Control
- Control.Lens
- Control.Lens.Regex
- Control.Lens
lens-regex-pcre
Based on pcre-heavy
; so it should support any regexes or options which it supports.
Performance is equal, sometimes better than that of pcre-heavy
alone.
Which module should you use?
If you need unicode support, use Control.Lens.Regex.Text
, if not then Control.Lens.Regex.ByteString
is faster.
Working with Regexes in Haskell kinda sucks; it’s tough to figure out which libs
to use, and even after you pick one it’s tough to figure out how to use it; lens-regex-pcre
hopes to replace most other solutions by being fast, easy to set up, more adaptable with a more consistent interface.
It helps that there are already HUNDREDS of combinators which interop with lenses :smile:.
As it turns out; regexes are a very lens-like tool; Traversals allow you to select and alter zero or more matches; traversals can even carry indexes so you know which match or group you’re working on.
Examples
txt :: Text
txt = "raindrops on roses and whiskers on kittens"
-- Search
>>> has [regex|whisk|] . match txt
True
-- Get matches
>>> txt ^.. [regex|\br\w+|] . match
["raindrops","roses"]
-- Edit matches
>>> txt & [regex|\br\w+|] . match %~ T.intersperse '-' . T.toUpper
"R-A-I-N-D-R-O-P-S on R-O-S-E-S and whiskers on kittens"
-- Get Groups
>>> txt ^.. [regex|(\w+) on (\w+)|] . groups
[["raindrops","roses"],["whiskers","kittens"]]
-- Edit Groups
>>> txt & [regex|(\w+) on (\w+)|] . groups %~ reverse
"roses on raindrops and kittens on whiskers"
-- Get the third match
>>> txt ^? [regex|\w+|] . index 2 . match
Just "roses"
-- Match integers, 'Read' them into ints, then sort them in-place
-- dumping them back into the source text afterwards.
>>> "Monday: 29, Tuesday: 99, Wednesday: 3"
& partsOf ([regex|\d+|] . match . unpacked . _Show @Int) %~ sort
"Monday: 3, Tuesday: 29, Wednesday: 99"
Basically anything you want to do is possible somehow.
Performance
See the benchmarks.
Summary
Caveat: I’m by no means a benchmarking expert; if you have tips on how to do this better I’m all ears!
- Search
lens-regex-pcre
is marginally slower thanpcre-heavy
, but well within acceptable margins (within 0.6%) - Replace
lens-regex-pcre
beatspcre-heavy
by ~10% - Modify
pcre-heavy
doesn’t support this operation at all, so I guesslens-regex-pcre
wins here :)
How can it possibly be faster if it’s based on pcre-heavy
? lens-regex-pcre
only uses pcre-heavy
for finding the matches, not substitution/replacement. After that it splits the text into chunks and traverses over them with whichever operation you’ve chosen. The nature of this implementation makes it a lot easier to understand than imperative implementations of the same thing. This means it’s pretty easy to make edits, and is also the reason we can support arbitrary traversals/actions. It was easy enough, so I went ahead and made the whole thing use ByteString Builders, which sped it up a lot. I suspect that pcre-heavy
can benefit from the same optimization if anyone feels like back-porting it; it could be (almost) as nicely using simple traverse
without any lenses. The whole thing is only about 25 LOC.
I’m neither a benchmarks nor stats person, so please open an issue if anything here seems fishy.
Without pcre-light
and pcre-heavy
this library wouldn’t be possible, so huge thanks to all contributors!
Here are the benchmarks on my 2013 Macbook (2.6 Ghz i5)
benchmarking static pattern search/pcre-heavy ... took 20.78 s, total 56 iterations
benchmarked static pattern search/pcre-heavy
time 375.3 ms (372.0 ms .. 378.5 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 378.1 ms (376.4 ms .. 380.8 ms)
std dev 3.747 ms (922.3 μs .. 5.609 ms)
benchmarking static pattern search/lens-regex-pcre ... took 20.79 s, total 56 iterations
benchmarked static pattern search/lens-regex-pcre
time 379.5 ms (376.2 ms .. 382.4 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 377.3 ms (376.5 ms .. 378.4 ms)
std dev 1.667 ms (1.075 ms .. 2.461 ms)
benchmarking complex pattern search/pcre-heavy ... took 95.95 s, total 56 iterations
benchmarked complex pattern search/pcre-heavy
time 1.741 s (1.737 s .. 1.746 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.746 s (1.744 s .. 1.749 s)
std dev 4.499 ms (3.186 ms .. 6.080 ms)
benchmarking complex pattern search/lens-regex-pcre ... took 97.26 s, total 56 iterations
benchmarked complex pattern search/lens-regex-pcre
time 1.809 s (1.736 s .. 1.908 s)
0.996 R² (0.991 R² .. 1.000 R²)
mean 1.757 s (1.742 s .. 1.810 s)
std dev 42.83 ms (11.51 ms .. 70.69 ms)
benchmarking simple replacement/pcre-heavy ... took 23.32 s, total 56 iterations
benchmarked simple replacement/pcre-heavy
time 423.8 ms (422.4 ms .. 425.3 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 424.0 ms (422.9 ms .. 426.2 ms)
std dev 2.684 ms (1.239 ms .. 4.270 ms)
benchmarking simple replacement/lens-regex-pcre ... took 20.84 s, total 56 iterations
benchmarked simple replacement/lens-regex-pcre
time 382.8 ms (374.3 ms .. 391.5 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 378.2 ms (376.3 ms .. 381.0 ms)
std dev 3.794 ms (2.577 ms .. 5.418 ms)
benchmarking complex replacement/pcre-heavy ... took 24.77 s, total 56 iterations
benchmarked complex replacement/pcre-heavy
time 448.1 ms (444.7 ms .. 450.0 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 450.8 ms (449.5 ms .. 453.9 ms)
std dev 3.129 ms (947.0 μs .. 4.841 ms)
benchmarking complex replacement/lens-regex-pcre ... took 21.99 s, total 56 iterations
benchmarked complex replacement/lens-regex-pcre
time 399.9 ms (398.4 ms .. 402.2 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 399.6 ms (399.0 ms .. 400.4 ms)
std dev 1.135 ms (826.2 μs .. 1.604 ms)
Benchmark lens-regex-pcre-bench: FINISH
Behaviour
Precise Expected behaviour (and examples) can be found in the test suites:
Changes
Changelog for lens-regex-pcre
1.0.1.0
BREAKING CHANGES
This release fixes a pretty major bugs surrounding the behaviour of optional groups. It’s unlikely but still possible that the change to grouping behaviour has changed the behaviour of your application, but most likely it just fixed some bugs you didn’t know you had yet…
- Handle optional or alternated groups like
pcre-heavy
. This may change group behaviour on regular expressions which had groups with optional groups. E.g.:A(x)?(B)
(A)|(B)|(C)
- Switch
groups
fromIndexedTraversal'
toIndexedLens'
. Since all lenses are valid traversals this shouldn’t cause any breakages. - Add
namedGroups
andnamedGroup
1.0.0.0
- Add
regexing
andmakeRegexTraversalQQ
- Replace
regex
traversal maker withregex
QuasiQuoter - Split Control.Lens.Regex into Control.Lens.Regex.Text and Control.Lens.Regex.ByteString
- Move regexBS to
Control.Lens.Regex.ByteString.regex
- Change whole implementation to use ByteString Builders for a massive speedup
- Monomorphise
Match text
->Match
- Add groups to index of
match
and match to index ofgroups
&group
- Add
group = groups . ix n
for accessing a single group.
0.3.1.0
- Match -> Match text
- Added regexBS to run regex on ByteStrings directly
0.3.0.0
- Unify
iregex
intoregex
as a single indexed traversal
0.2.0.0
- Unify
grouped
,groups
, andigroups
into justgroups
with optional traversal
0.1.1.0
- Adds
grouped
andmatchAndGroups
0.1.0.1
- Doc fixes
0.1.0.0
- Initial Release