unicode-transforms

Unicode normalization

http://github.com/composewell/unicode-transforms

LTS Haskell 22.43:	0.4.0.1@rev:6
Stackage Nightly 2024-11-25:	0.4.0.1@rev:6
Latest on Hackage:	0.4.0.1@rev:6

See all snapshots unicode-transforms appears in

BSD-3-Clause licensed by Harendra Kumar

Maintained by [email protected]

This version can be pinned in stack with:unicode-transforms-0.4.0.1@sha256:ffb1d1d489dda5248e860e22b13c7523799bc841c2c195cc2aca569a63025451,5952

Module documentation for 0.4.0.1

Data
- Data.ByteString
  - Data.ByteString.UTF8
    - Data.ByteString.UTF8.Normalize
- Data.Text
  - Data.Text.Normalize
- Data.Unicode
  - Data.Unicode.Types

Depends on 5 packages(full list with versions):

base, bytestring, ghc-prim, text, unicode-data

Used by 3 packages in nightly-2024-11-25(full list with versions):

commonmark, ENIG, pandoc

Unicode Transforms

Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

What is normalization?

Unicode characters with adornments (e.g. Á) can be represented in two different forms, as a single composed character (U+00C1 = Á) or as multiple decomposed characters (U+0041(A) U+0301( ́ ) = Á). They are differently encoded byte sequences but for humans they have exactly the same visual appearance.

A regular byte comparison may tell that two strings are different even though they might be equivalent. We need to convert both the strings in a normalized form using the Unicode Character Database before we can compare them for equivalence. For example:

>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True

Performance

Normalization performance comparison of this package (v0.3.7) with the text-icu package using the ICU C++ library version ICU4C 65.1 on macOS. The benchmarks compare the time taken in milliseconds to normalize files in different languages and normalization forms using both the packages. In most cases unicode-transforms outperforms ICU.

Benchmark       unicode-transforms(ms) ICU(ms)    % Diff
--------------- ---------------------- -------   --------
NFKD/Korean                       7.78   37.10    +376.87
NFD/Korean                        7.86   37.06    +371.50
NFKD/Vietnamese                   6.85   12.48     +82.20
NFKD/Deutsch                      2.17    3.55     +63.30
NFKD/English                      1.71    2.78     +62.30
NFKC/Korean                       4.77    7.65     +60.28
NFD/Deutsch                       2.24    3.53     +57.41
NFD/English                       1.76    2.77     +57.32
NFC/Vietnamese                   10.66   16.63     +56.00
NFKC/Vietnamese                  10.95   16.58     +51.43
NFD/Devanagari                    6.48    8.68     +34.10
NFC/Devanagari                    6.77    8.49     +25.48
NFD/AllChars                      6.18    7.41     +19.91
NFD/Japanese                      7.80    9.20     +17.99
NFKC/Devanagari                   7.33    8.48     +15.74
NFKD/Japanese                     8.71   10.05     +15.39
NFD/Vietnamese                    5.94    6.83     +14.99
NFKD/Devanagari                   7.59    8.68     +14.27
NFKD/AllChars                     9.80   10.66      +8.82
NFKC/Deutsch                      3.21    3.18      -0.72
NFC/Korean                        4.62    4.38      -5.35
NFKC/English                      2.21    2.06      -6.88
NFC/English                       2.19    2.04      -7.21
NFKC/AllChars                    14.67    9.75     -50.51
NFC/Deutsch                       3.02    1.95     -54.39
NFKC/Japanese                    12.46    5.42    -129.93
NFC/AllChars                      9.72    3.58    -171.63
NFC/Japanese                     11.90    3.04    -292.04

Talks

Talks: Functional Conf 2018 Video | Functional Conf 2018 Slides

Contributing

Please use https://github.com/harendra-kumar/unicode-transforms to raise issues, or send pull requests.

Changes

0.4.0.1 (March 2022)

Support text-icu == 0.8.* in tests
Support unicode-data == 0.3.*

0.4.0 (November 2021)

Bump unicode-data to 0.2
Allow text 2.0
Update to Unicode version 14.0.0
Drop support for GHC 7.10

0.3.8

Allow ghc-prim 0.7
Extract unicode-data into its own package
Depend on the latest stable text

0.3.7.1

Fix x32 build

0.3.7

Significant performance improvements
Update to Unicode version 13.0.0

0.3.6

Update to Unicode version 12.1.0
Update Quickcheck dependency version bounds
Test with GHC 8.6.5

0.3.5

Update dependency version bounds
Test with GHC 8.6.2

0.3.4

GHC 8.4.1 support

0.3.3

GHC 8.2.1 support

0.3.2

Work around a GHC/LLVM issue for ARM

0.3.1

Update dependency versions

0.3.0

Support Unicode version 9.0

0.2.1

Improve speed and resource hog during compilation

0.2.0

Support Unicode version 8.0
Switch to pure Haskell implementation

0.1.0.1

Initial release based on utf8proc C implementation