unicode-data

Access Unicode Character Database (UCD)

http://github.com/composewell/unicode-data

LTS Haskell 22.39:	0.4.0.1@rev:3
Stackage Nightly 2024-10-31:	0.6.0@rev:2
Latest on Hackage:	0.6.0@rev:2

See all snapshots unicode-data appears in

Apache-2.0 licensed by Composewell Technologies and Contributors

Maintained by [email protected]

This version can be pinned in stack with:unicode-data-0.6.0@sha256:4a42e51b4c50ffc3960999d3a2e94bcea17a55480fe5fead0904c493a513a9f8,5804

Module documentation for 0.6.0

Unicode
- Unicode.Char
- Unicode.Internal
  - Unicode.Internal.Bits
  - Unicode.Internal.Char
    - Unicode.Internal.Char.Blocks
    - Unicode.Internal.Char.CaseFolding
    - Unicode.Internal.Char.DerivedCoreProperties
    - Unicode.Internal.Char.DerivedNumericValues
    - Unicode.Internal.Char.PropList
    - Unicode.Internal.Char.SpecialCasing
      - Unicode.Internal.Char.SpecialCasing.LowerCaseMapping
      - Unicode.Internal.Char.SpecialCasing.TitleCaseMapping
      - Unicode.Internal.Char.SpecialCasing.UpperCaseMapping
    - Unicode.Internal.Char.UnicodeData
      - Unicode.Internal.Char.UnicodeData.CombiningClass
      - Unicode.Internal.Char.UnicodeData.Compositions
      - Unicode.Internal.Char.UnicodeData.Decomposable
      - Unicode.Internal.Char.UnicodeData.DecomposableK
      - Unicode.Internal.Char.UnicodeData.Decompositions
      - Unicode.Internal.Char.UnicodeData.DecompositionsK
      - Unicode.Internal.Char.UnicodeData.DecompositionsK2
      - Unicode.Internal.Char.UnicodeData.GeneralCategory
      - Unicode.Internal.Char.UnicodeData.SimpleLowerCaseMapping
      - Unicode.Internal.Char.UnicodeData.SimpleTitleCaseMapping
      - Unicode.Internal.Char.UnicodeData.SimpleUpperCaseMapping
    - Unicode.Internal.Char.Version
  - Unicode.Internal.Division
  - Unicode.Internal.Unfold

Depends on 1 package(full list with versions):

base

Used by 3 packages in nightly-2024-10-31(full list with versions):

commonmark, streamly, unicode-transforms

README

unicode-data provides Haskell APIs to efficiently access the Unicode character database. Performance is the primary goal in the design of this package.

The Haskell data structures are generated programmatically from the Unicode character database (UCD) files. The latest Unicode version supported by this library is 15.1.0.

Please see the Haddock documentation for reference documentation.

Performance

unicode-data is up to 5 times faster than base ≤ 4.17 (see partial integration to base).

The following benchmark compares the time taken in milliseconds to process all the Unicode code points (except surrogates, private use areas and unassigned), for base-4.16 (GHC 9.2.6) and this package (v0.4). Machine: 8 × AMD Ryzen 5 2500U on Linux.

All
  Unicode.Char.Case.Compat
    isLower
      base:           OK (1.19s)
        17.1 ms ± 241 μs
      unicode-data:   OK (0.52s)
        3.58 ms ± 125 μs, 0.21x
    isUpper
      base:           OK (0.63s)
        17.5 ms ± 359 μs
      unicode-data:   OK (1.02s)
        3.58 ms ±  48 μs, 0.21x
    toLower
      base:           OK (0.59s)
        16.3 ms ± 524 μs
      unicode-data:   OK (0.80s)
        5.63 ms ± 129 μs, 0.35x
    toTitle
      base:           OK (3.91s)
        14.9 ms ± 427 μs
      unicode-data:   OK (2.84s)
        5.31 ms ±  37 μs, 0.36x
    toUpper
      base:           OK (2.12s)
        15.4 ms ± 234 μs
      unicode-data:   OK (0.86s)
        5.80 ms ± 159 μs, 0.38x
  Unicode.Char.General
    generalCategory
      base:           OK (1.16s)
        16.6 ms ± 534 μs
      unicode-data:   OK (0.62s)
        4.14 ms ± 103 μs, 0.25x
    isAlphaNum
      base:           OK (0.62s)
        17.1 ms ± 655 μs
      unicode-data:   OK (0.97s)
        3.59 ms ±  51 μs, 0.21x
    isControl
      base:           OK (0.63s)
        17.6 ms ± 494 μs
      unicode-data:   OK (0.57s)
        3.59 ms ±  90 μs, 0.20x
    isMark
      base:           OK (0.34s)
        17.6 ms ± 695 μs
      unicode-data:   OK (1.00s)
        3.59 ms ±  67 μs, 0.20x
    isPrint
      base:           OK (1.22s)
        17.7 ms ± 492 μs
      unicode-data:   OK (1.92s)
        3.56 ms ±  27 μs, 0.20x
    isPunctuation
      base:           OK (2.23s)
        16.6 ms ± 619 μs
      unicode-data:   OK (1.05s)
        3.60 ms ±  52 μs, 0.22x
    isSeparator
      base:           OK (1.15s)
        16.6 ms ± 439 μs
      unicode-data:   OK (0.49s)
        3.60 ms ±  85 μs, 0.22x
    isSymbol
      base:           OK (2.11s)
        16.1 ms ± 553 μs
      unicode-data:   OK (1.05s)
        3.58 ms ±  62 μs, 0.22x
  Unicode.Char.General.Compat
    isAlpha
      base:           OK (0.58s)
        17.2 ms ± 502 μs
      unicode-data:   OK (1.02s)
        3.58 ms ±  50 μs, 0.21x
    isLetter
      base:           OK (8.57s)
        16.4 ms ± 553 μs
      unicode-data:   OK (1.05s)
        3.58 ms ±  79 μs, 0.22x
    isSpace
      base:           OK (1.09s)
        7.56 ms ± 159 μs
      unicode-data:   OK (0.97s)
        3.58 ms ±  46 μs, 0.47x
  Unicode.Char.Numeric.Compat
    isNumber
      base:           OK (0.58s)
        15.7 ms ± 462 μs
      unicode-data:   OK (0.58s)
        3.58 ms ± 107 μs, 0.23x

Partial integration of `unicode-data` into `base`

Since base 4.18, unicode-data has been partially integrated to GHC, so there should be no relevant difference. However, using unicode-data allows to select the exact version of Unicode to support, therefore not relying on the version supported by GHC.

Unicode database version update

To update the Unicode version please update the version number in ucd.sh.

To download the Unicode database, run ucd.sh download from the top level directory of the repo to fetch the database in ./ucd.

$ ./ucd.sh download

To generate the Haskell data structure files from the downloaded database files, run ucd.sh generate from the top level directory of the repo.

$ ./ucd.sh generate

Running property doctests

Temporarily add QuickCheck to build depends of library.

$ cabal build
$ cabal-docspec --check-properties --property-variables c

Licensing

unicode-data is an open source project available under a liberal Apache-2.0 license.

Contributing

As an open project we welcome contributions.

Changes

Changelog

0.6.0 (July 2024)

Updated to Unicode 15.1.0.
Added showCodePoint to Unicode.Char.
Added intToDigiT to Unicode.Char.Numeric.

Removed

Removed deprecated isLetter and isSpace from Unicode.Char.General. Use the corresponding functions from Unicode.Char.General.Compat instead.
Remove deprecated isLower and isUpper from Unicode.Char.Case. Use the corresponding functions from Unicode.Char.Case.Compat instead.
Removed deprecated Unicode.Char.Numeric.isNumber. Use Unicode.Char.Numeric.Compat.isNumber instead.

Deprecations

Unicode.Char.General.isAlphaNum. Use Unicode.Char.General.Compat.isAlphaNum instead.

0.5.0 (July 2024)

Fix the inlining of Addr# literals and reduce their size. This results in a sensible decrease of the executable size.
Changed integerValue from Char -> Maybe Int to (Integral a) => Char -> Maybe a.

0.4.0.1 (December 2022)

Fix Unicode blocks handling on big-endian architectures.

0.4.0 (September 2022)

Update to Unicode 15.0.0.

0.3.1 (September 2022)

Added full case conversions to Unicode.Char.Case:
- Case folding: caseFoldMapping and toCaseFoldString.
- Lower case: lowerCaseMapping and toLowerString.
- Upper case: upperCaseMapping and toUpperString.
- Title case: titleCaseMapping and toTitleString.
- Stream mechanism: Unfold and Step.
Added isNumeric, numericValue and integerValue to Unicode.Char.Numeric.
Added the module Unicode.Char.General.Blocks.
Add compatibility module:
- Unicode.Char.Numeric.Compat

Deprecations

Unicode.Char.Numeric.isNumber: it will be replaced by isNumeric in a future version of this package. Use the function in Unicode.Char.Numeric.Compat instead.

0.3.0 (December 2021)

Support for big-endian architectures.
Added unicodeVersion.
Added GeneralCategory data type and corresponding generalCategoryAbbr, generalCategory functions.
Added the following functions to Unicode.Char.General: isAlphabetic, isAlphaNum, isControl, isMark, isPrint, isPunctuation, isSeparator, isSymbol and isWhiteSpace.
Added the module Unicode.Char.Numeric.
Add compatibility modules:
- Unicode.Char.General.Compat
- Unicode.Char.Case.Compat
These modules are compatible with base:Data.Char.
Re-export some functions from Data.Char in order to make Unicode.Char a drop-in replacement in a future version of this package.
Drop support for GHC 7.10.3

Deprecations

In Unicode.Char.Case:
- isUpper: use isUpperCase instead.
- isLower: use isLowerCase instead.
In Unicode.Char.General:
- isLetter: use isAlphabetic instead.
- isSpace: use isWhiteSpace instead.
In Unicode.Char: same as hereinabove. These functions will be replaced in a future release with the functions with the same names from Unicode.Char.Case.Compat and Unicode.Char.General.Compat.

0.2.0 (November 2021)

Update to Unicode 14.0.0.
Add Unicode.Char.Identifiers supporting Unicode Identifier and Pattern Syntax.

0.1.0.1 (Jul 2021)

Workaround to avoid incorrect display of dependencies on Hackage by moving build-depends of ucd2haskell executable under a build flag conditional.

0.1.0 (Jul 2021)

Initial release

unicode-data

Module documentation for 0.6.0

README

Performance

Partial integration of unicode-data into base

Unicode database version update

Running property doctests

Licensing

Contributing

Changes

Changelog

0.6.0 (July 2024)

Removed

Deprecations

0.5.0 (July 2024)

0.4.0.1 (December 2022)

0.4.0 (September 2022)

0.3.1 (September 2022)

Deprecations

0.3.0 (December 2021)

Deprecations

0.2.0 (November 2021)

0.1.0.1 (Jul 2021)

0.1.0 (Jul 2021)

Partial integration of `unicode-data` into `base`