construct

Haskell version of the Construct library for easy specification of file formats

Version on this page:0.3.1.2
LTS Haskell 23.4:0.3.2
Stackage Nightly 2024-12-09:0.3.1.2
Latest on Hackage:0.3.2

See all snapshots construct appears in

BSD-3-Clause licensed by Mario Blažević
Maintained by Mario Blažević
This version can be pinned in stack with:construct-0.3.1.2@sha256:1237b4e0e7aac7008d990b497191174e993e712715d9de3c731762d74f8a5d48,3019

Module documentation for 0.3.1.2

Construct.hs

This is a Haskell implementation of Python’s Construct library. It provides a succinct and easy way to specify data formats. Before you get to the succinct part, though, you’ll probably need a bunch of extensions and imports:

{-# LANGUAGE FlexibleInstances, StandaloneDeriving, TemplateHaskell #-}

module README where

import Data.Functor.Identity (Identity(Identity))
import Data.Word (Word8)
import Data.ByteString (ByteString)
import qualified Data.ByteString as ByteString
import qualified Data.ByteString.Char8 as ASCII
import qualified Rank2.TH
import Text.ParserCombinators.Incremental.LeftBiasedLocal (Parser, completeResults, feed, feedEof)
import Construct

import Prelude hiding ((*>), (<*))

Example

With that out of the way, let’s take the simple example format from the original. Here’s what its specification looks like in Haskell:

data BitMap f = BitMap{
   width :: f Word8,
   height :: f Word8,
   pixels :: f [[Word8]]
   }
deriving instance Show (BitMap Identity)
$(Rank2.TH.deriveAll ''BitMap)

format :: Format (Parser ByteString) Maybe ByteString (BitMap Identity)
format = literal (ASCII.pack "BMP") *> mfix (\this-> record
  BitMap{
        width= byte,
        height= byte,
        pixels= count (fromIntegral $ height this) $
                count (fromIntegral $ width this) byte
        })

There are two parts to the specification.

The data BitMap declaration specifies the in-memory layout of a simple bitmap. Note that it’s declared as a record with every field wrapped in a type constructor parameter. This declaration style is sometimes called Higher-Kinded Data. To regain a regular record, just instantiate the parameter to Identity — a BitMap Identity contains exactly one value of each field.

The other part of the specification is the format definition that specifies the bi-directional mapping between the in-memory and the serialized form of the bitmap. A bijection, to be precise. The two definitions are enough to automatically serialize the in-memory record form into the binary form:

-- |
-- >>> let record = BitMap{width= pure 3, height= pure 2, pixels= pure [[7,8,9], [11,12,13]]}
-- >>> serialize format record
-- Just "BMP\ETX\STX\a\b\t\v\f\r"

and to parse the serialized binary form back into the record structure:

-- |
-- >>> let bytes = ASCII.pack "BMP\ETX\STX\a\b\t\v\f\r"
-- >>> completeResults $ feedEof $ feed bytes $ parse format
-- [(BitMap {width = Identity 3, height = Identity 2, pixels = Identity [[7,8,9],[11,12,13]]},"")]

Examples of more complex and realistic formats can be found in the test directory.

Acknowledgements

I owe the inspiration for this library to Yair Chuchem and his post that introduced me to Construct. I must also express gratitude to the authors of the original library of course. And finally, to the authors of the paper Invertible Syntax Descriptions:Unifying Parsing and Pretty Printing which I remembered reading just in time to avoid following some bad ideas.

Implementation notes

I had to overcome two problems in the course of the implementation. The first difficulty, already touched on in the aforementioned blog post, is how to convert a record of formats into a format of the record. As the author of rank2classes, I went for an obvious solution: parameterize the record as seen in the example, make it an instance of the Rank2.Traversable class, and apply Rank2.traverse to it.

That little trick alone would have been enough for a nearly complete package, except the Python library also enables a record field to refer to any of the preceding fields. This is not a problem when serializing, but when parsing it means that a parser must be able to refer to an already-parsed part of the value that’s still being parsed.

The standard solution to the problem of accessing the results (parsed values) of a computation (parsing) while within the computation is known as the MonadFix class, though perhaps not as widely as it deserves. You won’t find any parsers in the list of its instances, though, and for a good reason - it’s quite impossible for a parser to obtain a value that it hasn’t parsed yet. All we need, luckily, is access an already parsed part of a partially parsed structure. A limited MonadFix instance can do that, provided that we also use a specialized form of Rank2.traverse capable of preserving the record structure while it’s being parsed.

As it happens, I have an old parser library named incremental-parser on Hackage, and the name seems quite appropriate for what I’m doing here. I added the necessary functionality there, but another parser combinator library should be capable of the same feat. It just needs to implement the mfix and fixSequence combinators and it can be used with the present library.

Changes

Revision history for construct

0.3.1.2 – 2023-12-19

  • Fixed Cabal warnings
  • Increment the bytestring, text, monoid-sublasses, and rank2classes dependencies’ upper bounds

0.3.1.1 – 2022-10-03

  • Increment the monoid-sublasses and input-parsers dependencies’ upper bounds

0.3.1 – 2022-03-25

  • Increment the text dependency upper bound
  • Move the Template Haskell splices so they compile with GHC 9

0.3.0.2 – 2021-03-22

  • Increment the attoparsec dependency upper bounds

0.3.0.1 – 2021-03-07

  • Increment the dependencies’ upper bounds

0.3 – 2020-07-18

  • Import the Input[Char]Parser classes from the input-parsers package

0.2.0.1 – 2020-03-08

  • Incremented the upper bound for the rank2classes dependency

0.2 – 2020-01-27

  • Updated for base-4.13
  • Eliminated the Monoid/Semigroup (n s) constraints
  • Added recordWith
  • Fixed the URI example and test
    • Constrained the URI format serialization to ByteString
    • Switched the URI parser to Attoparsec

0.1 – 2020-01-22

  • First version. Released on an unsuspecting world.