Frames
Data frames for working with tabular data files
LTS Haskell 23.1: 0.7.4.2
Stackage Nightly 2024-12-09: 0.7.4.2
Latest on Hackage: 0.7.4.2
Frames-0.7.4.2@sha256:51f42e242535a12cef1fbc1059c3936110489fb098f4f56fcb1e0729eace37be,9493
Data Frames for Haskell
User-friendly, type safe, runtime efficient tooling for working with tabular data deserialized from comma-separated values (CSV) files. The type of each row of data is inferred from data, which can then be streamed from disk, or worked with in memory.
We provide streaming and in-memory interfaces for efficiently working with datasets that can be safely indexed by column names found in the data files themselves. This type safety of column access and manipulation is checked at compile time.
Use Cases
For a running example, we will use variations of the prestige.csv data set. Each row includes 7 columns, but we just want to compute the average ratio of income to prestige.
Clean Data
If you have CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. Frames provides TemplateHaskell machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.
We generate a collection of definitions by inspecting the data file at compile time (using tableTypes), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an in-core array of structures (AoS)). We’re going to compute the average ratio of two columns, so we’ll use the foldl library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that program.
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
module UncurryFold where
import qualified Control.Foldl as L
import Data.Vinyl.Curry ( runcurryX )
import Frames
-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes "Row" "test/data/prestige.csv"
loadRows :: IO (Frame Row)
loadRows = inCoreAoS (readTable "test/data/prestige.csv")
-- | Compute the ratio of income to prestige for a record containing
-- only those fields.
ratio :: Record '[Income, Prestige] -> Double
ratio = runcurryX (\i p -> fromIntegral i / p)
averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength
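The fold at the heart of this program can be tried without Frames or a data file. The following is a minimal sketch that assumes only the foldl package; the tuple type IncomePrestige and the sample values are hypothetical stand-ins for the Record '[Income, Prestige] projection that rcast produces in the program above.

```haskell
import qualified Control.Foldl as L

-- Hypothetical stand-in for the (income, prestige) projection of a row.
type IncomePrestige = (Int, Double)

-- Divide income by prestige after converting income to a Double.
ratio :: IncomePrestige -> Double
ratio (i, p) = fromIntegral i / p

-- The Applicative instance of Fold accumulates the sum and the count
-- together, so the input is traversed only once.
average :: L.Fold Double Double
average = (/) <$> L.sum <*> L.genericLength

-- premap applies 'ratio' to each element before it reaches 'average'.
averageRatio :: [IncomePrestige] -> Double
averageRatio = L.fold (L.premap ratio average)
```

Loaded in GHCi, averageRatio [(100, 50), (90, 30), (80, 20)] evaluates to 3.0, the mean of the ratios 2, 3 and 4.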
Missing Header Row
Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names do come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by rowGen we care to change, passing the result to tableTypes'.
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
module UncurryFoldNoHeader where
import qualified Control.Foldl as L
import Data.Vinyl.Curry ( runcurryX )
import Frames
import Frames.TH (rowGen, RowGen(..))
-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes' (rowGen "test/data/prestigeNoHeader.csv")
            { rowTypeName = "NoH"
            , columnNames = [ "Job", "Schooling", "Money", "Females"
                            , "Respect", "Census", "Category" ]
            , tablePrefix = "NoHead"}
loadRows :: IO (Frame NoH)
loadRows = inCoreAoS (readTableOpt noHParser "test/data/prestigeNoHeader.csv")
-- | Compute the ratio of money to respect for a record containing
-- only those fields.
ratio :: Record '[NoHeadMoney, NoHeadRespect] -> Double
ratio = runcurryX (\m r -> fromIntegral m / r)
averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength
Missing Data
Sometimes not every row has a value for every column. I went ahead and blanked the prestige column of every row whose type column was NA in prestige.csv. For example, the first such row now reads,
"athletes",11.44,8206,8.13,,3373,NA
We can no longer parse a Double for that row, so we will work with row types parameterized by a Maybe type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the prestige column was parsed, only keeping those rows for which it was not, then project the income column from those rows, and finally throw away Nothing elements.
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications, TypeOperators #-}
module UncurryFoldPartialData where
import qualified Control.Foldl as L
import Data.Maybe (isNothing)
import Data.Vinyl.XRec (toHKD)
import Frames
import Pipes (Producer, (>->))
import qualified Pipes.Prelude as P
-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
-- The prestige column has been left blank for rows whose "type" is
-- listed as "NA".
tableTypes "Row" "test/data/prestigePartial.csv"
-- | A pipes 'Producer' of our 'Row' type with a column functor of
-- 'Maybe'. That is, each element of each row may have failed to parse
-- from the CSV file.
maybeRows :: MonadSafe m => Producer (Rec (Maybe :. ElField) (RecordColumns Row)) m ()
maybeRows = readTableMaybe "test/data/prestigePartial.csv"
-- | Return the number of rows with unknown prestige, and the average
-- income of those rows.
incomeOfUnknownPrestige :: IO (Int, Double)
incomeOfUnknownPrestige =
  runSafeEffect . L.purely P.fold avg $
    maybeRows >-> P.filter prestigeUnknown >-> P.map getIncome >-> P.concat
  where avg = (\s l -> (l, s / fromIntegral l)) <$> L.sum <*> L.length
        getIncome = fmap fromIntegral . toHKD . rget @Income
        prestigeUnknown :: Rec (Maybe :. ElField) (RecordColumns Row) -> Bool
        prestigeUnknown = isNothing . toHKD . rget @Prestige
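The shape of this streaming pipeline (filter, project, drop Nothing values, fold) can be explored without a CSV file. The sketch below assumes only the pipes and foldl packages; the tuple type PartialRow and the in-memory sampleRows producer are hypothetical stand-ins for the Maybe-parameterized Frames rows that readTableMaybe yields, so no MonadSafe machinery is needed here.

```haskell
import qualified Control.Foldl as L
import Data.Maybe (isNothing)
import Pipes (Producer, each, (>->))
import qualified Pipes.Prelude as P

-- Hypothetical stand-in for a partially parsed row: (income, prestige).
type PartialRow = (Maybe Int, Maybe Double)

-- A small in-memory producer standing in for rows streamed from disk.
sampleRows :: Monad m => Producer PartialRow m ()
sampleRows = each
  [ (Just 8206, Nothing)
  , (Just 12351, Just 68.8)
  , (Nothing, Nothing)
  , (Just 4000, Nothing)
  ]

-- Keep rows whose prestige failed to parse, project the income column,
-- drop incomes that are Nothing (P.concat), then fold to a count and a mean.
incomeOfUnknownPrestige :: IO (Int, Double)
incomeOfUnknownPrestige =
  L.purely P.fold avg $
    sampleRows >-> P.filter (isNothing . snd)
               >-> P.map (fmap fromIntegral . fst)
               >-> P.concat
  where avg = (\s l -> (l, s / fromIntegral l)) <$> L.sum <*> L.length
```

With the sample data, three rows have unknown prestige, but one of them also lacks an income, so the action returns (2, 6103.0). As in the program above, the count reflects only the rows that survive P.concat.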
Tutorial
For comparison to working with data frames in other languages, see the tutorial.
Demos
There are various demos in the repository. Be sure to run the getdata build target to download the data files used by the demos! You can also download the data files manually and put them in a data directory in the directory from which you will be running the executables.
Contribute
You can build Frames via nix with the following command:
nix build .#Frames-8107 # or nix build .#Frames-921
This creates a ./result link in the current folder.
To get a development shell with all libraries, you can run:
nix develop .#Frames-921
To get just ghc and cabal in your shell, a simple nix develop will do.
Benchmarks
The benchmark shows several ways of dealing with data when you want to perform multiple traversals.
Another demo shows how to fuse multiple passes into one so that the full data set is never resident in memory. A Pandas version of a similar program is also provided for comparison.
This is a trivial program, but shows that performance is comparable to Pandas, and the memory savings of a compiled program are substantial.
First with Pandas,
$ nix-shell -p 'python3.withPackages (p: [p.pandas])' --run '$(which time) -f "%Uuser %Ssystem %Eelapsed %PCPU; %Mmaxresident KB" python benchmarks/panda.py'
28.087476512228815
-81.90356506136422
0.67user 0.04system 0:00.72elapsed 99%CPU; 79376maxresident KB
Then with Frames,
$ $(which time) -f '%Uuser %Ssystem %Eelapsed %PCPU; %Mmaxresident KB' dist-newstyle/build/x86_64-linux/ghc-8.10.4/Frames-0.7.2/x/benchdemo/build/benchdemo/benchdemo
28.087476512228815
-81.90356506136422
0.36user 0.00system 0:00.37elapsed 100%CPU; 5088maxresident KB
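The idea of fusing multiple passes into one can be sketched with the foldl package alone. This is an illustration under stated assumptions, not the benchmark program itself: the Point type and the choice of two column means are hypothetical.

```haskell
import qualified Control.Foldl as L

-- Hypothetical two-column row, for example (latitude, longitude).
type Point = (Double, Double)

-- Each component fold inspects one column. Combining them with the
-- Applicative instance of Fold yields a single fold that updates both
-- accumulators as each element passes by, so the data is traversed once.
meanBoth :: L.Fold Point (Double, Double)
meanBoth = (,) <$> L.premap fst L.mean <*> L.premap snd L.mean

summarize :: [Point] -> (Double, Double)
summarize = L.fold meanBoth
```

For instance, summarize [(1, 10), (3, 30)] evaluates to (2.0, 20.0). The same combined fold could be handed to a streaming consumer via L.purely, in which case the full data set never needs to be resident in memory.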
Changes
0.7.4
Replace the htoml package used in a test with tomland.
0.7.2
- Add writeCSVopts that accepts options to specify the CSV delimiter.
- Add inferencePrefix that controls how many lines of the input file are used for column type inference (default is 1000).
- Add readTableDebug that loads and parses a data frame as readTable, but additionally prints lines that failed to parse to stderr.
0.7.1
- Add showFrame, printFrame, takeRows, and dropRows to the Frames.Exploration module. These helpers for working with Frames are re-exported from the Frames module itself. Thanks to @chfin.
- GHC-9.0.1 support.
0.7.0
GHC-8.10 support in Vinyl requires a major version bump.
0.6.3
- Fix support for categorical column names that include spaces (@epn09)
0.6.0
Support external CSV tokenizers
Internal functionality is now defined more cleanly atop a stream of rows already broken into columns (rather than a stream of rows that we quietly break into columns ourself). This permits the use of external parsers such as provided by the new Frames-dsv package that supplies a CSV parser built atop hw-dsv.
The built-in CSV parser remains for ease of installation.
0.5.1
GHC 8.6 compatibility
0.5.0
- Renamed the rgetf and rputf exported by the Frames module to rgetField and rputField. This avoids clashing with the same names exported by vinyl and further advances the process of eliminating the old Frames Col type in favor of vinyl’s ElField.
- Add a ShowCSV class rather than leaning on overburdened Show instances.
- Add support for categorical column types: values of these types are one of a small number of textual values. Because they can only take on a small number of different text values, we can compactly represent values of these types as standard Haskell sum types.
0.4.0
- Added table joins in Data.Vinyl.Joins (Chris Hammill)
- Changed types of mapMethod and mapMethodV. These now rely on explicit TypeApplications rather than Proxy values.
0.3.0
- Pervasive use of pipes for CSV data loading
This provides better exception handling (file handles should be closed more reliably), and offers an interface point for customized handling of input texts. An example of this latter point is working with particular file encodings.
A breaking change is that operations that previously returned IO values now return MonadSafe constrained values.
- Adaptation of Data.Vinyl.Curry.runcurry to the Frames Record type
This simply strips the column name information from a row before applying the function from vinyl.
0.2.1
- Refactored to use the CoRec type provided by vinyl >= 0.6.0
- Fixed bug in typing mostly-numeric columns
Such columns must be represented as Text. Previously, we strove a bit too hard to avoid falling back to Text, resulting in dropping rows containing non-numeric values for columns we crammed into a numeric type.
- Minor optimization of CSV parsing
In particular, dealing with RFC4180 style quoting
- GHC-8.2.1 compatibility
0.1.10
- Added CSV output functions: produceCSV and writeCSV
- Added an Eq instance for the Frame type
0.1.9
Fixed column type inference bug that led the inferencer to prefer Bool too strongly.
This was fallout from typing columns whose values are all 0 or 1 as Bool.
0.1.6
Re-export Frames.CSV.declareColumn from Frames. This makes it much easier to manually define column types.
0.1.4
Use microlens instead of lens-family-core for demos.
0.1.3
GHC-8.0.1 compatibility
0.1.2.1
Improved documentation based on suggestions by Alexander Kjeldaas
0.1.2
Fixed bug in Monoid instance of Frame (@dalejordan)
0.1.1.0
Added frameConsA, frameSnoc, and RecordColumns to help with changing row types.
0.1.0.0
Initial version pushed to hackage.