λ-view: On the design of an efficient builder monoid

Lately (here and here), I reported on promising results about the efficient construction of lazy bytestrings with a large average chunk size using the blaze-builder library. In the meantime, I have done some more research, which further confirmed the strength of the design underlying the blaze-builder library. However before reporting on more results, I'd like to present you its design in the hope that you can apply the underlying ideas to other performance critical areas you are working on.

The Foreign module provides us with all the functions we require for working with pointers of various forms. This includes the ForeignPtr pointers used by strict bytestrings to reference their underlying buffer.

  import Foreign
  import Data.Monoid
  import qualified Data.ByteString as S
  import qualified Data.ByteString.Internal as S

A builder should represent a sequence of bytes such that the following two objectives are maximized:

the efficiency of writing the represented sequence of bytes to a sequence of buffers (e.g., the chunks of a lazy bytestring) and
the efficiency of appending two builders.

In our design, the work of a builder is handled by build steps. Once we tell a build step where it can start to write and where the buffer ends, the build step will write all the bytes it represents and return a build signal that tells us how to proceed.

If the build signal is Done pf, then the build step has completed its work and the next free byte in the buffer is pointed to by pf.
If the build signal is BufferFull requiredSize pf nextStep, then the build step has filled the buffer up to pf and now requires a new buffer with at least the size requiredSize. If we were creating a lazy bytestring, we would now ship of the full buffer as a chunk and allocate a new buffer that we can pass to nextStep.
The build signal InsertByteString pf bs nextStep, tells us that the build step has filled the buffer up to pf and would now like to insert a bytestring directly into the output sequence of buffers. The idea behind this signal is that it allows us to avoid copying large bytestrings.

A builder is just a build step parametrized over the build step that should be executed after the builder has done its output.

  type BuildStep = Ptr Word8 -- first free byte of buffer
-> Ptr Word8 -- first byte after buffer
-> IO BuildSignal

  data BuildSignal = -- next free byte
   Done !(Ptr Word8)
   | BufferFull !Int !(Ptr Word8) !BuildStep
   | InsertByteString !(Ptr Word8) !S.ByteString !BuildStep

  newtype Builder = Builder (BuildStep -> BuildStep)

Based on these definitions, we can easily define a Monoid instance that models concatenation of builders. An empty builder just returns the build step to be executed afterwards. We append two builders by telling the first builder that it should call the second builder once its done and by telling the second builder that it should call the continuation builder k once it is done.

  instance Monoid Builder where
   mempty = Builder id
   mappend (Builder b1) (Builder b2) = Builder $ \k -> b1 (b2 k)

Writing an actual builder requires some care, as we are working in the IO monad and, hence, the safety belts are off. We describe the construction of builders that serialize values that do not have to be wrapped over buffer boundaries using the notion of a Write. A value Write size io denotes an atomic write to a buffer of size bytes, which can be executed by a call io with a pointer to the first byte that should be written.

  data Write = Write Int (Ptr Word8 -> IO ())

  writeWord8 :: Word8 -> Write
  writeWord8 x = Write 1 (\pf -> poke pf x)

Defining writes is simple. Constructing a builder from a write is also not difficult once you have understood how to define a function such that it can call the BufferFull signal with a reference to itself.

  fromWrite :: Write -> Builder
  fromWrite (Write size io) =
   Builder step
   where
   step !k !pf !pe
   | pf' <= pe = do io pf
   k pf' pe
   | otherwise = do return $ BufferFull size pf (step k)
   where
   pf' = pf `plusPtr` size

Once such basic builder constructors are defined, the code quickly looses it's C-like style, while retaining good performance.

  fromWord8 :: Word8 -> Builder
  fromWord8 = fromWrite . writeWord8

  fromWord8s :: [Word8] -> Builder
  fromWord8s = mconcat . map fromWord8

  buildABC :: Builder
  buildABC = fromWord8s [65..90]

The missing piece is now a driver function that actually runs a builder. A very nice property of our builder design is that it completely decouples the allocation strategy for the output buffers from the actual writing to them. Hence, whatever buffer you have ready, you can tell a builder to fill it. The only caveat is that a build step may require a large buffer than you can provide. However, all builders I implemented up to now can be wrapped at almost every point. Therefore, they would even work with a very small output buffer and the above caveat likely never applies; i.e., expensive solutions to get rid of it are OK.

An additional advantage of separating the buffer allocation strategy from the buffer filling is that the whole state of the allocation strategy is kept out of the performance critical functions; i.e., the concatenation of builders and their implementation itself. This is one of the core differences to the builder implementation provided by Data.Binary.Builder from the binary package.

The driver I implement here runs a builder and executes an IO action on each full buffer, which we represent by a strict bytestring. We can for example use this driver to send the bytes represented by a builder over the network using Network.Socket.ByteString. This driver is slightly simpler to implement than the generation of lazy bytestrings. Moreover, this driver has the nice property that no unsafePerformIO is involved, which means that there are no other semantic pitfalls than the ones already provided by the IO monad ;-)

  toByteStringIOWith :: Int -> (S.ByteString -> IO ()) -> Builder -> IO ()
  toByteStringIOWith bufSize io (Builder b) =
   fillBuffer bufSize (b finalStep)
   where
   finalStep pf _ = return $ Done pf
   fillBuffer !size step = do
   S.mallocByteString size >>= fill
   where
   fill fpbuf = do
   let !pf = unsafeForeignPtrToPtr fpbuf
   -- safe due to later reference of fpbuf
   -- BETTER than withForeignPtr, as we lose a tail call otherwise
   signal <- step pf (pf `plusPtr` size)
   case signal of
   Done pf' -> io $ S.PS fpbuf 0 (pf' `minusPtr` pf)
   BufferFull minSize pf' nextStep -> do
   io $ S.PS fpbuf 0 (pf' `minusPtr` pf)
   fillBuffer (max bufSize minSize) nextStep
   InsertByteString pf' bs nextStep -> do
   io $ S.PS fpbuf 0 (pf' `minusPtr` pf)
   io $ bs
   fillBuffer bufSize nextStep

The implementation is rather straightforward. We construct a final step that we can hand over to the builder. Then we allocate a buffer and fill it using the given step. We handle the signals as described above and recurse using a tail call to fill the next buffer. The tail call is important to get good performance and to not waste stack space. A misplaced (outermost) withForeignPtr can easily destroy this property. Therefore, I switched to using unsafeForeignPtrToPtr and reasoning directly about when it is safe to extract the pointer of a foreign pointer.

Lets test our driver with an artificially small chunk size.

  testABC :: IO ()
  testABC = toByteStringIOWith 10 (putStrLn . show) buildABC

  *BuilderDesign> testABC
  "ABCDEFGHIJ"
  "KLMNOPQRST"
  "UVWXYZ"

Yihaa, nice :-) Now, you have seen the core design of the builder monoid provided by he blaze-builder library. It gains a lot of its performance from its simplicity and the option to safely break the abstraction and implement build steps directly when needed. What I was hiding in the above definitions are the {-# UNPACK #-} pragmas and a (not yet so judicious) use of {-# INLINE #-} pragmas. See the source and the accompanying benchmarks of the blaze-builder library for more information on them.

Let me give two last examples of how we improve performance without loosing code maintainability. The key to these examples is the Write abstraction. Writing of a list of elements calls for a typical optimization: move invariant data out of the inner loop. We can implement it once and for all as follows.

  fromWriteList :: (a -> Write) -> [a] -> Builder
  fromWriteList write =
   \xs -> Builder $ step xs
   where
   step xs0 !k !pf0 !pe0 = go xs0 pf0
   where
   go [] !pf = k pf pe0
   go xs@(x':xs') !pf
   | pf' <= pe0 = io pf >> go xs' pf'
   | otherwise = return $ BufferFull size pf (step xs k)
   where
   !pf' = pf `plusPtr` size
   Write size io = write x'

Using inlining, we can give the compiler enough information such that it can optimize the code for the actual implementations of calls like fromWriteList writeWord8.

Another optimization often used in the blaze-builder library is to ensure that costly operations are amortized over enough bytes being output. For example, this is why it is OK for buffer wrapping to be a bit more expensive. For consecutive writes, we can reduce the cost of appending them using the following Monoid instance.

  instance Monoid Write where
   mempty = Write 0 (const $ return ())
   mappend (Write l1 f1) (Write l2 f2) = Write (l1 + l2) $ \ptr -> do
   f1 ptr
   f2 (ptr `plusPtr` l1)

If the compiler knows all the implementations of the writes, then he will produce very nice straight-line code for builders constructed from mappend'ed writes. A typical example would be the serialization of a Word16 represented by two bytes b1 and b2, as fromWrite (writeWord8 b0 `mappend` writeWord8 b1).

Well, that's it for now. I hope you did enjoy this excursion into some not always Haskell-like code. For me, the development of this library was (and still is) very entertaining and inspiring and I'd love to hear about how it fares in practice.

λ-view

Thursday, November 18, 2010

On the design of an efficient builder monoid

No comments:

Post a Comment