In the last post, I introduced the blaze-builder library and stressed the point that it is important to ensure a large average chunk size for the constructed lazy bytestrings. In this post, I'll give a concrete example of why ensuring a large average chunk size matters.
You probably know the nice zlib library that allows you to compress a lazy bytestring with a single call to
compress :: L.ByteString -> L.ByteString
I will use compress to illustrate that small chunk sizes can be costly. Actually, they can be so costly that it is worth it to first "defragment" the lazy bytestring before compressing it. Using the blaze-builder library, defragmentation is easily defined as follows.
defragment :: L.ByteString -> L.ByteString
defragment = toLazyByteString . fromLazyByteString
The builder created using fromLazyByteString copies chunks of up to 8kb into the output buffer and inserts larger chunks directly into the output stream. This way we can guarantee a minimal average chunk size of 4kb, no matter when the output buffer has to be flushed because of the direct insertion of an 8kb block.
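To make the copy-or-insert policy more tangible, here is a minimal sketch that models it with plain bytestring functions. It is not the implementation used by fromLazyByteString: the names chunkSizes, fragment, and coalesce are hypothetical, and the model works on whole chunks instead of filling a preallocated buffer.

```haskell
import qualified Data.ByteString      as S
import qualified Data.ByteString.Lazy as L

-- | Sizes of the strict chunks that make up a lazy bytestring.
chunkSizes :: L.ByteString -> [Int]
chunkSizes = map S.length . L.toChunks

-- | Cut a strict bytestring into chunks of a fixed size (assumed > 0)
-- to obtain a deliberately fragmented lazy bytestring.
fragment :: Int -> S.ByteString -> L.ByteString
fragment n = L.fromChunks . go
  where
    go bs
      | S.null bs = []
      | otherwise = let (c, rest) = S.splitAt n bs in c : go rest

-- | Hypothetical model of the copy-or-insert policy: chunks shorter than
-- 'limit' are collected into a pending buffer of at most 'limit' bytes,
-- while chunks of at least 'limit' bytes are passed through untouched
-- after the pending buffer has been flushed.
coalesce :: Int -> L.ByteString -> L.ByteString
coalesce limit = L.fromChunks . go [] 0 . L.toChunks
  where
    go pending _ [] = flush pending
    go pending n (c : cs)
      | S.length c >= limit    = flush pending ++ c : go [] 0 cs
      | n + S.length c > limit = flush pending ++ go [c] (S.length c) cs
      | otherwise              = go (c : pending) (n + S.length c) cs

    -- Copy the pending small chunks into one contiguous chunk.
    flush []      = []
    flush pending = [S.concat (reverse pending)]
```

In GHCi, comparing chunkSizes before and after coalesce 8192 on the result of fragment 16 shows how many tiny chunks collapse into a few large ones, while the bytes themselves stay unchanged.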
The following plot shows the measured times in boxplot format for defragmenting ("compaction only"), direct compression, and compression with preceding compaction of 200kb of data represented as a lazy bytestring for different fixed chunk sizes.
As in my previous post, the benchmarks were measured on a Core2 Duo T7500 with 2GB RAM and Linux 2.6.32-24 i686 and GHC 6.12.3. The corresponding measurement log can be found here. A log-log plot exhibiting more information on the behaviour of defragment for larger chunk sizes can be found here.
The above plot shows that compress profits heavily from defragmentation. Sadly, I do not yet know the cause of the significant slowdown of compress for lazy bytestrings with a small average chunk size. I guess it is a combined effect of the cost of the FFI calls (how big are they actually?) and perhaps some implementation overhead stemming from the large amount of state being threaded through the implementation of compress. Comments and clarifications are very much welcome.
Note that, as with filesystems, an even better solution than regular defragmentation is to avoid fragmentation in the first place, which you can achieve by using a Builder for constructing lazy bytestrings. Note also that fromLazyByteString currently does not wrap bytestrings around buffer boundaries, which results in some unnecessarily wasted memory for medium (1kb - 8kb) chunk sizes. I'll implement that in the near future and will post the new measurements here.
Wednesday, November 10, 2010
Sunday, November 7, 2010
The blaze-builder library: faster construction of bytestrings
Hi, I am Simon Meier, a Swiss Haskell enthusiast currently pursuing his PhD in computer science at ETH Zurich. In this blog post, I'll introduce you to the blaze-builder library.
The blaze-builder library provides you with a Builder type that you can use to efficiently construct sequences of bytes represented in a packed form as a strict or lazy bytestring. Hence, typical use cases for a Builder are saving your application data in a space efficient binary form to a file or sending a response to some request over the network.
You probably know the binary package, which also provides a Builder type in the Data.Binary.Builder module targeting exactly the same use case as our Builder. This is no coincidence. During this year's Google Summer of Code, Jasper Van der Jeugt and I developed the blaze-builder library to overcome performance shortcomings of Data.Binary.Builder with respect to the specific needs of the blaze-html HTML generation library. Since then, I have restructured the blaze-builder library to serve as a drop-in replacement for Data.Binary.Builder, which it improves upon with respect to both speed and expressivity.
Usage example
We start by importing the necessary modules. We also define a convenient abbreviation for mappend, which will actually become part of the base library, according to rumors I heard at this year's ZuriHac.
Our example is about serializing a very simple representation of a person to a sequence of bytes. As usual, this serialization also requires us to fix the encoding format. We encode strings using UTF-8 and prefix them with their length encoded as a 32-bit little-endian integer to make parsing unambiguous. We also encode the age of a person as a 32-bit little-endian integer. I guess the code speaks for itself.
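A minimal sketch of an encoder for this format, not the original listing, could look as follows. It assumes the Builder from Data.ByteString.Builder in the bytestring package rather than blaze-builder itself, so int32LE, stringUtf8, and lazyByteString are that module's names. The Person record and the function names are hypothetical, and the length prefix is assumed to count encoded bytes, not characters.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy    as L

-- Hypothetical record; the field names are assumptions.
data Person = Person { name :: String, age :: Int }

-- | UTF-8 encode a string and prefix it with the length of the encoded
-- bytes as a 32-bit little-endian integer.
lengthPrefixedUtf8 :: String -> B.Builder
lengthPrefixedUtf8 s = mconcat
    [ B.int32LE (fromIntegral (L.length payload))
    , B.lazyByteString payload
    ]
  where
    -- Encode first, so that the prefix counts bytes instead of characters.
    payload = B.toLazyByteString (B.stringUtf8 s)

-- | A person is the length-prefixed name followed by the age.
personBuilder :: Person -> B.Builder
personBuilder p = mconcat
    [ lengthPrefixedUtf8 (name p)
    , B.int32LE (fromIntegral (age p))
    ]

-- | Builders are combined with the Monoid operations and executed once.
encodePeople :: [Person] -> L.ByteString
encodePeople = B.toLazyByteString . mconcat . map personBuilder
```

Encoding the payload before writing the prefix costs an extra copy of each string. A real encoder might avoid this, but it keeps the sketch short and the byte count correct for non-ASCII names.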
The above code is typical for serialization code based on builders. One uses the predefined functions for creating builders with a fixed encoding format from standard Haskell values. These builders are then combined using the functions from the Monoid typeclass. Builders essentially store the recipe for building their corresponding sequence of bytes. Once one needs a concrete representation of this sequence of bytes, one just calls toLazyByteString or toByteString to execute that recipe.
The benefit of using builders to construct a bytestring is twofold: First, appending two builders is an O(1) operation, which is also efficient in absolute terms, as it corresponds to a single function call. Second, when constructing the resulting lazy bytestring, blaze-builder makes sure that the average chunk size is large. A large average chunk size is important to make good use of cache prefetching in later processing steps (e.g. compression) and it also reduces the system call overhead when writing the resulting lazy bytestring to a file or sending it over the network.
For example, the above code results in the following sequence of chunk sizes.
The 170001 bytes represented by lazyBinaryCloneVillage feature an average chunk size of ~24kb. The first buffer is only ~4kb large, because for short output sequences the buffer allocation cost is significant. toLazyByteString compensates for this cost by allocating the first buffer with the minimal expected chunk size. Note that these chunk sizes reflect the default settings of toLazyByteString, which is optimized to yield efficient and well-chunked results for all lengths of output sequences. If you know more about your typical serialization tasks, then you can tune these settings in your favor.
Speaking of efficiency, I'm quite sure you would also like to see some benchmark figures. I'm not going to present the figures for the above example. Not because they are embarrassing; they are not. However, without good competition, the interpretation of benchmark figures is difficult; and currently, I don't know of a good competitor for the above use case. However, we can also use builders to pack a [Word8] list into a strict or lazy bytestring; and there, we definitely do have good competitors.
Packing [Word8]
For our benchmark, we use the following implementations for packing [Word8] lists.
The implementations S.pack, L.pack, declPackLazy, and binaryDeclPackLazy are trivial. The implementations packStrict and packLazy make use of fromWord8s :: [Word8] -> Builder, which is a very efficient function to serialize lists of bytes, as the following plot shows.
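For illustration, a "declarative" packing function in the spirit of declPackLazy can be sketched as follows. This is an assumption about its shape, not the benchmarked definition. It is written against Data.ByteString.Builder from the bytestring package instead of blaze-builder, it does not cover fromWord8s, and the name declPackStrict is hypothetical.

```haskell
import qualified Data.ByteString         as S
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy    as L
import           Data.Word (Word8)

-- | Declarative packing: one single-byte builder per list element,
-- all of them combined with the Monoid operations.
declPackLazy :: [Word8] -> L.ByteString
declPackLazy = B.toLazyByteString . mconcat . map B.word8

-- | Strict variant: run the builder, then copy all chunks into a
-- single strict bytestring.
declPackStrict :: [Word8] -> S.ByteString
declPackStrict = L.toStrict . declPackLazy
```

Both functions produce the same bytes as L.pack and S.pack. Only the way the output buffer gets filled differs, which is exactly what the benchmark measures.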
The plot is a log-log plot of the mean time for packing [Word8] lists using the above implementations when being run on a Core2 Duo T7500 with 2GB RAM and Linux 2.6.32-24 i686 and GHC 6.12.3. I created this plot by adapting Bryan O'Sullivan's excellent Criterion benchmarking library to handle scaling benchmarks (cf. ScalingBenchmarks.hs). In the spirit of Criterion, I also generate a boxplot version for every scaling benchmark (using more transparent lines to draw the quartiles and whiskers), which allows us to judge the quality of the measurements. The boxplot version of the above plot shows that nothing went wrong during its measurement.
Note that the mean times are plotted with respect to a logarithmic scale. Hence, a constant difference between two graphs means a constant factor improvement. As you can see from the measurement log, using blaze-builder is a definite win for output sequences longer than 1kb: packStrict beats S.pack by almost a factor of 2 and packLazy beats L.pack by a factor of 10 and binaryDeclPackLazy by a factor of 92 (!).
The crucial ingredient for this improvement is the fromWord8s function. It is constructed using the Write abstraction Jasper introduced during his work on blaze-html. The function fromWrite8List forces and writes eight list elements at a time, which allows the compiler to bundle the actual writes to the output buffer.
For shorter output sequences, the improvement gained from using blaze-builder gets smaller and S.pack is even faster for very short sequences. The following plot, its boxplot version, and the measurement log give a more detailed comparison for such short sequences.
The results are not surprising when comparing the implementations: packStrict uses toByteString, which simply runs toLazyByteString and copies all chunks into a single buffer of the appropriate size. Hence, packStrict is always slightly slower than packLazy. The S.pack function from Data.ByteString works in two passes over the input list: first, it determines the length of the list and then it copies all bytes to the allocated buffer. Traversing linked lists of bytes is costly and pays off only for very short lists, as there the output buffer allocation cost is dominant. The peak of packLazy at 64 bytes stems from the fact that it first allocates a 64 byte buffer, which is copied to a 4kb buffer once it's clear that more than 64 bytes are output. This is done to compensate for the buffer allocation cost for very short output sequences. It can be switched off using toLazyByteStringWith, if required.
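The two-pass strategy described for S.pack can be sketched as follows. This is a simplified illustration and not the actual source of Data.ByteString. The name packTwoPass is hypothetical, and the sketch relies on unsafeCreate from Data.ByteString.Internal to allocate a buffer of known size.

```haskell
import qualified Data.ByteString          as S
import qualified Data.ByteString.Internal as SI
import           Data.Word        (Word8)
import           Foreign.Ptr      (Ptr, plusPtr)
import           Foreign.Storable (poke)

-- | Pack a list of bytes in two passes over the list.
packTwoPass :: [Word8] -> S.ByteString
packTwoPass ws =
    -- Pass 1: 'length' traverses the whole list to size the buffer.
    SI.unsafeCreate (length ws) (\buf -> go buf ws)
  where
    -- Pass 2: traverse the list again and write each byte in turn.
    go :: Ptr Word8 -> [Word8] -> IO ()
    go _ []       = return ()
    go p (x : xs) = poke p x >> go (p `plusPtr` 1) xs
```

The buffer has exactly the right size, so nothing is copied afterwards. The price is that the list is traversed twice and has to be kept in memory between the two passes.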
Conclusions
The blaze-builder library provides an expressive and efficient way to construct both lazy and strict bytestrings. The accompanying benchmarks show that it improves (often significantly) in all cases over Data.Binary.Builder from the binary package. The benchmarks presented in this post also show that the implementation of blaze-builder compares favorably against special-purpose functions for packing [Word8] lists, on a Core2 Duo T7500 with 2GB RAM and Linux 2.6.32-24 i686 and GHC 6.12.3. Yeah, that's what the benchmarks state ;-). However, I expect that the conclusions drawn from them also remain valid for most other settings. For example, the GHC-7.0.1 release candidate makes Data.Binary.Builder run a bit faster, but still not as fast as blaze-builder.
During the work on blaze-html, I learned from several benchmarks that ensuring a large average chunk size is very important for lazy bytestrings to be efficient. However, many encoding functions on Hackage produce bytestrings or lazy bytestrings. Hence, we have to copy their result again to guarantee large average chunk sizes, which is a waste of resources. Hence, I suggest that encoding functions produce a builder instead of strict or lazy bytestrings. Apart from guaranteeing a fast append and a large average chunk size, this change also simplifies and generalizes the encoding code, as it separates the buffer allocation strategy from the encoding function.
In order for such a change to be effective, I suggest that the bytestring library itself provide an implementation of Data.ByteString.Builder, which would offer a blessed way to incrementally create bytestrings. The blaze-builder library offers one possible implementation path for such a bytestring builder. If the community sees fit, then I'd be happy to port the builder parts to the bytestring library. The string encodings currently provided by blaze-builder would then move into their own libraries.
Well, that's it for now. I will publish more of the experiments I have done during the work on blaze-builder once I find some more time. I'm also very much looking forward to your feedback.
Happy packing :-p