ALP: Adaptive Lossless floating-Point Compression - Leonardo Kuffó (CWI)

DSDSD - THE DUTCH SEMINAR ON DATA SYSTEMS DESIGN:
We hold biweekly talks on Fridays from 3:30 PM to 5 PM CET, for and by researchers and practitioners designing (and implementing) data systems. The objective is to establish a new forum for the Dutch data systems community to come together, foster collaborations between its members, and bring in high-quality international speakers. We invite all researchers working on related topics, especially Ph.D. students, to join the events. It is an excellent opportunity to receive early feedback from researchers in your field.

Website: https://dsdsd.da.cwi.nl/
Twitter: @dsdsdnl

Title: ALP: Adaptive Lossless floating-Point Compression

Abstract: In data science, floating-point data is more prominent than in
traditional database scenarios. IEEE 754 doubles do not exactly
represent most real values, introducing rounding errors in computations
and [de]serialization to text. These rounding errors inhibit the use of
existing lightweight compression schemes such as Delta and Frame Of
Reference (FOR), but recently new schemes were proposed: Gorilla, Chimp,
Chimp128, PseudoDecimals (PDE), Elf, and Patas. However, their
compression ratios are no better than those of general-purpose
compressors such as zstd, while their [de]compression is much slower
than Delta and FOR. We propose and evaluate ALP, which significantly
improves on these previous schemes in both speed and compression ratio.
We created
ALP after carefully studying the datasets used to evaluate the previous
schemes. To obtain speed, ALP is designed to fit vectorized execution.
This turned out to be key for also improving the compression ratio, as
we found in-vector commonalities to create compression opportunities.
ALP is an adaptive scheme that uses a strongly enhanced version of
PseudoDecimals for doubles that originated as decimals, and otherwise
uses vectorized compression of the front bits. Its high speeds stem from
our implementation in scalar code that auto-vectorizes, and an efficient
two-stage compression algorithm that first samples row-groups and then
vectors.
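
A minimal sketch of the decimal-encoding idea described above (not the authors' actual implementation; the fixed per-vector exponent, the exception handling, and all names here are illustrative assumptions). Each double is scaled by a power of ten and rounded to an integer; only values that decode back to the exact same double keep their integer codes (which schemes like FOR and bit-packing can then compress), while the rest are stored verbatim as exceptions:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative sketch only: scale doubles that originated as decimals by a
// power of ten, keep the integer codes when the round-trip is exact, and
// store the remaining values verbatim as exceptions.
struct EncodedVector {
    int exponent;                             // assumed per-vector power of ten
    std::vector<int64_t> digits;              // integer codes (FOR + bit-packing would follow)
    std::vector<size_t>  exception_positions; // values that did not round-trip
    std::vector<double>  exception_values;
};

EncodedVector encode(const std::vector<double>& values, int exponent) {
    EncodedVector out;
    out.exponent = exponent;
    const double scale = std::pow(10.0, exponent);
    for (size_t i = 0; i < values.size(); ++i) {
        const double d = values[i];
        const int64_t code = std::llround(d * scale);
        if (static_cast<double>(code) / scale == d) {  // lossless round-trip?
            out.digits.push_back(code);
        } else {
            out.digits.push_back(0);                   // placeholder, patched on decode
            out.exception_positions.push_back(i);
            out.exception_values.push_back(d);
        }
    }
    return out;
}

std::vector<double> decode(const EncodedVector& enc) {
    const double scale = std::pow(10.0, enc.exponent);
    std::vector<double> out(enc.digits.size());
    for (size_t i = 0; i < enc.digits.size(); ++i)
        out[i] = static_cast<double>(enc.digits[i]) / scale;
    for (size_t i = 0; i < enc.exception_positions.size(); ++i)
        out[enc.exception_positions[i]] = enc.exception_values[i];  // patch exceptions
    return out;
}

int main() {
    std::vector<double> v = {1.23, 4.56, 0.07, 3.14159265358979};  // last value becomes an exception
    EncodedVector enc = encode(v, 2);                              // scale by 10^2
    std::vector<double> dec = decode(enc);
    for (size_t i = 0; i < v.size(); ++i)
        std::printf("%.17g -> %.17g (%s)\n", v[i], dec[i], v[i] == dec[i] ? "exact" : "lossy");
    return 0;
}

Doubles that did not originate as decimals rarely round-trip under any power-of-ten scaling; per the abstract, ALP instead falls back to vectorized compression of the front bits for such data.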

Bio: MSc student at VU Amsterdam & UvA, currently doing research on data
compression at CWI in the Database Architectures research group. Former
researcher on opinion mining and social network analysis at ESPOL
University (Ecuador). Former data engineering intern at CERN and Amazon EU.
