The ODIn Lab - CIDR Recap

How big is BIG and how fast is FAST? This seemed to be a recurring theme of the CIDR 2017 conference. A major point of many presentations, and something close to a general consensus, was that twenty years ago the RDBMS was the king of scaling to large data, but for some inexplicable reason it has lost ground to the ever-changing scope of BIG and FAST. Multiple papers attempted to address this problem in different ways, adding to the many tools already on the market for data stream processing and large computations, such as Spark, but there seemed to be no silver bullet. Reinforcing the theme that big data is too big, keynote talks by Emily Galt and Sam Madden drove the point home with different real-world scenarios and outlooks on the problem.

To break this theme apart, I’ll split the papers into groups and explain the different outlooks the authors took and how they addressed this common problem.

The papers ‘Prioritizing Attention in Analytic Monitoring’, ‘The Myria Big Data Management and Analytics System and Cloud Services’, ‘Weld: A Common Runtime for High Performance Data Analytics’, ‘A Database System with Amnesia’, and ‘Releasing Cloud Databases from the Chains of Performance Prediction Models’ focused on the theme that databases are not keeping pace with the rate at which data is growing. Sam Madden brought up an interesting point: hardware components like the bus are not the bottleneck in this system. With advances in big data computing like Apache Spark, it feels like the RDBMS is the end of the line, the place where data goes to die. These papers looked at different ways of addressing this. ‘A Database System with Amnesia’ looked at throwing out unused data, since most data in an RDBMS gets put in and never used again, and the increasing use of data streams only exacerbates the problem of not being able to process and store data fast enough.
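To make the "amnesia" idea a bit more concrete, here is a minimal sketch of a store that forgets rows once it goes over a fixed budget. This is not the paper's algorithm: I have swapped in a plain least-recently-used policy purely for illustration, and the `AmnesiacStore` class, its `budget` parameter, and the sample rows are all hypothetical names of my own.

```python
from collections import OrderedDict


class AmnesiacStore:
    """Toy key-value store that forgets the least recently used rows.

    Illustrative only: LRU is an assumed forgetting policy, not the one
    described in 'A Database System with Amnesia'.
    """

    def __init__(self, budget):
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        # Ordered from least recently used (front) to most recent (back).
        self._rows = OrderedDict()

    def insert(self, key, row):
        """Store a row, then forget old rows until we are within budget."""
        self._rows[key] = row
        self._rows.move_to_end(key)
        forgotten = []
        while len(self._rows) > self.budget:
            old_key, _ = self._rows.popitem(last=False)
            forgotten.append(old_key)
        return forgotten

    def get(self, key):
        """Return a row (or None if forgotten) and mark it as recently used."""
        if key not in self._rows:
            return None
        self._rows.move_to_end(key)
        return self._rows[key]

    def __len__(self):
        return len(self._rows)


store = AmnesiacStore(budget=2)
store.insert("a", {"reading": 1})
store.insert("b", {"reading": 2})
store.get("a")  # touching "a" makes "b" the least recently used row
forgotten = store.insert("c", {"reading": 3})
```

With a budget of two rows, the third insert pushes out "b", the row that has gone longest without being read. A real system would have to decide what "unused" means and whether to keep summaries of what it forgets; this sketch simply drops the row.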

The second common problem is that even if you can efficiently store and query your data lakes, humans often lack the ability to write queries efficiently or the necessary insight into how the data is formatted. The papers ‘The Data Civilizer System’, ‘Establishing Common Ground with Data Context’, ‘Adaptive Schema Databases’, and ‘Combining Design and Performance in a Data Visualization Management System’ all try to address this problem, but from slightly different angles. The Data Civilizer System and Adaptive Schema Databases aim to aid an analyst in schema and table exploration and to help them discover unknown or desired qualities of their data source. These papers tackle user insight with the kind of tooling that would otherwise exist only as internal middleware at large companies. The problem is that big data and messy data lakes are becoming more and more prevalent for other users as well: medium-sized businesses can be buried in data following user surges or new product upgrades, and government agencies can hold large amounts of uncleaned sensor and user-submitted data that they lack the skills or tools to manage.

To me, a large takeaway from this conference was that databases need a better way to handle big data. Databases are the hero big data needs AND the one it deserves. To get there, databases are going to need to relax their constraints on rigid schemas and ‘perfect’ data, which opens up many research opportunities, along with the realization that there might not currently be a ‘right’ answer to this problem. Either way, it should be interesting to see what sacrifices RDBMSs make to keep up with the growing amount of data, and whether they can apply decades' worth of research to this hot field that is looking for an answer.