Query Execution Optimization for Broadcast Join using Replicated-Reads Strategy – George Wang, Ahana

Query Execution Optimization for Broadcast Join using Replicated-Reads Strategy – George Wang, Ahana

Today presto supports broadcast join by having a worker to fetch data from a small data source to build a hash table and then sending the entire data over the network to all other workers for hash lookup probed by large data source. This can be optimized by a new query execution strategy as source data from small tables is pulled directly by all workers which is known as replicated reads from dimension tables. This feature comes with a nice caching property given that all worker nodes N are now participating in scanning the data from remote sources. The table scan operation for dimension tables is cacheable per all worker nodes. In addition, there will be better resource utilization because the presto scheduler can now reduce the number plan fragment to execute as the same workers run tasks in parallel within a single stage to reduce data shuffles.

HermesDB – Integrated Presto with a lucene-based Query Engine – Yue Long, Tencent

HermesDB – Integrated Presto with a lucene-based Query Engine – Yue Long, Tencent

HermesDB is the next generation of OLAP engine at Tencent with the architecture featuring separation of storage and calculation. HermesDB characterizes efficient indexing files in storage data, equipping with customized Presto as the core query engine. With the help of Presto connector, HermesDB could not only support full ANSI syntax but also ultilize Apache Lucene as underlying computer core. Besides, we are in the progress of improving the end-to-end performance with the newly released Java Vector APIs, acclecerating different kinds of complex computations with SIMD instructions. According to the benchmark(SSB) we have, HermesDB outperformances other mainstream C++ based MPP engines.