Tecton is the leading feature platform for real-time machine learning. Rather than build new SQL engines from scratch, Tecton connects to your existing engine to transform raw data into features for machine learning. This talk will cover Tecton’s new integration with Athena for feature engineering. Derek will demonstrate how Tecton with Athena is the fastest way to build feature pipelines and put new models in production.
SQL remains ubiquitous for data retrieval and analytics, yet can be tedious to write, and is downright unusable for business users. The 2-5 business day turnaround time for data projects is both disruptive and frustrating for business users. Data teams are becoming increasingly overwhelmed, and organizations are pushing to empower their “citizen data analysts.” With the advent of AI English-to-SQL platforms like AskEdith, now anyone can work with and query Presto using plain English questions. AskEdith integrates natively with web interfaces like Ahana for a seamless analytics experience.
Ending DAG Distress: Building Self-Orchestrating Pipelines for Presto – Roy Hasson, Upsolver
dbt and Airflow are a popular combination for creating and scheduling batch data modeling and transformation jobs that execute in a data warehouse like Snowflake. Presto users querying the data lake need a similarly simple solution that makes it easy to ingest, model, transform and maintain datasets without having to write or manage complex DAGs. In this session you will learn how Upsolver built a tool that allows engineers, developers and analysts to write data pipelines using SQL. Pipelines are automatically orchestrated, are data-aware and maintain a consistent data contract between each stage of the pipeline. You will also learn how to introduce the idea of data products into your company to enable more self-service for your Presto users.
There has been a proliferation of tools in different categories of the modern data stack. This talk will focus on the Headless BI category and Cube’s implementation of Headless BI. Headless BI injects a component between data warehouses and other data sources and tools on the other side of the stack (e.g. CDP, data exploration tools, custom data apps, etc.). This new component encapsulates several critical functions like data modeling, access control, and aggregate awareness while deliberately omitting others, like data visualization and presentation. We’ll explore:
– Keeping data models separate from data sources and not substituting data modeling with mere data transformation.
– Managing access control centrally, aggregate awareness, and caching in a separate layer upstack from data consumers.
– Removing data presentation features and embracing data accessibility via a set of APIs.
PrestoDB recently underwent major architectural updates as the Presto Foundation grows its membership and looks to greatly increase the number of new commits and forks. Reaching that goal required refactoring Presto and improving its already impressive speed, efficiency, reliability, and extensibility. Establishing PrestoDB as a premier open source project required a major commitment of time and resources from Meta, both to ensure the community can benefit from this project for years to come and to position PrestoDB to evolve beyond what Meta alone could create. Members of the Presto Foundation need more of you to get involved in this major evolution of Presto’s history and core components, and to bring your own inventive ideas to the mix.
A Git-like Repository for your Data Lake – Vinodhini Sivakami Duraisamy, Treeverse
We tend to adopt practices that improve the flexibility of development and the velocity of code deployment, but how confident are we that the complex data system is safe once it arrives in production? We must be able to experiment in production and automate actions while minimizing customer pain and reducing damage to code and data. If your product’s value is derived from data in the shape of analytics or machine learning, losing it, or having corrupted data, can easily translate into pain. In this session, you will discover how chaos engineering principles apply to distributed data systems and the tools that enable us to make our data workloads more resilient.
AWS Lake Formation is a service that allows data platform users to set up a secure data lake in days. Creating a data lake with Presto and AWS Lake Formation is as simple as defining data sources and what data access and security policies you want to apply. In this talk, Wen will walk through the recently announced AWS Lake Formation and Ahana integration.
Facebook operates Presto at an enormous scale. A critical part of the success of Presto is properly tuning the clusters according to the use case they target. Swapnil Tailor, Basar Onat and Tim Meehan describe important session properties and configuration properties used to configure Presto, and offer guidance on when and how to use them.
Presto is widely used at Bytedance in several areas, such as the data warehouse, BI tools, and ads. The Presto team at Bytedance has also delivered many key features and optimizations, such as the Hive UDF wrapper, coordinator, runtime filter and so on, which extend Presto’s usage and improve its stability. Nowadays, most companies use both Hive (or Spark) and Presto together. But Presto UDFs have very different syntax and internal mechanisms compared with Hive UDFs. This restricts Presto usage, because users need to maintain two kinds of functions. In this talk, we will present a way to execute Hive UDFs/UDAFs inside Presto.
PrestoDB is built to be cloud agnostic and container-friendly, but getting it to run on Kubernetes in the cloud can be challenging. In this talk, Gary Stafford (AWS) and Dipti Borkar (Ahana) will discuss:
– Why use the in-VPC deployment model with AWS, with a demo.
– Deploying PrestoDB on AWS EKS using the Ahana Cloud managed service within the user’s AWS account.
Apache Parquet is the major columnar file storage format used by Presto and several other query engines in many big data analytics frameworks today. In many use cases, a portion of the column data is highly sensitive and must be protected. Column encryption at the file format level is supported in the Parquet community. Because Presto uses its own rewritten Parquet code, Parquet column encryption has to be ported to the Presto code base with modifications. Integration with Key Management Service (KMS) and with other query engines like Hive and Spark is another challenge. In this talk, we will show the work we have done to enable Parquet column decryption in Presto, including challenges, solutions, and integration with Hive/Spark Parquet column encryption, and look ahead to the next steps of the encryption work.
In this talk we are going to introduce Presto cross-environment query federation, which will enable query execution across Presto clusters in different clouds and on-prem. This helps reduce network data transfer, which results in lower egress and ingress costs when querying across clouds.
In this talk we are going to introduce Hudi, discuss its different table and query types, and show how Hudi integrates with Presto to support these queries. We would like to share our experience of how this integration has evolved over time and also discuss upcoming file listing and query planning improvements for Hudi queries in Presto.
Here, Chunxu and Beinan would like to share what they have learned in developing a highly-scalable query predictor service through applying machine learning algorithms to ~10 million historical Presto queries to classify queries based on their CPU times and peak memory bytes. At Twitter, this service is helping to improve the performance of Presto clusters and provide expected execution statistics on Business Intelligence dashboards.
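As a rough illustration of the classification targets mentioned above, the sketch below assigns each historical query a coarse CPU-time and peak-memory category, which is the kind of label a classifier could then be trained to predict. Everything here is an assumption for illustration only: the abstract does not give the bucket boundaries, the label names, or the record layout, so `CPU_BOUNDS_SECONDS`, `MEMORY_BOUNDS_BYTES`, `QueryStats` and `label_query` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical bucket boundaries; the abstract does not state the real ones.
CPU_BOUNDS_SECONDS = [30.0, 3600.0]        # < 30 s, 30 s to < 1 h, >= 1 h
MEMORY_BOUNDS_BYTES = [1 << 30, 1 << 40]   # < 1 GiB, 1 GiB to < 1 TiB, >= 1 TiB
LABELS = ["low", "medium", "high"]


@dataclass
class QueryStats:
    """Hypothetical record of one historical query's resource usage."""
    sql: str
    cpu_seconds: float
    peak_memory_bytes: int


def bucket(value, bounds):
    """Return the label of the bucket that `value` falls into."""
    # bisect_right counts how many boundaries are <= value.
    return LABELS[bisect_right(bounds, value)]


def label_query(q):
    """Return (cpu_label, memory_label) for one historical query."""
    return (
        bucket(q.cpu_seconds, CPU_BOUNDS_SECONDS),
        bucket(q.peak_memory_bytes, MEMORY_BOUNDS_BYTES),
    )
```

Labels produced this way could be paired with features extracted from the SQL text to form a training set; the model itself is out of scope for this sketch.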