1/18 I'm incredibly excited to announce @duckdb in @PostgreSQL in the latest version of @paradedb. When we set out to bring fast analytics to Postgres earlier this year, we did not expect the journey to take us here. Here's what happened 🧵
Support for Apache Iceberg tables in Postgres has arrived in the 0.8.0 release of pg_lakehouse 🎉 Another surprise: the extension is now powered by DuckDB 🦆 blog.paradedb.com/pages/iceb…
2/18 The first product we built was pg_search, Elastic-level full-text search in Postgres. During the @ycombinator S23 batch we kept hearing complaints about needing to ETL data over to Elastic from Postgres and set out to fix it
3/18 We did it by integrating the Tantivy search library, made by @fulmicoton and the @Quickwit_Inc team, into Postgres. pg_search is a pillar of @ParadeDB and has put us on the map with some of the largest enterprises in the world. It is still under active development
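(Not ParadeDB's code — just a minimal, self-contained sketch of the Tantivy primitives pg_search builds on, assuming a recent tantivy crate; the field name and sample documents here are made up for illustration.)

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Schema with one indexed, stored text field.
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT | STORED);
    let schema = schema_builder.build();

    // Index a couple of toy documents in memory.
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(body => "full-text search inside postgres"))?;
    writer.add_document(doc!(body => "analytics over parquet files"))?;
    writer.commit()?;

    // Run a BM25-scored query and print the top hits.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![body]).parse_query("postgres search")?;
    for (score, addr) in searcher.search(&query, &TopDocs::with_limit(5))? {
        println!("{addr:?} scored {score}");
    }
    Ok(())
}
```

pg_search wraps this kind of indexing and BM25 scoring behind Postgres indexes and SQL, so you never touch the library directly.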
4/18 In late 2023, our pg_search users told us the key reason keeping them on Elastic was its support for both full-text search and fast analytics in the same queries. That got us looking into analytics in Postgres
5/18 We discovered great Postgres OLAP products like @citusdata columnar, @Greenplum, @AWSredshift, and a few others. They focused on data warehousing, while we wanted to focus on user-facing search+analytics
6/18 We launched pg_analytics in February, at the time the world's fastest analytics in Postgres, by integrating the @ApacheDataFusio query engine into a Postgres Table Access Method and swapping the storage for Parquet files. The extension blew up on @hackernews, and we got a ton of feedback
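(Again a hedged sketch rather than the pg_analytics source: this is roughly what handing a query to DataFusion over Parquet looks like, assuming the datafusion and tokio crates; the table name and file path are hypothetical.)

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a Parquet file as a queryable table (path is illustrative).
    let ctx = SessionContext::new();
    ctx.register_parquet("events", "data/events.parquet", ParquetReadOptions::default())
        .await?;

    // DataFusion plans and executes the aggregation over the Parquet data.
    let df = ctx.sql("SELECT count(*) FROM events").await?;
    df.show().await?;
    Ok(())
}
```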
7/18 People liked the idea, but it had too many limitations. TAMs are extremely complex and still too limited to offer this functionality well. The folks at @orioledb (now part of @supabase) are working alongside Postgres core committers to improve the already-amazing TAM API even further
8/18 Moreover, our users told us that their analytics data lives in @ApacheIceberg and @DeltaLakeOSS table formats in cloud object stores like AWS S3, GCS, and Azure Blob. In most cases, it just wasn't in Postgres itself
9/18 What they really wanted was to query it from Postgres to expose user-facing analytics and combine it with OLTP and full-text search data. pg_analytics was directionally correct, but it wasn't quite right, especially given its technical limitations
10/18 We deprecated pg_analytics and shortly after released pg_lakehouse - our new Postgres analytics extension built as a Foreign Data Wrapper. It is designed for fast analytics from Postgres over cloud object storage file formats (Parquet, etc.) and table formats (Iceberg, etc.)
11/18 Thanks to the FDW API, we don't need to modify Postgres storage, which means pg_lakehouse does not have the limitations of pg_analytics. It works out-of-the-box with the wider Postgres ecosystem, and we could build it rapidly in @rustlang thanks to the wrappers project from the amazing folks at @supabase
12/18 The first version of pg_lakehouse, which we released a few weeks ago, was also built on @ApacheDataFusio. However, many limitations started to emerge: DataFusion is best suited to deep integration at the query and storage layers, as we had set out to do with pg_analytics
13/18 I was fortunate to meet @andygrove_io @andrewlamb1111 @wesmckinn @qphou @sam_synnada and a few of the other amazing folks building DataFusion. We love DataFusion, but it was no longer the right tool for us. Our needs had changed, but our DataFusion love story isn't over
14/18 Moreover, our users kept trying to query pg_lakehouse using @duckdb syntax, and we knew we had to switch. That brings us to today's release of pg_lakehouse v2, built on @duckdb. More on the technical details in the blog post, written by my co-founder @themingying (seriously, the man can write)
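(A flavor of what running DuckDB in-process looks like — a hedged sketch using the duckdb Rust crate, not pg_lakehouse internals; the file path is made up, and reading straight from S3/GCS/Azure would additionally need DuckDB's httpfs extension.)

```rust
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    // An in-process, in-memory DuckDB instance.
    let conn = Connection::open_in_memory()?;

    // DuckDB scans the Parquet file directly and returns the row count.
    let count: i64 = conn.query_row(
        "SELECT count(*) FROM read_parquet('data/events.parquet')",
        [],
        |row| row.get(0),
    )?;
    println!("rows: {count}");
    Ok(())
}
```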
15/18 @duckdb is a wonderful library that just announced v1.0.0. It belongs in Postgres, and the best place to put it is as a foreign data wrapper. We've been inspired by the amazing folks at @crunchydata, who are building a similar product for Crunchy Bridge for Analytics. We hope pg_lakehouse can be the open-source, community-led version for all those who aren't Crunchy users
16/18 As for @ApacheDataFusio, we're planning to bring it back elsewhere. We first came to analytics from full-text search, and there's a lot DataFusion can bring to enabling fast search+analytics in the same queries. I know @fulmicoton is already thinking about a few things here, and so are we... Stay tuned!
17/18 We're hard at work building an Elasticsearch alternative on Postgres. The future of data infra is zero-ETL between Postgres tools, with no heavy migrations needed. We're committed to making it happen, and we're just getting started
18/18 Until then, we're going back to work. There's a lot more left to do on pg_lakehouse. If you find this interesting, we welcome contributions. You can find our repo here: github.com/paradedb/paradedb. And of course, we always appreciate a ⭐️ ✌️🐘 🧵/

Jun 27, 2024 · 2:31 PM UTC
