1/18 I'm incredibly excited to announce @duckdb in @PostgreSQL in the latest version of @paradedb. When we set out to bring fast analytics to Postgres earlier this year, we did not expect the journey to take us here. Here's what happened 🧵
Support for Apache Iceberg tables in Postgres has arrived in the 0.8.0 release of pg_lakehouse 🎉 Another surprise: the extension is now powered by DuckDB 🦆 blog.paradedb.com/pages/iceb…
2/18 The first product we built was pg_search, Elastic-level full-text search in Postgres. During the @ycombinator S23 batch we kept hearing complaints about needing to ETL data over to Elastic from Postgres and set out to fix it
3/18 We did it by integrating the Tantivy search library, made by @fulmicoton and the @Quickwit_Inc team, into Postgres. pg_search is a pillar of @ParadeDB and has put us on the map with some of the largest enterprises in the world. It is still under active development
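(Not ParadeDB's code — just a minimal, self-contained sketch of the Tantivy primitives pg_search builds on, assuming a recent tantivy crate; the field name and sample documents here are made up for illustration.)

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Schema with one indexed, stored text field.
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT | STORED);
    let schema = schema_builder.build();

    // Index a couple of toy documents in memory.
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(body => "full-text search inside postgres"))?;
    writer.add_document(doc!(body => "analytics over parquet files"))?;
    writer.commit()?;

    // Run a BM25-scored query and print the top hits.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![body]).parse_query("postgres search")?;
    for (score, addr) in searcher.search(&query, &TopDocs::with_limit(5))? {
        println!("{addr:?} scored {score}");
    }
    Ok(())
}
```

pg_search wraps this kind of indexing and BM25 scoring behind Postgres indexes and SQL, so you never touch the library directly.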
4/18 In late 2023, our pg_search users told us the key reason keeping them on Elastic was its support for both full-text search and fast analytics in the same queries. That got us looking into analytics in Postgres
5/18 We discovered great Postgres OLAP products like @citusdata columnar, @Greenplum, @AWSredshift, and a few others. They focused on data warehousing, while we wanted to focus on user-facing search+analytics
6/18 We launched pg_analytics in February, at the time the world's fastest analytics in Postgres, by integrating the @ApacheDataFusio query engine into a Postgres Table Access Method and swapping the storage for Parquet files. The extension blew up on @hackernews, and we got a ton of feedback
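(Again a hedged sketch rather than the pg_analytics source: this is roughly what handing a query to DataFusion over Parquet looks like, assuming the datafusion and tokio crates; the table name and file path are hypothetical.)

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a Parquet file as a queryable table (path is illustrative).
    let ctx = SessionContext::new();
    ctx.register_parquet("events", "data/events.parquet", ParquetReadOptions::default())
        .await?;

    // DataFusion plans and executes the aggregation over the Parquet data.
    let df = ctx.sql("SELECT count(*) FROM events").await?;
    df.show().await?;
    Ok(())
}
```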
7/18 People liked the idea, but it had too many limitations. TAMs are extremely complex and still too limited to offer this functionality well. The folks at @orioledb (now part of @supabase) are working alongside Postgres core committers to improve the already-amazing TAM API even further
8/18 Moreover, our users told us that their analytics data lives in @ApacheIceberg and @DeltaLakeOSS table formats in cloud object stores like AWS S3, GCS, and Azure Blob. In most cases, it just wasn't in Postgres itself
9/18 What they really wanted was to query it from Postgres to expose user-facing analytics and combine it with OLTP and full-text search data. pg_analytics was directionally correct, but it wasn't quite right, especially given its technical limitations
10/18 We deprecated pg_analytics and shortly after released pg_lakehouse - our new Postgres analytics extension built as a Foreign Data Wrapper. It is designed for fast analytics from Postgres over cloud object storage file formats (Parquet, etc.) and table formats (Iceberg, etc.)
11/18 Thanks to the FDW API, we don't need to modify Postgres storage, which means pg_lakehouse does not have the limitations of pg_analytics. It works out-of-the-box with the wider Postgres ecosystem, and we could build it rapidly in @rustlang thanks to the wrappers project from the amazing folks at @supabase
12/18 The first version of pg_lakehouse, which we released a few weeks ago, was also built on @ApacheDataFusio. However, many limitations started to emerge: DataFusion is best suited to deep integration at the query and storage layers, as we had set out to do with pg_analytics
13/18 I was fortunate to meet @andygrove_io @andrewlamb1111 @wesmckinn @qphou @sam_synnada and a few of the other amazing folks building DataFusion. We love DataFusion, but it was no longer the right tool for us. Our needs had changed, but our DataFusion love story isn't over
14/18 Moreover, our users kept trying to query pg_lakehouse using @duckdb syntax, and we knew we had to switch. That brings us to today's release of pg_lakehouse v2, built on @duckdb. More on the technical details in the blog post, written by my co-founder @themingying (seriously, the man can write)
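(A flavor of what running DuckDB in-process looks like — a hedged sketch using the duckdb Rust crate, not pg_lakehouse internals; the file path is made up, and reading straight from S3/GCS/Azure would additionally need DuckDB's httpfs extension.)

```rust
use duckdb::{Connection, Result};

fn main() -> Result<()> {
    // An in-process, in-memory DuckDB instance.
    let conn = Connection::open_in_memory()?;

    // DuckDB scans the Parquet file directly and returns the row count.
    let count: i64 = conn.query_row(
        "SELECT count(*) FROM read_parquet('data/events.parquet')",
        [],
        |row| row.get(0),
    )?;
    println!("rows: {count}");
    Ok(())
}
```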
15/18 @duckdb is a wonderful library that just announced v1.0.0. It belongs in Postgres, and the best place to put it is as a foreign data wrapper. We've been inspired by the amazing folks at @crunchydata, who are building a similar product for Crunchy Bridge for Analytics. We hope pg_lakehouse can be the open-source, community-led version for all those who aren't Crunchy users
16/18 As for @ApacheDataFusio, we're planning to bring it back elsewhere. We first came to analytics from full-text search, and there's a lot DataFusion can bring to enabling fast search+analytics in the same queries. I know @fulmicoton is already thinking about a few things here, and so are we... Stay tuned!
17/18 We're hard at work building an Elasticsearch alternative on Postgres. The future of data infra is zero-ETL between Postgres tools, with no heavy migrations needed. We're committed to making it happen, and we're just getting started
18/18 Until then, we're going back to work. There's a lot more left to do on pg_lakehouse. If you find this interesting, we welcome contributions. You can find our repo here: github.com/paradedb/paradedb. And of course, we always appreciate a ⭐️ ✌️🐘 🧵/

Jun 27, 2024 · 2:31 PM UTC
