Do we still need Impala and Kudu?

Hi @wesm,

I'm really impressed with Apache Arrow. It's a real game changer.

We're currently experimenting with Dremio on AWS, and it's very promising. We're comparing its cost against our previous architecture.

Our stack is basically Parquet, Dremio/Arrow, and Hudi on top of S3, with Spark as the compute engine.
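For context, here's roughly how we land data into one of these Hudi tables from Spark. The bucket, table name, and key fields below are just placeholders, not our actual setup:

```python
# Minimal PySpark sketch of upserting a Hudi table on S3.
# Paths, table name, and field names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-ingest")
    # Hudi requires Kryo serialization; the Hudi Spark bundle jar
    # must also be on the classpath and match your Spark version.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.json("s3a://my-bucket/raw/events/")  # hypothetical input

(
    df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3a://my-bucket/lake/events/")
)
```

Dremio then queries the resulting Parquet files directly off S3.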

Is Impala + Kudu still relevant today? I know that stack from my past experience with Cloudera in banking.

Also, why didn't you mention Hudi in your previous talk? How does it compare with Iceberg and Databricks' Delta Lake?

Cheers

I think Kudu is still interesting as a distributed column store backend for analytics workloads where you have a high volume of inserts/updates/deletes. I’m not sure how widely deployed the Impala + Kudu stack is (or other combinations of Kudu + SQL Engine) but would be interested to see some non-biased benchmark analyses to show where Kudu is a good fit versus alternatives.

Note that Kudu recently added an Arrow-compatible columnar orientation to its client protocol – there’s a WIP patch that adds the glue to get column batches out as pyarrow.RecordBatch here

https://gerrit.cloudera.org/#/c/15661/
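Until that patch lands, getting scan results into Arrow from Python means going through row tuples. A rough sketch, with a made-up master host, table, and column names (the patch above would replace the row-by-row transpose with native columnar batches):

```python
# Sketch: pull a Kudu scan into a pyarrow.Table via today's row-oriented
# Python client. Host, table, and column names are hypothetical.
import kudu
import pyarrow as pa

client = kudu.connect(host="kudu-master", port=7051)
table = client.table("metrics")

scanner = table.scanner()
scanner.open()
rows = scanner.read_all_tuples()  # list of row tuples

# Transpose rows into columns and wrap them as an Arrow table.
names = ["ts", "device_id", "value"]  # hypothetical schema
columns = list(zip(*rows)) if rows else [[] for _ in names]
arrow_table = pa.table({n: list(c) for n, c in zip(names, columns)})
```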

Thanks Wes,

I had lots of issues using Cloudera in banking before: we couldn't update data efficiently, and business users sometimes needed adjustments. Kudu brought water to the desert, since we could just use it directly, and it's also high performance (C++) with fewer OOMs.

Arrow/Dremio changed this view, because it's open source and it saves cost, especially for a startup, so it's a huge win. For one client we currently don't have any real-time use case, so we can just dump everything into S3/Parquet and use Dremio.
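That dump path is simple. A minimal sketch with pyarrow (bucket, region, and columns are placeholders):

```python
# Minimal sketch of the "dump into S3/Parquet" path with pyarrow.
# Bucket, region, and column names are hypothetical; Dremio then
# queries the resulting Parquet files in place.
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # hypothetical region

batch = pa.table({
    "event_id": [1, 2, 3],
    "ts": ["2020-05-01T00:00:00", "2020-05-01T00:00:01", "2020-05-01T00:00:02"],
    "value": [0.1, 0.2, 0.3],
})

# Note: paths passed with an explicit filesystem omit the s3:// scheme.
pq.write_table(batch, "my-bucket/lake/events/part-0.parquet", filesystem=s3)
```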

Another client of mine has a real-time use case coming up for IoT/O&G/HFT that requires high-density visualization, downsampling for viz, range queries over time series, analytics, etc.
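To make the downsampling requirement concrete, here's the kind of thing I mean, sketched client-side with pandas (rates and column names are made up):

```python
# Rough sketch: downsample a high-rate series to something a dashboard
# can plot. Frequencies and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2020-05-01", periods=1_000_000, freq="10ms"),
    "value": range(1_000_000),
}).set_index("ts")

# Reduce 100 Hz data to 1-second means for visualization.
downsampled = df["value"].resample("1s").mean()
```

Doing this server-side, close to the data, is what I'd really want from a time series engine.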

Is there any time series database powered by Arrow? I can't find one. My options would be the Dremio approach on S3/Parquet, or Kudu/Impala/Dremio.

Any thoughts?