🗄️ 2026-05-05 Data Engineering News Digest | Highlights: Microsoft Built a Monster That Needs 900TB/Second of Useful Data | 2026 Vector Database Selection Guide


(Brief summaries with original links. Items are collected as-is; please verify them yourself.)


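:fire: Trending Topics

The following topics were covered by multiple sources at the same time.

:fire::fire: Apache Iceberg v3: Multiple Vendors Line Up Behind It

Summary: Apache Iceberg v3 introduces Row Lineage, Deletion Vectors, and the VARIANT data type. Databricks, Snowflake, BigQuery, AWS EMR, Dremio, and other major vendors announced Iceberg v3 support in public preview at nearly the same time. This signals that the open lakehouse format has become an industry consensus and ends the trade-off between performance and interoperability.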

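:fire: ClickHouse: $15B Valuation, Langfuse Acquisition, Postgres Service Launch

Summary: ClickHouse closed a $400M Series D early this year at a $15 billion valuation, acquired the LLM observability company Langfuse, and launched a native Postgres service, expanding from a pure analytics database toward an AI infrastructure platform.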

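:fire: Apache Spark 4.0: Major Architectural Upgrade

Summary: Spark 4.0 is the biggest version change since the project began: a new VARIANT data type, native SQL UDFs, a redesigned infrastructure architecture, a 20-40% query performance improvement, and a JDK 17+ requirement. Alibaba Cloud EMR Serverless Spark already supports 4.0.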


:star: Top Headlines

1. ClickHouse Raises $400M Series D, Valued at $15 Billion, Acquires Langfuse

  • :memo: Summary: ClickHouse closes a $400M Series D at a $15 billion valuation and acquires Langfuse, an AI observability company
  • Source: ClickHouse Official | Funding: $400M Series D
  • https://clickhouse.com/company/news

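:light_bulb: In-depth commentary: ClickHouse's move from a real-time analytics database to an AI infrastructure platform is a sharp strategic bet. Acquiring Langfuse fills its gap in LLM observability, and the Postgres service targets the convergence of OLTP and OLAP. Behind the $15 billion valuation is ARR growth of more than 250%, which suggests the market accepts the idea of "analytics as infrastructure". Head-on competition with Snowflake and Databricks, however, has only just begun.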

2. Apache Iceberg v3 in Public Preview on Databricks — Row Lineage, Deletion Vectors, VARIANT

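:light_bulb: In-depth commentary: The three headline features of Iceberg v3 go straight at lakehouse pain points. Row Lineage makes incremental processing traceable, Deletion Vectors address the performance bottleneck of Merge-on-Read, and VARIANT unifies the handling of semi-structured data. More importantly, with Databricks, Snowflake, BigQuery, and others backing it at the same time, the contest between open table formats is largely settled, and the future belongs to Iceberg.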
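
The Deletion Vectors point in the commentary above can be illustrated with a toy model. The sketch below is plain Python, not the Iceberg API: `DataFile`, `delete_where`, and `scan` are hypothetical names invented for illustration, and a Python set stands in for the compact bitmap a real deletion vector would use. It only shows the general merge-on-read idea: a delete records row positions instead of rewriting the data, and readers skip those positions at scan time.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy stand-in for an immutable data file: just a list of rows."""
    rows: list
    # Positions of logically deleted rows. A plain set is an assumption made
    # for readability; it is not how a real table format stores this.
    deleted_positions: set = field(default_factory=set)

    def delete_where(self, predicate) -> int:
        """Mark matching rows as deleted without rewriting the row data."""
        hits = {i for i, row in enumerate(self.rows) if predicate(row)}
        newly_deleted = hits - self.deleted_positions
        self.deleted_positions |= newly_deleted
        return len(newly_deleted)

    def scan(self) -> list:
        """Merge-on-read: skip deleted positions while reading."""
        return [row for i, row in enumerate(self.rows)
                if i not in self.deleted_positions]


data_file = DataFile(rows=[
    {"id": 1, "status": "active"},
    {"id": 2, "status": "churned"},
    {"id": 3, "status": "active"},
    {"id": 4, "status": "churned"},
])
removed = data_file.delete_where(lambda row: row["status"] == "churned")
live_ids = [row["id"] for row in data_file.scan()]
print(removed, live_ids)  # 2 [1, 3]
```

In this model the delete touches only the small position set while the four stored rows stay as written, which is the trade that keeps deletes cheap at write time and moves the filtering cost to the reader.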

3. Apache Spark 4.0 Unveiled with Major Performance Gains and Architectural Overhaul

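:light_bulb: In-depth commentary: Spark 4.0 is the project's largest architectural upgrade in a decade. The VARIANT type echoes Iceberg v3, and native SQL UDFs bring data analysis closer to the traditional database experience, but the migration cost of mandatory JDK 17 and the dropped Scala 2.12 support should not be underestimated. Alibaba Cloud EMR has already adapted to it, which suggests that Chinese cloud vendors are following Spark 4.0 quickly.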

4. Snowflake Cortex Code CLI Adds dbt and Apache Airflow Support

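:light_bulb: In-depth commentary: Snowflake is extending its AI coding assistant into data engineering. Support for dbt and Airflow means data engineers can get context-aware AI assistance in their local development environment. This marks a shift in AI coding tools from general-purpose code generation toward deep integration with vertical domains, and it is good news for the dbt and Airflow ecosystems.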


:counterclockwise_arrows_button: ETL / Data Pipelines (Airflow / dbt / DolphinScheduler)

1. How AI is Transforming Modern Data Pipelines

2. Airflow DAGs, Tasks, and Operators: A Complete Beginner’s Walkthrough

3. Fixing Floating-Point Drift While Speeding Up CSV Ingestion (7.75s→2.7s)

4. Case Study: Reducing Data Ingestion Latency by 96.4% (24.5x Speedup)

5. Three Data Integration Trends for 2026: From Batch Processing to Real-Time, Event-Driven


:bar_chart: Compute Engines (Spark / Presto / Trino)

1. Apache Spark 4.0 — VARIANT, Native SQL UDF, 20-40% Performance Gains

2. Spark 4.2.0-preview1 Released for Community Testing

  • :memo: Summary: Spark 4.2.0 preview released for community testing
  • Source: Apache Spark Official | Version: 4.2.0-preview1
  • https://spark.apache.org/news/

apache / spark

  • :memo: Apache Spark - A unified analytics engine for large-scale data processing
  • :backhand_index_pointing_right: Language: Scala | :star: +8 today
  • https://github.com/apache/spark

apache / datafusion-comet

apache / gluten

  • :memo: Gluten is a middle layer responsible for offloading JVM-based SQL engines’ execution to native engines
  • :backhand_index_pointing_right: Language: Scala | :star: +1 today
  • https://github.com/apache/gluten

:ocean: Real-Time Stream Processing (Kafka / Flink)

1. Apache Kafka 4.2.0 Release — Server-Side Rebalance GA, Dead Letter Queue Support

2. Confluent Cloud Q1 2026 — A2A Integration for Streaming Agents

3. Flink CDC 3.6.0 Release — Extends Flink 1.20.x and 2.2.x Support

  • :memo: Summary: Flink CDC 3.6.0 released, extending support to Flink 1.20.x and 2.2.x
  • Source: Apache Flink Official | Version: 3.6.0
  • https://flink.apache.org/

4. Event-Driven Architectures for AI Pipelines: Kafka + Flink Technical Deep Dive

AutoMQ / automq

  • :memo: AutoMQ is a diskless Kafka® on S3. 10x Cost-Effective. No Cross-AZ Traffic Cost. Autoscale in seconds
  • :backhand_index_pointing_right: Language: Java | :star: +19 today
  • https://github.com/AutoMQ/automq

:building_construction: Lakehouse (Iceberg / Hudi / Delta Lake)

1. Apache Iceberg v3 — Row Lineage, Deletion Vectors, VARIANT

2. Four Major Vendors Suddenly Line Up Behind Iceberg v3

3. Databricks Delta Sharing First-Class Support for Iceberg Format

4. Apache Iceberg Rust 0.9.0 and Python 0.11.0 Released

  • :memo: Summary: Apache Iceberg Rust 0.9.0 and Python 0.11.0 released
  • Source: Apache Iceberg Official | Contributors: 50+ contributors, 28 first-timers
  • https://iceberg.apache.org/blog/

delta-io / delta

  • :memo: An open-source storage framework that enables building a Lakehouse architecture
  • :backhand_index_pointing_right: Language: Scala | :star: +2 today
  • https://github.com/delta-io/delta

5. Apache Doris + Paimon: A Faster Lakehouse for Web3 On-Chain Analytics


:chart_increasing: OLAP Engines (ClickHouse / Doris / StarRocks)

1. ClickHouse Raises $400M, Valued at $15B, Acquires Langfuse, Launches Postgres

2. ClickHouse vs StarRocks 2026: Real-Time Analytics Database Comparison

3. Does ClickHouse Support UPDATEs? A 2026 Data Analysis

4. A Complete Guide to Mainstream Open-Source Data Warehouses in 2026: From ClickHouse to Doris


:magnifying_glass_tilted_left: Vector Databases (Milvus / Weaviate)

1. 2026 Vector Database Selection Guide: Deep Analysis of Qdrant, Pinecone, Milvus, Weaviate, and Chroma

  • :memo: Original title: 2026年向量数据库选型指南:Qdrant、Pinecone、Milvus、Weaviate 与 Chroma 深度解析
  • Source: Juejin | Topic: vector database selection
  • https://juejin.cn/post/7629524163644981311

2. Vector Databases Are a Necessary Evil, Not a Silver Bullet


:hammer_and_wrench: DataOps / Data Governance

1. Key Tools and Trends Shaping Data Engineering in 2026

2. Data Engineering in 2026: From ETL to Full Autonomy

3. What Is Apache Polaris? Why Open Data Catalogs Matter

4. 6 Best DataOps Tools Compared & Reviewed for 2026


:package: Open-Source Projects & Tools

cocoindex-io / cocoindex

sansan0 / TrendRadar

  • :memo: AI-driven public opinion & trend monitor with multi-platform aggregation, RSS, and smart alerts
  • :backhand_index_pointing_right: Language: Python | :star: +288 today
  • https://github.com/sansan0/TrendRadar

:rocket: Bypassing the Python GIL: How I Processed 10M Rows in 0.26s with C

I built a DuckDB extension to handle chemistry data without pandas or RDKit

From Transactions to Insights: How OLTP and OLAP Work Together in Modern Data Pipelines

The Fairwater Paradox: Microsoft Built a Monster That Needs 900TB/Second of USEFUL Data

Deploying Apache Superset on Azure From Scratch

Bringing Sexy Back to Data Engineering: Automating BigQuery and Looker Sync


:light_bulb: Editor's Picks

  1. Apache Iceberg v3 multi-vendor backing: the contest between open table formats is largely settled, and Iceberg becomes the de facto lakehouse standard https://www.databricks.com/blog/next-era-open-lakehouse-apache-icebergtm-v3-public-preview-databricks
  2. ClickHouse $15B valuation and Langfuse acquisition: a full pivot from analytics database to AI infrastructure platform https://siliconangle.com/2026/01/16/database-maker-clickhouse-raises-400m-acquires-ai-observability-startup-langfuse/
  3. Snowflake Cortex Code CLI supports dbt and Airflow: AI coding tools integrate deeply into the data engineering vertical https://techintelpro.com/news/ai/enterprise-ai/snowflake-cortex-code-cli-adds-dbt-airflow-support