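2026-05-05 Data Engineering News Roundup
(Brief translated titles plus original links; collected only, please verify independently)
Trending Topics
The following topics were reported by multiple sources at the same time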

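Apache Iceberg v3: Multiple Vendors Line Up Together
Summary: Apache Iceberg v3 introduces Row Lineage, Deletion Vectors, and the VARIANT data type. Major vendors including Databricks, Snowflake, BigQuery, AWS EMR, and Dremio announced Iceberg v3 support in public preview at almost the same time. This signals that the open lakehouse format has become an industry consensus, putting an end to the "performance vs. interoperability" trade-off.
Related links:
- Databricks official blog: https://www.databricks.com/blog/next-era-open-lakehouse-apache-icebergtm-v3-public-preview-databricks
- Juejin in-depth analysis: https://juejin.cn/post/7628144748706447366
- Databricks Delta Sharing support for Iceberg: https://www.databricks.com/blog/announcing-first-class-support-iceberg-format-databricks-delta-sharing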
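The Deletion Vectors named in the summary above can be illustrated with a toy model. The sketch below is not Iceberg's implementation: the DataFile and DeletionVector classes and the scan function are hypothetical names, and a plain Python integer stands in for the compressed bitmap a real engine would use. It only shows the merge-on-read idea, where deletes are recorded as row positions and filtered out at read time instead of rewriting the data file.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Toy stand-in for an immutable data file; rows are addressed by position."""
    rows: list


@dataclass
class DeletionVector:
    """Toy deletion vector: bit i is set when the row at position i is deleted."""
    bits: int = 0

    def delete(self, position: int) -> None:
        # Mark a row position as deleted without touching the data file.
        self.bits |= 1 << position

    def is_deleted(self, position: int) -> bool:
        return (self.bits >> position) & 1 == 1


def scan(data_file: DataFile, dv: DeletionVector) -> list:
    """Merge-on-read: skip deleted positions while reading."""
    return [row for pos, row in enumerate(data_file.rows) if not dv.is_deleted(pos)]


data_file = DataFile(rows=["a", "b", "c", "d"])
dv = DeletionVector()
dv.delete(1)
dv.delete(3)

live_rows = scan(data_file, dv)
print(live_rows)        # ['a', 'c']
print(data_file.rows)   # ['a', 'b', 'c', 'd'] (the file itself is unchanged)
```

The design point is that a delete costs one bit flip rather than a file rewrite; the price is a small filter step on every read.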
ClickHouse: $15B Valuation, Langfuse Acquisition, and a Postgres Service
Summary: Early this year ClickHouse closed a $400M Series D at a $15 billion valuation, acquired the LLM observability company Langfuse, and launched a native Postgres service, expanding from a pure analytics database toward an AI infrastructure platform.
Related links:
- ClickHouse official news: https://clickhouse.com/company/news
- SiliconAngle report: https://siliconangle.com/2026/01/16/database-maker-clickhouse-raises-400m-acquires-ai-observability-startup-langfuse/
- TasrieIT in-depth analysis: https://tasrieit.com/blog/clickhouse-news-2026
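Apache Spark 4.0: Major Architectural Upgrade
Summary: Spark 4.0 is the most far-reaching version upgrade since the project began: a new VARIANT data type, native SQL UDFs, a redesigned infrastructure architecture, a 20-40% query performance improvement, and a JDK 17+ requirement. Alibaba Cloud EMR Serverless Spark has already been adapted to 4.0.
Related links:
- The Next Gen Tech Insider report: https://www.thenextgentechinsider.com/pulse/apache-spark-40-unveiled-with-major-performance-gains-and-architectural-overhaul
- Alibaba Cloud EMR adaptation report: https://blog.51cto.com/u_15316473/14564210
- Spark official news: https://spark.apache.org/news/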
Top Headlines
1. ClickHouse Raises $400M Series D, Valued at $15 Billion, Acquires Langfuse
- Source: ClickHouse Official | Funding: $400M Series D
- https://clickhouse.com/company/news
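Commentary: ClickHouse's shift from a real-time analytics database toward an AI infrastructure platform is strategically well judged. Acquiring Langfuse fills its gap in LLM observability, and the Postgres service targets the convergence of OLTP and OLAP. Behind the $15 billion valuation is ARR growth of more than 250%, which suggests the market accepts the "analytics as infrastructure" idea. Head-on competition with Snowflake and Databricks, however, has only just begun.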
2. Apache Iceberg v3 in Public Preview on Databricks — Row Lineage, Deletion Vectors, VARIANT
- Source: Databricks Blog | Impact: supported by multiple vendors at once
- https://www.databricks.com/blog/next-era-open-lakehouse-apache-icebergtm-v3-public-preview-databricks
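Commentary: The three headline features of Iceberg v3 target known pain points of lakehouse architectures: Row Lineage makes incremental processing traceable, Deletion Vectors address the performance bottleneck of Merge-on-Read, and VARIANT unifies the handling of semi-structured data. More importantly, Databricks, Snowflake, BigQuery, and others lining up at the same time suggests that the open table format contest is largely settled, with the future belonging to Iceberg.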
3. Apache Spark 4.0 Unveiled with Major Performance Gains and Architectural Overhaul
- Source: The Next Gen Tech Insider | Performance gain: 20-40%
- https://www.thenextgentechinsider.com/pulse/apache-spark-40-unveiled-with-major-performance-gains-and-architectural-overhaul
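Commentary: Spark 4.0 is the project's largest architectural upgrade in a decade. The VARIANT type mirrors the one in Iceberg v3, and native SQL UDFs bring data analysis closer to the traditional database experience, but the migration cost of the mandatory JDK 17 and the dropped Scala 2.12 support should not be underestimated. Alibaba Cloud EMR is among the first to adapt, which shows that Chinese cloud vendors are following Spark 4.0 quickly.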
4. Snowflake Cortex Code CLI Adds dbt and Apache Airflow Support
- Source: TechIntelPro | Status: GA
- https://techintelpro.com/news/ai/enterprise-ai/snowflake-cortex-code-cli-adds-dbt-airflow-support
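Commentary: Snowflake is extending its AI coding assistant into data engineering. Support for dbt and Airflow means data engineers can get context-aware AI assistance in their local development environment. This marks a shift in AI programming tools from general-purpose code generation toward deep integration in vertical domains, and it is a significant positive for the dbt and Airflow ecosystems.
ETL / Data Pipelines (Airflow / dbt / DolphinScheduler)
1. How AI is Transforming Modern Data Pipelines
- Source: dbt Blog | Topic: AI + data pipelines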
- https://www.getdbt.com/blog/how-ai-changes-data-pipelines
2. Airflow DAGs, Tasks, and Operators: A Complete Beginner’s Walkthrough
- Source: Dev.to | Reactions: 6
- https://dev.to/rose1845/airflow-dags-tasks-and-operators-a-complete-beginners-walkthrough-5gf3
3. Fixing Floating-Point Drift While Speeding Up CSV Ingestion (7.75s→2.7s)
- Source: Dev.to | Tags: python, c, performance
- https://dev.to/nareshcn2/fixing-floating-point-drift-while-speeding-up-csv-ingestion-775s-27s-10no
4. Case Study: Reducing Data Ingestion Latency by 96.4% (24.5x Speedup)
- Source: Dev.to | Tags: python, distributedsystems
- https://dev.to/nareshcn2/case-study-reducing-data-ingestion-latency-by-964-245x-speedup-520h
5. Three Data Integration Trends for 2026: From Batch Processing to Real-Time Event-Driven
- Source: Juejin | Topics: CDC, cloud-native ELT
- https://juejin.cn/post/7631469947235172386
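Items 3 and 4 above quote speedups in different forms (before/after times, a percentage reduction, and an "x" factor). Since this digest only collects links, a quick way to sanity-check such figures is to convert between the forms. The helper names below are hypothetical; the formulas are the standard ones, speedup = before / after and reduction = 1 - 1 / speedup.

```python
def speedup_from_times(before_s: float, after_s: float) -> float:
    """Speedup factor from a before/after pair of timings."""
    return before_s / after_s


def reduction_from_speedup(speedup: float) -> float:
    """Fractional latency reduction implied by a speedup factor."""
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional latency reduction."""
    return 1.0 / (1.0 - reduction)


# Item 3: 7.75 s down to 2.7 s.
csv_speedup = speedup_from_times(7.75, 2.7)
print(f"{csv_speedup:.2f}x, {reduction_from_speedup(csv_speedup):.1%} less time")
# prints: 2.87x, 65.2% less time

# Item 4: the title pairs a 96.4% reduction with a 24.5x speedup.
print(f"24.5x implies a {reduction_from_speedup(24.5):.1%} reduction")   # 95.9%
print(f"96.4% implies a {speedup_from_reduction(0.964):.1f}x speedup")   # 27.8x
```

Under these definitions a 24.5x speedup corresponds to roughly a 95.9% reduction, while a 96.4% reduction corresponds to roughly 27.8x, so the two figures in item 4's title may come from different measurements; the linked article is the place to check.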
Compute Engines (Spark / Presto / Trino)
1. Apache Spark 4.0 — VARIANT, Native SQL UDF, 20-40% Performance Gains
- Source: The Next Gen Tech Insider | Version: 4.0
- https://www.thenextgentechinsider.com/pulse/apache-spark-40-unveiled-with-major-performance-gains-and-architectural-overhaul
2. Spark 4.2.0-preview1 Released for Community Testing
- Source: Apache Spark Official | Version: 4.2.0-preview1
- https://spark.apache.org/news/
apache / spark
Apache Spark - A unified analytics engine for large-scale data processing
Language: Scala | +8 today
- https://github.com/apache/spark
apache / datafusion-comet
Apache DataFusion Comet Spark Accelerator
Language: Scala | +1 today
- https://github.com/apache/datafusion-comet
apache / gluten
Gluten is a middle layer responsible for offloading JVM-based SQL engines’ execution to native engines
Language: Scala | +1 today
- https://github.com/apache/gluten
Real-Time Stream Processing (Kafka / Flink)
1. Apache Kafka 4.2.0 Release — Server-Side Rebalance GA, Dead Letter Queue Support
- Source: Confluent Blog | Version: 4.2.0
- https://www.confluent.io/blog/apache-kafka-4-2-release/
2. Confluent Cloud Q1 2026 — A2A Integration for Streaming Agents
- Source: Confluent Blog | Features: A2A, Flink Agents
- https://www.confluent.io/blog/2026-q1-confluent-cloud-launch/
3. Flink CDC 3.6.0 Release — Extends Flink 1.20.x and 2.2.x Support
- Source: Apache Flink Official | Version: 3.6.0
- https://flink.apache.org/
4. Event-Driven Architectures for AI Pipelines: Kafka + Flink Technical Deep Dive
- Source: DasRoot | Topics: event-driven, AI pipelines
- https://dasroot.net/posts/2026/03/event-driven-architectures-ai-pipelines-kafka-flink/
AutoMQ / automq
AutoMQ is a diskless Kafka® on S3. 10x Cost-Effective. No Cross-AZ Traffic Cost. Autoscale in seconds
Language: Java | +19 today
- https://github.com/AutoMQ/automq
Lakehouse (Iceberg / Hudi / Delta Lake)
1. Apache Iceberg v3 — Row Lineage, Deletion Vectors, VARIANT
- Source: Databricks Blog | Status: Public Preview
- https://www.databricks.com/blog/next-era-open-lakehouse-apache-icebergtm-v3-public-preview-databricks
2. Four Major Vendors (Databricks/Snowflake/BigQuery/AWS) Suddenly Line Up Behind Iceberg v3
- Source: Juejin | Date: 2026-04-14
- https://juejin.cn/post/7628144748706447366
3. Databricks Delta Sharing First-Class Support for Iceberg Format
- Source: Databricks Blog | Customers: SAP, Walmart, Atlassian, LSEG
- https://www.databricks.com/blog/announcing-first-class-support-iceberg-format-databricks-delta-sharing
4. Apache Iceberg Rust 0.9.0 and Python 0.11.0 Released
- Source: Apache Iceberg Official | Contributors: 50+ contributors, 28 first-timers
- https://iceberg.apache.org/blog/
5. Apache Doris + Paimon: A Faster Lakehouse for Web3 On-Chain Analytics
- Source: VeloDB Blog | Performance: ETL 5x faster than Spark, queries 2x faster than Trino
- https://www.velodb.io/blog/apache-doris-paimon-a-faster-lakehouse-for-web3-onchain-analytics
delta-io / delta
An open-source storage framework that enables building a Lakehouse architecture
Language: Scala | +2 today
- https://github.com/delta-io/delta
OLAP Engines (ClickHouse / Doris / StarRocks)
1. ClickHouse Raises $400M, Valued at $15B, Acquires Langfuse, Launches Postgres
- Source: SiliconAngle | Valuation: $15B
- https://siliconangle.com/2026/01/16/database-maker-clickhouse-raises-400m-acquires-ai-observability-startup-langfuse/
2. ClickHouse vs StarRocks 2026: Real-Time Analytics Database Comparison
- Source: TasrieIT | Topic: OLAP comparison
- https://tasrieit.com/blog/clickhouse-vs-starrocks-2026
3. Does ClickHouse Support UPDATEs? A 2026 Data Analysis
- Source: Dev.to | Reactions: 6
- https://dev.to/manveer_chawla_64a7283d5a/does-clickhouse-support-updates-a-2026-data-analysis-4m75
4. Mainstream Open-Source Data Warehouses in 2026, from ClickHouse to Doris: ClickHouse, Doris, and StarRocks Compared
- Source: Tencent Cloud | Versions: Doris v4.0, StarRocks storage-compute separation
- https://cloud.tencent.com/developer/article/2644046
Vector Databases (Milvus / Weaviate)
1. 2026 Vector Database Selection Guide: Deep Analysis of Qdrant, Pinecone, Milvus, Weaviate, and Chroma
- Source: Juejin | Topic: vector database selection
- https://juejin.cn/post/7629524163644981311
2. Vector Databases Are a Necessary Evil, Not a Silver Bullet
- Source: Juejin | Date: 2026-04-28
- https://juejin.cn/post/7633625691302084654
DataOps / Data Governance
1. Key Tools and Trends Shaping Data Engineering in 2026
- Source: B2BDaily | Date: 2026-05-05
- https://b2bdaily.com/it/key-tools-and-trends-shaping-data-engineering-in-2026/
2. 2026 Data Engineering: From ETL to Full Autonomy
- Source: Juejin | Topic: AI-autonomous data engineering
- https://juejin.cn/post/7585458459165753378
3. What Is Apache Polaris? Why Open Data Catalogs Matter
- Source: Dev.to (AWS Builders) | Reactions: 6
- https://dev.to/aws-builders/what-is-apache-polaris-why-open-data-catalogs-matter-and-how-to-use-them-with-aws-5gal
4. 6 Best DataOps Tools Compared & Reviewed for 2026
- Source: Airbyte | Tools: Orchestra, Monte Carlo, Great Expectations, DataHub, Collibra
- https://airbyte.com/top-etl-tools-for-sources/best-dataops-tools-compared-reviewed
Open-Source Projects & Tools
cocoindex-io / cocoindex
Incremental engine for long horizon agents
Language: Python | +166 today
- https://github.com/cocoindex-io/cocoindex
sansan0 / TrendRadar
AI-driven public opinion & trend monitor with multi-platform aggregation, RSS, and smart alerts
Language: Python | +288 today
- https://github.com/sansan0/TrendRadar
Bypassing the Python GIL: How I Processed 10M Rows in 0.26s with C
- Source: Dev.to | Tags: python, cpp, performance
- https://dev.to/nareshcn2/bypassing-the-python-gil-how-i-processed-10m-rows-in-026s-with-c-5apa
I built a DuckDB extension to handle chemistry data without pandas or RDKit
- Source: Dev.to | Tags: rust, duckdb
- https://dev.to/nk_maker/i-built-a-duckdb-extension-to-handle-chemistry-data-without-pandas-or-rdkit-mjk
From Transactions to Insights: How OLTP and OLAP Work Together in Modern Data Pipelines
- Source: Dev.to | Tags: datawarehousing, dataengineering
- https://dev.to/byrone_code/from-transactions-to-insights-how-oltp-and-olap-work-together-in-modern-data-pipelines-1ien
The Fairwater Paradox: Microsoft Built a Monster That Needs 900TB/Second of USEFUL Data
- Source: Dev.to | Tags: aiinfrastructure, cloudcomputing
- https://dev.to/david_aronchick_ea415de50/the-fairwater-paradox-microsoft-built-a-monster-that-needs-900tbsecond-of-useful-data-4hg6
Deploying Apache Superset on Azure From Scratch
- Source: Dev.to | Tags: azure, superset
- https://dev.to/lfariaus/deploying-apache-superset-on-azure-from-scratch-my-ccf501-assessment-3-27c7
Bringing Sexy Back to Data Engineering: Automating BigQuery and Looker Sync
- Source: Dev.to | Tags: automation, googlecloud
- https://dev.to/gde/bringing-sexy-back-to-data-engineering-automating-bigquery-and-looker-sync-1n21
Editor's Picks
- Apache Iceberg v3 multi-vendor support: the open table format contest is largely settled, and Iceberg is becoming the de facto lakehouse standard https://www.databricks.com/blog/next-era-open-lakehouse-apache-icebergtm-v3-public-preview-databricks
- ClickHouse $15B valuation and Langfuse acquisition: a full transition from analytics database to AI infrastructure platform https://siliconangle.com/2026/01/16/database-maker-clickhouse-raises-400m-acquires-ai-observability-startup-langfuse/
- Snowflake Cortex Code CLI supports dbt and Airflow: AI programming tools integrate deeply into the data engineering vertical https://techintelpro.com/news/ai/enterprise-ai/snowflake-cortex-code-cli-adds-dbt-airflow-support