Data and AI Summit 2022 announcements

Destination Lakehouse

Focus on the Lakehouse as the next-generation platform, merging the data lake and the data warehouse.

Simplify your architecture with one platform for all data and AI workloads, instead of maintaining a separate data warehouse for SQL and dashboarding.

Benchmarks were presented comparing data warehousing and lakehouse workloads.

But beyond benchmarks and costs, the lakehouse lets companies unlock advanced use cases that a traditional data warehouse cannot support.

Data Engineering

Delta Lake 2.0 

With the Delta Lake 2.0 release, all of Delta Lake is open source. The release includes the following features:

  • Support for Change Data Feed on Delta tables (see the sketch below)
  • Support for Z-Order clustering
  • Support for idempotent writes to Delta tables
  • Support for dropping columns in a Delta table as a metadata-only operation
  • Support for dynamic partition overwrite
  • Experimental support for multi-part checkpoints
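
A minimal PySpark sketch of two of these features, Change Data Feed and Z-Order clustering. The table name events, its columns, and the version range are hypothetical, and spark is the SparkSession provided by the Databricks runtime.

```python
# Enable the Change Data Feed on an existing Delta table (hypothetical name).
spark.sql("ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read row-level changes (inserts, updates, deletes) between two table versions.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .option("endingVersion", 5)
    .table("events")
)
changes.show()

# Z-Order clustering: co-locate related rows so selective queries skip more files.
spark.sql("OPTIMIZE events ZORDER BY (event_date, user_id)")
```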

For more information about Delta Lake 2.0, feel free to read the Databricks blog post.

For more information about Delta Lake 2.0, feel free to read the Linux Foundation blog post.

Project Lightspeed

Project Lightspeed is an initiative to advance Spark Structured Streaming. It aims to improve latency and make it predictable, enhance functionality for processing data, add new connectors, and improve ecosystem support for connectors.
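
For context, here is a minimal PySpark Structured Streaming job of the kind Project Lightspeed aims to speed up; the source, transformation, and sink are illustrative, and spark is the ambient SparkSession in a Databricks notebook.

```python
from pyspark.sql.functions import col

# Built-in "rate" test source: continuously emits rows of (timestamp, value).
stream = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

query = (
    stream.filter(col("value") % 2 == 0)   # a trivial transformation
    .writeStream.format("console")         # a real job would write to a Delta sink
    .trigger(processingTime="1 second")    # micro-batch trigger; latency is what Lightspeed targets
    .start()
)
query.awaitTermination()
```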

For more information about Project Lightspeed, feel free to read the blog post.

Spark Connect

Offering Apache Spark™ whenever and wherever it is needed: Spark Connect decouples the client and the server so Spark can be embedded everywhere, from application servers, IDEs, and notebooks to any programming language.
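
A minimal sketch of the thin-client model, using the Spark Connect API as it later shipped in Apache Spark 3.4 (at announcement time only the design was public); the server address is hypothetical.

```python
from pyspark.sql import SparkSession

# The thin client speaks gRPC to a remote Spark Connect server, so no JVM
# driver has to live inside the application process.
spark = (
    SparkSession.builder
    .remote("sc://localhost:15002")  # hypothetical Spark Connect endpoint
    .getOrCreate()
)

# DataFrame operations are built locally as unresolved plans and only
# executed when the remote server resolves and runs them.
spark.range(10).filter("id % 2 = 0").show()
```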

Photon 

Photon is the next-generation query engine on Databricks, written to be directly compatible with Apache Spark APIs. It will be generally available on Databricks Workspaces on AWS and Azure in the coming weeks, further expanding Photon's reach across the platform.

For more information about Photon, feel free to read the blog post.

Delta Live Tables 

An ETL framework that uses a simple declarative approach to build reliable data pipelines while automatically managing infrastructure at scale.
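
A minimal sketch of the declarative Python API; the table names, source path, and quality rule are illustrative.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested incrementally with Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/data/events/")                  # hypothetical landing path
    )

@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")  # declarative quality rule
def clean_events():
    return dlt.read_stream("raw_events").filter(col("event_type") != "test")
```

Delta Live Tables resolves the dependency graph between these table definitions and provisions the infrastructure needed to run it.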

What's new?

  • Advanced Auto Scaling (Link)
  • CDC with Slowly Changing Dimensions Type 2 (Link)
  • Enzyme ETL optimizer: a new optimization layer designed to speed up ETL processing

For more information about Delta Live Tables, feel free to read the blog post.

Databricks Workflows

A fully managed orchestration service that is deeply integrated with the Databricks Lakehouse Platform. It enables engineers to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure.

What's new?

  • Build reliable production data and ML pipelines with Git support
  • Run dbt projects in production
  • Orchestration of SQL tasks
  • Save time and money with repair and rerun
  • Easily share context between tasks (see the sketch below)
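
A minimal sketch of sharing context between tasks with task values; the task and key names are illustrative, and dbutils is provided by the Databricks runtime.

```python
# In an upstream task (e.g. a task named "ingest"):
dbutils.jobs.taskValues.set(key="row_count", value=42)

# In a downstream task of the same job run:
row_count = dbutils.jobs.taskValues.get(
    taskKey="ingest",   # name of the upstream task
    key="row_count",
    default=0,
    debugValue=0,       # used when running the notebook interactively
)
print(f"Upstream task ingested {row_count} rows")
```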

For more information about Databricks Workflows, feel free to read the blog post.

Data Sharing & Data Governance

Unity Catalog

Unity Catalog will soon be generally available. It is a unified governance solution for all the data assets in your lakehouse on any cloud, including files, tables, dashboards, and machine learning models.

What's new?

  • Automated data lineage for all workloads
  • Built-in data search and discovery
  • Simplified access controls with privilege inheritance (see the sketch below)
  • Information schema
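
A minimal sketch of privilege inheritance with hypothetical catalog, schema, and group names (privilege names follow current Unity Catalog SQL); spark is the ambient SparkSession.

```python
# USE CATALOG / USE SCHEMA gate access to the containers themselves.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data-analysts`")

# With privilege inheritance, a single SELECT grant on the schema applies to
# every current and future table inside it, with no per-table grants needed.
spark.sql("GRANT SELECT ON SCHEMA main.sales TO `data-analysts`")
```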

For more information about Unity Catalog, feel free to read the blog post.

Delta Sharing 

Delta Sharing will soon be generally available. It is an open protocol for the secure, real-time exchange of large datasets, enabling secure data sharing across products for the first time. We are developing Delta Sharing with partners among the top software and data providers in the world.
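
A minimal sketch of the consumer side, using the open source delta-sharing Python client (pip install delta-sharing); the profile path and the share, schema, and table names are illustrative.

```python
import delta_sharing

# A profile file holds the sharing server endpoint and a bearer token
# supplied by the data provider.
profile = "/path/to/config.share"

# List every table the provider has shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table directly into a pandas DataFrame.
df = delta_sharing.load_as_pandas(f"{profile}#my_share.my_schema.my_table")
print(df.head())
```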

For more information about Delta Sharing, feel free to read the blog post.

Databricks Marketplace

An open marketplace for exchanging data products such as datasets, notebooks, dashboards, and ML models. Providers can now commercialize new offerings and shorten sales cycles by providing value-added services on top of their data. The marketplace is powered by Delta Sharing.

For more information about Databricks Marketplace, feel free to read the blog post.

Databricks Cleanrooms

Databricks Cleanrooms provides a secure hosted environment in which organizations can join their data and perform analysis on the aggregated data. It allows organizations to meet collaborators on their preferred cloud and gives them the flexibility to run complex computations and workloads in any language: SQL, R, Scala, or Python.

For more information about Databricks Cleanrooms, feel free to read the blog post.

The Best Data Warehouse is a Lakehouse

Databricks SQL

Databricks SQL is a data warehouse on the Databricks Lakehouse Platform that lets you run all your SQL and BI applications at scale, with up to 12x better price/performance, a unified governance model, open formats and APIs, and your tools of choice, with no lock-in.

What's new?

  • Connect from everywhere: open source connectors for Go, Node.js, and Python to access the lakehouse from operational applications (see the sketch below)
  • Python user-defined functions
  • Materialized views: support for incrementally computed materialized views in Databricks SQL to accelerate end-user queries and reduce infrastructure costs with efficient, incremental computation
  • Query federation: the ability to query remote data sources such as PostgreSQL, MySQL, and Amazon Redshift without first extracting and loading the data from the source systems
  • Preview of Serverless on AWS: instant, elastic serverless SQL compute for low latency and high concurrency
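
A minimal sketch of the Python connector (pip install databricks-sql-connector); the hostname, HTTP path, and token are placeholders you would take from your own SQL endpoint.

```python
from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/endpoints/abcdef1234567890",        # placeholder
    access_token="dapi-your-token",                         # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1 AS probe")
        for row in cursor.fetchall():
            print(row)
```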

For more information about Connect from everywhere, feel free to read the blog post.

For more information about Serverless on AWS, feel free to read the blog post.

Machine Learning

MLflow 2.0

An open source platform developed by Databricks to help manage the complete machine learning lifecycle.
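
A minimal sketch of the core MLflow tracking loop the platform is built around; the run name, parameter, and metric are illustrative.

```python
import mlflow

with mlflow.start_run(run_name="demo"):
    mlflow.log_param("learning_rate", 0.01)   # record a hyperparameter
    mlflow.log_metric("rmse", 0.73)           # record an evaluation metric
    mlflow.set_tag("stage", "experiment")     # attach arbitrary metadata

# The run, with its params, metrics, and tags, is now visible in the MLflow UI.
```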

What's new?

  • Serverless Model Endpoints: improve upon existing Databricks-hosted model serving by offering horizontal scaling to thousands of QPS, potential cost savings through auto-scaling, and operational metrics for monitoring runtime performance.
  • Model Monitoring: enables users to understand whether a deployed model works well in production. The proposed solution sets up a framework for logging model inputs and predictions, then analyzing model and data quality trends over time.
  • MLflow Pipelines: enable data scientists to create production-grade ML pipelines that combine modular ML code with software engineering best practices, making model development and deployment fast and scalable.

For more information about MLflow Pipelines, feel free to read the blog post.

For more information about the ML announcements, feel free to read the blog post.

On-demand content

Did you miss a session? It's OK: recordings of the 250 sessions are available on demand on the Data + AI Summit platform through July 15.

Article written by Youssef Mrini