Delta Lake, which has long been a proprietary part of Databricks’ offering, is already in production use by companies like Viacom, Edmunds, Riot Games and McGraw Hill.
The tool provides the ability to enforce specific schemas (which can be changed as necessary), to create snapshots and to ingest streaming data or backfill the lake as a batch job. Delta Lake also uses the Spark engine to handle the metadata of the data lake (which by itself is often a big data problem). Over time, Databricks also plans to add an audit trail, among other things.
“Today nearly every company has a data lake they are trying to gain insights from, but data lakes have proven to lack data reliability. Delta Lake has eliminated these challenges for hundreds of enterprises. By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into Delta Lakes,” said Ali Ghodsi, co-founder and CEO at Databricks.
What’s important to note here is that Delta Lake runs on top of existing data lakes and is compatible with the Apache Spark APIs.
The company is still looking at how the project will be governed in the future. “We are still exploring different models of open source project governance, but the GitHub model is well understood and presents a good trade-off between the ability to accept contributions and governance overhead,” Ghodsi said. “One thing we know for sure is we want to foster a vibrant community, as we see this as a critical piece of technology for increasing data reliability on data lakes. This is why we chose to go with a permissive open source license model: Apache License v2, same license that Apache Spark uses.”
To invite this community, Databricks plans to take outside contributions, just like the Spark project.
“We want Delta Lake technology to be used everywhere on-prem and in the cloud by small and large enterprises,” said Ghodsi. “This approach is the fastest way to build something that can become a standard by having the community provide direction and contribute to the development efforts.” That’s also why the company decided against a Commons Clause license, which some open-source companies now use to prevent others (and especially large clouds) from using their open source tools in their own commercial SaaS offerings. “We believe the Commons Clause license is restrictive and will discourage adoption. Our primary goal with Delta Lake is to drive adoption on-prem as well as in the cloud.”
Databricks is launching Delta Lake as an open source project, which Databricks CEO and cofounder Ali Ghodsi calls the company’s biggest innovation to date, bigger even than its creation of the Apache Spark data processing engine. Delta Lake is a storage layer that sits on top of data lakes to ensure reliable data sources for machine learning and other data science-driven pursuits.
The announcement was made today at the Spark + AI Summit in San Francisco and follows Databricks’ $250 million funding round in February, bringing the company’s valuation to $2.75 billion.
Data lakes are a way to pool data and break down data silos and have grown in popularity with the rise of big data and machine learning.
While they were students at the University of California, Berkeley, Databricks’ cofounders created the Apache Spark data processing engine. The Apache Software Foundation took over control of the project in 2013. Delta Lake is compatible with Apache Spark and MLflow, Databricks’ other open source project, which debuted last year.
“Delta Lake looks at all the data that’s coming in and makes sure that this data adheres to the schema that you’ve specified. That way, any data that makes it into the Delta Lake will be correct and reliable,” Ghodsi said. “It adds full-blown ACID transaction to any operation you do on your Delta Lake, so operations on Delta Lake are always correct [and] you can never run into sort of partial errors or leftover data.”
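The behavior Ghodsi describes, checking incoming data against a declared schema and making each write all-or-nothing, can be illustrated with a toy model. The sketch below is not Delta Lake's API or implementation; `ToyTable`, `SchemaMismatchError` and the column names are hypothetical and exist only to show the idea in plain Python.

```python
from dataclasses import dataclass, field


class SchemaMismatchError(ValueError):
    """Raised when a row does not match the table's declared schema."""


@dataclass
class ToyTable:
    # Hypothetical in-memory table: maps column name -> expected Python type.
    schema: dict
    rows: list = field(default_factory=list)

    def _validate(self, row):
        # The row must have exactly the declared columns.
        if set(row) != set(self.schema):
            raise SchemaMismatchError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        # Each value must have the declared type.
        for column, expected_type in self.schema.items():
            if not isinstance(row[column], expected_type):
                raise SchemaMismatchError(
                    f"column {column!r} expected {expected_type.__name__}, "
                    f"got {type(row[column]).__name__}"
                )

    def append(self, batch):
        # Validate the whole batch before touching the table, so a single
        # bad row rejects the entire write and leaves no partial data behind.
        batch = list(batch)
        for row in batch:
            self._validate(row)
        self.rows.extend(batch)


table = ToyTable(schema={"user_id": int, "event": str})
table.append([{"user_id": 1, "event": "login"}])

try:
    # The second row has a string where an int is expected.
    table.append([
        {"user_id": 2, "event": "logout"},
        {"user_id": "3", "event": "login"},
    ])
except SchemaMismatchError as error:
    print("write rejected:", error)

print(len(table.rows))  # the rejected batch left the table unchanged
```

In this toy version the rejected batch contributes nothing to the table, which is the "no partial errors or leftover data" property in miniature. A real system has to provide the same guarantee across files and concurrent writers, which is a much harder problem.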
Delta Lake can operate in the cloud, on on-premises servers, or on devices like laptops. It can handle both batch and streaming sources of data.
“It lets you now mix batch and streaming data in ways that have been impossible in the past. In particular, you can have one table that you have streaming updates coming into, and you can have multiple concurrent readers that are reading it in streaming or batch. And all of this will just work because of the transaction without any concurrency issues or corruption,” Ghodsi said.
A time travel feature will also allow users to access earlier versions of their data for audits or to reproduce MLflow machine learning experiments. The tool can handle Parquet files used to store large data sets.
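One common way to support this kind of time travel is to keep an append-only log of commits and rebuild any earlier version by replaying the log up to that point. The sketch below assumes that approach purely for illustration; `VersionedTable` and its methods are hypothetical names, not Delta Lake's actual interface or storage format.

```python
class VersionedTable:
    """Toy append-only table that can be read as of any earlier version."""

    def __init__(self):
        # Each entry holds the rows added by one commit; entries are never modified.
        self._commits = []

    def commit(self, rows):
        """Record a batch of rows and return the new version number."""
        self._commits.append(tuple(rows))
        return len(self._commits) - 1

    def read(self, version=None):
        """Return all rows visible at the given version (latest if omitted)."""
        latest = len(self._commits) - 1
        if version is None:
            version = latest
        elif not 0 <= version <= latest:
            raise ValueError(f"version {version} does not exist")
        rows = []
        # Replay every commit up to and including the requested version.
        for commit in self._commits[: version + 1]:
            rows.extend(commit)
        return rows


table = VersionedTable()
v0 = table.commit([{"id": 1, "score": 0.4}])
v1 = table.commit([{"id": 2, "score": 0.9}])

print(table.read(version=v0))  # only the rows from the first commit
print(table.read())            # rows from both commits
```

Because earlier commits are never rewritten, a reader asking for version `v0` sees the same rows no matter how many writes follow, which is what makes it possible to audit past states or rerun an experiment on the data it originally used.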
A proprietary version of Delta Lake was made available to some Databricks customers a year ago and is now used by more than 1,000 organizations. Early adopters of Delta Lake include Viacom, McGraw-Hill, and Riot Games.