Avoid Data Copies in a Data Lake

April 13, 2022

Event date: Thursday, April 14, 2022 @ 10 AM PT / 1 PM ET
Registration link: 5 Best Practices to Eliminate Costly Data Copies with a Data Lakehouse for SQL
Speakers: Scott Gay, Solution Architect, and Preeti Kodikal, Director of Product Marketing
Organized by: Dremio

Modern businesses heavily rely on BI tools, dashboards, and many other data-mining tasks. However, running queries on large data lakes can be extremely time-consuming.

This long query-processing time is mainly due to how data lakes store and retrieve data. Data is compressed as much as possible in data warehouses and lakes, and the compute layer stays idle until a query wakes it up.

Thus, data engineers naturally resort to creating copies that speed up specific queries.

A few common motivations for creating such copies are:

  • Performance-optimized (e.g., uncompressed) copies for fast retrieval;
  • Personalized extracts that limit the search space for specific users;
  • Exact copies that back BI dashboard views; and
  • Optimized datasets for data mining and machine-learning model training.

Yet such copies create data redundancy and partially defeat the true purpose of data warehouses and data lakes.
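To see why the first kind of copy is so tempting, here is a minimal stdlib-only sketch (not Dremio's approach, and deliberately toy-scale) of the tradeoff: a compressed store saves space but pays a decompression cost on every query, while an uncompressed copy answers the same query faster at the price of duplicated data.

```python
import gzip
import json
import time

# Hypothetical "data lake" records, compressed at rest to save storage.
records = [{"id": i, "value": i * 2} for i in range(100_000)]
raw = json.dumps(records).encode()
compressed = gzip.compress(raw)  # what the lake stores

# Query path 1: decompress and parse on every read (the lake's default).
t0 = time.perf_counter()
hit_lake = [r for r in json.loads(gzip.decompress(compressed)) if r["id"] == 42]
lake_time = time.perf_counter() - t0

# Query path 2: a performance-optimized, uncompressed in-memory copy --
# the redundant copy data engineers create for hot queries.
copy = json.loads(raw)
t0 = time.perf_counter()
hit_copy = [r for r in copy if r["id"] == 42]
copy_time = time.perf_counter() - t0

print(f"compressed store: {len(compressed):,} bytes, query took {lake_time:.4f}s")
print(f"uncompressed copy: query took {copy_time:.4f}s (faster, but duplicates data)")
assert hit_lake == hit_copy  # both paths return the same answer
```

The copy wins on latency only because the decompress-and-parse work is done once up front; the webinar's premise is that a lakehouse query engine can close that gap without materializing the copy at all.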

In this webinar, Scott Gay and Preeti Kodikal from Dremio will discuss how to eliminate such copies and query the data lake itself without compromising performance.

How we work

Readers support The Analytics Club. We earn through display ads. Also, when you buy something we recommend, we may get an affiliate commission. But it never affects your price or what we pick.
