Dell Data Lakehouse Sparks Big Data with Apache Spark

Dell Data Lakehouse + Apache Spark: A path towards a unified platform to simplify big data processing and accelerate insights.

Generative AI is reshaping industries at an unprecedented pace, but data readiness often surfaces as one of the most significant barriers to deploying and scaling GenAI. In fact, a recent study shows that 53% of organizations face data quality and timeliness challenges when deploying AI at scale, and 48% also contend with data silos or data integration challenges¹. These challenges aren’t new, but with AI’s expanding needs, the complexity and scale of data preparation have grown dramatically.

  • Data complexity, consistency and volume: AI requires vast, varied datasets that need transformation from raw to structured formats, which can be resource intensive. Challenges like inconsistent schemas, missing values, and complex data types further complicate this, especially as data sources expand. 
  • Governance and security: Handling sensitive data, such as PII and financial records, necessitates strict governance and security compliance. Implementing access controls and encryption in hybrid or multi-cloud environments complicates data preparation.  
  • IT orchestration: Managing the lifecycle of processing engines, from optimizing resource allocation and managing job dependencies to upgrading to the latest, more secure software versions, requires careful orchestration by IT administrators to prevent slowdowns, especially as workloads grow.

To help combat these challenges, we announced the Dell Data Lakehouse earlier this year: a turnkey data platform that combines Dell’s AI-optimized hardware with a full-stack software suite, powered by Starburst and its enhanced Trino-based query engine. With it, you can eliminate data silos, unleash performance at scale and democratize insights.

In partnership with Starburst, we are continuing to push the envelope with innovative solutions to help you excel with AI through our Dell AI Factory approach. Building on those innovations, we are introducing a fully managed, deeply integrated Apache Spark engine within the Dell Data Lakehouse. This addition marks a major enhancement, embedding Spark’s industry-leading data processing capabilities directly into the platform. With Spark and Trino working together, the Dell Data Lakehouse offers unparalleled support for diverse analytics and AI-driven workloads, delivering speed, scale and innovation under one roof. You can implement the right engine for the right workload while still managing it all seamlessly through the same management console.

Spark Becomes Part of Dell Data Lakehouse: Unified Analytics in One Platform 

Here’s how the integrated Spark engine takes the Dell Data Lakehouse to a whole new level: 

  1. AI-Ready Data Preparation: Retrieval Augmented Generation (RAG) and fine-tuning require high-quality datasets to augment large language models. With Spark in the Dell Data Lakehouse, users can create both batch and streaming pipelines to extract, clean and normalize data from structured, semi-structured and unstructured sources—especially valuable for private, enterprise data. Coupled with file system metadata from Dell PowerScale, you can select the right dataset to generate embeddings, or to use in model fine-tuning. In the future, native AI functions will automate processing complex data types, like documents, images and audio. 
  2. Fully managed and secured: Spark runs directly inside the Lakehouse, integrated into the turnkey experience of the Dell Data Lakehouse with built-in security. Administrators won’t need to manage each piece of the stack separately, freeing time for innovation. 
  3. Smart resource management: With built-in resource isolation and auto-scaling, admins can tailor resources based on workload needs, ensuring governance across teams. Spark Connect also allows interactive work via notebooks. 
  4. One-Stop Access Control for All Data: The Dell Data Lakehouse enables unified access control across Trino and Spark, allowing users to set policies for structured and unstructured data. 
  5. Ready for open formats: Spark will work seamlessly with open formats like Iceberg via the built-in metastore. This will not only help modernize data from legacy formats to open formats but also ensure customers can use best-of-breed engines for ingestion, processing and querying. 
  6. Enterprise support and upgrades: Enterprise support and software updates will extend to Spark, simplifying procurement and offering a single support experience for the entire stack. 
  7. Hassle-free Setup: With white glove implementation, Dell specialists ensure smooth setup and configuration so that Spark and the Dell Data Lakehouse deliver immediate value. 
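To make the data preparation point above concrete, the sketch below shows the kind of per-record cleaning and normalization logic a Spark pipeline might apply before fine-tuning or embedding generation. It is a minimal illustration in plain Python: the record layout, field names and `normalize_record` helper are all hypothetical, not part of the Dell Data Lakehouse or Spark API.

```python
# Illustrative sketch of a per-record cleaning step, of the kind a Spark
# data-prep pipeline might apply. Field names and helpers are hypothetical.

def normalize_record(record: dict) -> dict:
    """Trim strings, fill missing values, and coerce types for one row."""
    cleaned = {}
    # Coerce identifiers to a consistent string form.
    cleaned["customer_id"] = str(record.get("customer_id", "")).strip()
    # Normalize free-text fields: strip whitespace, lowercase.
    cleaned["email"] = str(record.get("email", "")).strip().lower()
    # Handle missing numeric values with an explicit default.
    raw_amount = record.get("amount")
    cleaned["amount"] = float(raw_amount) if raw_amount is not None else 0.0
    return cleaned

# In a real pipeline, the same logic would run distributed, e.g. as a
# Spark UDF or a map over a DataFrame, rather than a local list comprehension.
rows = [
    {"customer_id": 42, "email": "  Alice@Example.com ", "amount": "19.5"},
    {"customer_id": "43", "email": "bob@example.com"},  # missing amount
]
cleaned_rows = [normalize_record(r) for r in rows]
```

In Spark, this function could be registered as a UDF or expressed as native `DataFrame` column operations so it scales across the cluster; the local version above only sketches the transformation itself.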

Trino and Spark: Why the Combination Matters 

With Spark and Trino in one platform, you can gain the flexibility to use the right engine based on workload type—whether it’s Spark for complex data processing or Trino for fast SQL querying, either on top of a lake or even across distributed data sources without the need to move data.  

Whether you are preparing data for AI or ML workloads, transforming TB- or PB-scale datasets to power analytics like Customer 360, or serving up reports and dashboards, you can do it all without navigating between different systems. 
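The "right engine for the right workload" idea above can be sketched as a simple routing rule: heavy transformation jobs go to Spark, while interactive and federated SQL goes to Trino. The workload categories and the `choose_engine` function below are purely illustrative paraphrases of the text, not part of any Dell Data Lakehouse interface.

```python
# Illustrative sketch of engine selection by workload type.
# The category names are hypothetical examples, not a platform API.

SPARK_WORKLOADS = {"batch_etl", "streaming_pipeline", "ml_feature_prep"}
TRINO_WORKLOADS = {"interactive_sql", "federated_query", "dashboard"}

def choose_engine(workload: str) -> str:
    """Pick the engine suited to a workload type."""
    if workload in SPARK_WORKLOADS:
        return "spark"  # complex, large-scale data processing
    if workload in TRINO_WORKLOADS:
        return "trino"  # fast SQL, including across distributed sources
    raise ValueError(f"unknown workload type: {workload}")
```

Because both engines share the same lakehouse storage and access controls, this choice is about performance characteristics, not data movement.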

The Bottom Line: Driving Innovation with Trino and Spark 

This milestone doesn’t just enhance our platform—it lays the groundwork for future innovations that will keep you at the forefront of AI advancements. As businesses continue to operate at the speed of data, Dell Data Lakehouse can equip your data teams to manage even the most demanding workloads. 

Our teams are working hard to deliver this capability sometime in early 2025. Contact your Dell account executive to explore the Dell Data Lakehouse for your data needs. And check out this blog to find out more about the latest release of the Dell Data Lakehouse. 

¹ MIT Technology Review Insights, “Data Strategies for AI Leaders,” 2024.

About the Author: Chad Dunn

Chad Dunn is a seasoned technology executive with over 17 years of experience at Dell Technologies, where he currently serves as the Vice President of Product Management for Artificial Intelligence and Data Management. In this role, Chad leads a dynamic team responsible for defining and delivering cutting-edge AI solutions that cater to a wide range of applications, including generative AI, model training, digital assistants, content and code generation, and data virtualization. His leadership in this domain is helping drive innovation in AI across Dell's global customer base.

Previously, Chad held the position of Vice President of Product Management for Dell APEX, where he played a pivotal role in transforming Dell’s portfolio through cloud-like consumption models, subscription services, and as-a-service offerings. He also drove the development of the APEX Console, enabling customers to seamlessly manage their infrastructure across multi-hybrid cloud environments.

Earlier in his career, Chad led the product management efforts for Dell’s Hyperconverged Infrastructure (HCI), Converged Infrastructure (CI), and Software Defined Storage (SDS) product lines. Under his leadership, these product lines grew to an impressive $4B run rate, with flagship offerings like VxRail, PowerFlex, Microsoft Azure Stack Hub, and VxBlock. Prior to joining Dell, Chad held various senior roles at innovative companies such as TAZZ Networks, Invento Networks, WaveSmith Networks, and Ciena, where he specialized in product marketing, product management, and driving early-stage technology solutions.

Chad is known for his strategic vision, deep expertise in AI, cloud computing, and infrastructure technologies, and his ability to guide products from concept to market success. He is based in Boston, Massachusetts, where he continues to push the boundaries of what technology can achieve in today’s rapidly evolving landscape.