Staying up to date with Apache Spark, thanks to Universal Spark
By Thibaut Gourdel & Visaya Saignasith
Apache Spark has been around for more than a decade and quickly became the de facto open-source framework for large-scale data processing and advanced analytics. Today Apache Spark is used to process large volumes of data for all kinds of big data and machine learning tasks, from developing new products to detecting fraud.
But even with all that power, it can be difficult to make the most of the ever-changing Apache Spark technology. That’s why we are so excited to announce Universal Spark, Talend’s answer to keeping pace with one of the world’s most popular open-source solutions.
The A-B-Cs of Apache Spark
One of the key capabilities of Apache Spark is its ability to distribute processing tasks across clusters of machines, making it possible to significantly scale advanced analytics efforts. It’s in this context that Talend has integrated the Apache Spark core libraries, making it possible for our customers to run large-scale ETL use cases while still allowing for various deployment options.
Apache Spark can run in local mode, or as a small standalone cluster, on a single machine of your choice. This setup is appropriate for limited processing tasks and testing purposes. For larger volumes and production workloads, however, Spark jobs are more likely to be deployed on on-prem clusters or managed services such as Cloudera, Amazon EMR, Google Dataproc, Azure Synapse, and Databricks. These vendors provide data platform products that combine open-source and proprietary technologies to streamline cluster management, orchestration, and job deployment on Spark clusters, removing much of the complexity and cost of managing such infrastructure.
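From the job's point of view, the difference between a single-machine run and a cluster run largely comes down to the master URL handed to Spark. The sketch below is illustrative only: it builds `spark-submit` argument lists for a few targets. The helper `spark_submit_args`, the script name `etl_job.py`, and the host name are hypothetical, and managed platforms usually layer their own submission tooling on top, which is not shown here.

```python
from typing import List, Optional


def spark_submit_args(app: str, master: str,
                      deploy_mode: Optional[str] = None) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper)."""
    args = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        args += ["--deploy-mode", deploy_mode]
    args.append(app)
    return args


# Single machine: local mode, using all available cores.
local_cmd = spark_submit_args("etl_job.py", "local[*]")

# Spark's built-in standalone cluster manager; the host is a placeholder.
standalone_cmd = spark_submit_args(
    "etl_job.py", "spark://spark-master.example.com:7077"
)

# YARN-managed cluster, with the driver running inside the cluster.
yarn_cmd = spark_submit_args("etl_job.py", "yarn", deploy_mode="cluster")

print(" ".join(local_cmd))
```

The job script itself stays the same across all three commands; only the submission arguments change.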
As one of the key distributed processing frameworks, Apache Spark is backed by a strong open-source community and new releases are frequently introduced. Apache Spark has continually expanded its footprint over the years, adding streaming data processing, machine learning, graph processing, and support for SQL among other features.
How can you keep up?
The cadence of Apache Spark releases can be challenging for data teams and vendors alike. The pain of misalignment between different data vendors, data platforms, and data teams can be very real, slowing down new initiatives and delaying the desired business outcomes.
To address this challenge, Talend 8 has introduced a new Universal Spark capability. Universal Spark’s benefits are twofold:
- First, this mechanism allows Talend to be generically compatible with all the releases of the same big data platform distribution and major Spark version.
- Second, because of this standardization, new Spark releases are more easily integrated with the Talend platform, speeding up support for new versions.
Data teams will benefit from the flexibility of developing Spark jobs once and deploying them on any data platform, which accelerates cloud migration efforts. Universal Spark also gives them faster access to the latest Spark improvements and upgrades, unlocking more innovation.
Talend supports Universal Spark with Spark versions 3.0.x, 3.1.x, 3.2.x, and 3.3.x, providing a path to the latest Spark clusters for Databricks, Amazon EMR, Cloudera CDP, and Google Dataproc runtimes. Moving forward, Talend guarantees that it will stay up to date with the latest Apache Spark version on all major data platforms to support our customers in their data management strategy and modernization for advanced analytics.
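To make the "3.x.x" notation concrete, here is a minimal sketch of how a full Spark version string could be reduced to its version line and checked against the lines listed above. This is not Talend's implementation or API; the names `SUPPORTED_LINES`, `spark_version_line`, and `is_supported` are hypothetical and exist only for illustration.

```python
import re

# Spark version lines named in this article as supported by Universal Spark.
SUPPORTED_LINES = ("3.0.x", "3.1.x", "3.2.x", "3.3.x")


def spark_version_line(version: str) -> str:
    """Map a full version such as '3.2.1' to its line, '3.2.x'."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.|$)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Spark version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}.x"


def is_supported(version: str) -> bool:
    """Return True if the version falls within one of the listed lines."""
    return spark_version_line(version) in SUPPORTED_LINES


print(spark_version_line("3.2.1"))
print(is_supported("3.3.0"))
print(is_supported("2.4.8"))
```

Under this scheme, any patch release within a listed line is treated the same way, which mirrors the idea of being compatible with every release that shares a major Spark version.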