Machine Learning Tools
How the Talend Platform Uses ML to Improve Data Integration
Staying competitive with big data applications and business intelligence in almost any industry requires big data pipelines that can process and analyze massive amounts of data in real time. Machine learning solutions integrated with Microsoft Azure and Apache Spark accelerate the development, and ease the maintenance, of these systems, but many of those machine learning solutions are complicated in and of themselves.
Talend helps reduce the complexities of machine learning (ML) by providing a comprehensive ecosystem of user-friendly, self-service tools and technologies that seamlessly integrate ML into your big data platform. With a lower skills barrier—no need for programmers proficient in complex R, Python, or Java—organizations get to data insights faster and at a lower cost.
Easy-to-use, out-of-the-box machine learning components mean data engineers can focus on big data and on building the distributed system, rather than having to learn how to build models. Data scientists can focus on what they do best: building models and creating algorithms. This division of labor lets each role work on the tasks it is best suited to, increasing efficiency and shortening development time.
Talend Machine Learning Use Cases
Talend Big Data technologies combined with machine learning components enable businesses to deploy results of the ML process quickly in order to solve pressing business problems. Banks, insurance companies, airlines, hotels, and many other organizations use machine learning. There is a use case for just about any industry and business need.
Paddy Power Betfair (PPB) is the world’s largest publicly quoted sports betting and gaming company, with five million customers worldwide. Using Talend Real-Time Big Data to integrate 70TB of data from multiple sources into an integrated cloud platform, they cut development time in half, significantly improving data agility and response times.
Out-of-the-Box Machine Learning Components
With the Talend toolset, machine learning components are ready to use off the shelf. This ready-made ML software allows data practitioners, no matter their level of experience, to work with algorithms easily, without needing to know how each algorithm works or how it was constructed. At the same time, experts can fine-tune those algorithms as desired.
Machine learning components are built into the Real-Time Big Data platform, allowing users to perform analytics without the need for hand coding. Talend machine learning algorithms are grouped into four areas based on how they work, each containing various ready-to-use ML components:
1. Classification Algorithms
Classification in machine learning is a data mining technique used to find patterns in large datasets. It uses a set of training data containing observations (instances) whose category membership is known, to identify which of a set of categories (sub-populations) a new observation belongs to.
There are two types of classification problems:
- Binary classification: There are only two possible outcomes.
- Multiclass classification: There are more than two possible outcomes.
Use cases for classification algorithms include spam detection, image categorization, and mining text for customer sentiment. The goal is to predict the class (sub-population), or label, of a new observation from known examples.
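As a rough illustration of binary classification applied to the spam-detection use case, the sketch below trains a tiny multinomial naive Bayes classifier in plain Python. It is not the implementation behind tNaiveBayesModel or any other Talend component; the function names and the four-message training set are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: two spam and two legitimate ("ham") messages.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

model = train_naive_bayes(TRAINING)
print(classify(model, "free prize money"))
print(classify(model, "team meeting monday"))
```

The training data plays the role of the observations whose category is known; the classifier then assigns a label to messages it has not seen before.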
Talend machine learning classification components include tClassify, tClassifySVM, tDecisionTreeModel, tGradientBoostedTreeModel, tLogisticRegressionModel, tNaiveBayesModel, tPredict, tRandomForestModel, and tSVMModel.
2. Clustering Algorithms
Cluster analysis (clustering) is a primary task of exploratory data mining, and a common technique used in statistical data analysis. It groups observations so that those in the same cluster are more similar to one another than to those in other clusters.
K-means clustering, for example, is a type of unsupervised learning. It is one of the simplest unsupervised learning algorithms, used to partition a given set of data into a specified number of clusters. Use cases for K-means include pricing segmentation, customer loyalty analysis, and fraud detection.
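To show the idea behind K-means, here is a minimal plain-Python sketch that alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its members. It is an illustration only, not the code behind tKMeansModel; the sample points are invented, and seeding the centroids with the first k points is a simplifying assumption made to keep the result deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical 2-D data with two visually obvious groups.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]

centroids, assignments = kmeans(POINTS, k=2)
print(centroids)
print(assignments)
```

No labels are supplied at any point, which is what makes the approach unsupervised: the groups emerge from the distances between the points alone.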
Talend machine learning clustering components include tKMeansModel, tPredict, and tPredictCluster.
3. Recommendation Algorithms
A recommendation algorithm, the core of what is also called a recommender system, is a subclass of information filtering that seeks to predict the rating or preference that a user would give to an item.
Collaborative filtering is one type of recommendation algorithm, and it can be user-based or item-based. The goal of both approaches is to automatically predict (i.e. filter) a user’s interest in an item based on preferences collected from many users or for many items (i.e. collaboration).
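A user-based variant can be sketched in a few lines of plain Python: a missing rating is estimated as the similarity-weighted average of the ratings other users gave to the same item. The ratings, user names, and function names below are hypothetical, and computing cosine similarity over only the items two users share is a simplification chosen for brevity.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
RATINGS = {
    "alice": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob": {"film_a": 4, "film_b": 5, "film_d": 5},
    "carol": {"film_a": 1, "film_c": 5, "film_d": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity of two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    similarity_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        similarity_total += similarity
    if similarity_total == 0:
        return None  # nobody comparable has rated this item
    return weighted_sum / similarity_total


prediction = predict_rating(RATINGS, "alice", "film_d")
print(round(prediction, 2))
```

Because alice's tastes resemble bob's far more than carol's, the estimate for film_d is pulled toward bob's high rating rather than carol's low one.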
Two types of Talend machine learning recommendation components are:
- tALSModel: This component processes a large amount of information, from its preceding Spark components, about users’ preferences for specific products. It performs Alternating Least Squares (ALS) computations over these sets of data in order to generate and write a fine-tuned product recommender model in Parquet format.
- tRecommend: This component analyzes data from its preceding Spark components using a recommender model to estimate user preferences. It is based on the user product recommender model generated by tALSModel, and recommends products to users known to the model.
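The Alternating Least Squares idea can be illustrated with a deliberately simplified sketch: each user and each item gets a single latent factor (rank 1), and the algorithm alternates between solving for the user factors with the item factors fixed and the reverse. This is not the tALSModel or Spark implementation, which work with higher-rank factors at scale; the function names, the regularization value, and the ratings below are assumptions made for the example.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=20):
    """Rank-1 ALS on (user, item, rating) triples; returns both factor lists."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; solve a regularized least-squares problem per user.
        for u in range(n_users):
            rated = [(i, r) for (uu, i, r) in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in rated)
            den = sum(item_f[i] ** 2 for i, r in rated) + reg
            user_f[u] = num / den
        # Fix user factors; solve the same kind of problem per item.
        for i in range(n_items):
            raters = [(u, r) for (u, ii, r) in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in raters)
            den = sum(user_f[u] ** 2 for u, r in raters) + reg
            item_f[i] = num / den
    return user_f, item_f


# Hypothetical observed ratings, built as (user strength) x (item appeal)
# with user strengths [2, 1, 2] and item appeals [1, 2, 1.5].
# Two cells are held out: (user 0, item 2) = 3.0 and (user 2, item 0) = 2.0.
RATINGS = [
    (0, 0, 2.0), (0, 1, 4.0),
    (1, 0, 1.0), (1, 1, 2.0), (1, 2, 1.5),
    (2, 1, 4.0), (2, 2, 3.0),
]

user_f, item_f = als_rank1(RATINGS, n_users=3, n_items=3)


def predict(user, item):
    """Estimated preference is the product of the two latent factors."""
    return user_f[user] * item_f[item]


print(round(predict(0, 2), 2))  # estimate for a held-out cell
print(round(predict(2, 0), 2))  # estimate for the other held-out cell
```

In this picture, training corresponds to what tALSModel does when it writes a model, and calling predict on unseen user and item pairs corresponds to what tRecommend does with that model.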
Recommendation system algorithms can be combined with deep learning techniques to make predictions from massive volumes of big data, as in the deep neural network recommendation engine that Google built for YouTube.
Talend machine learning recommendation components include tALSModel and tRecommend.
4. Regression Algorithms
Regression analysis is a statistical process for estimating the relationships among variables. It focuses on the relationship between a dependent variable and one or more independent variables, or “predictors.”
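For the simplest case of one dependent variable and one predictor, that relationship can be estimated with ordinary least squares. The sketch below uses the closed-form solution in plain Python; the function name and the sample numbers are hypothetical and are not tied to any Talend component.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: predictor values and the outcomes observed for them.
XS = [1.0, 2.0, 3.0, 4.0, 5.0]
YS = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(XS, YS)
print(slope, intercept)

# Use the fitted line to estimate the outcome for a new predictor value.
print(slope * 6.0 + intercept)
```

The slope expresses how much the dependent variable changes for each unit change in the predictor, which is the relationship regression sets out to estimate.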
To illustrate, the tModelEncoder component receives data from its preceding components, then applies a wide range of feature processing algorithms to transform columns of this data: word to vector, hashing, bucketization, etc. It then sends the result to the model training component—tLogisticRegressionModel or tKMeansModel—that follows, to eventually train and create a predictive model.
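Two of the feature processing steps named above, hashing and bucketization, are easy to picture with a small plain-Python sketch. It only illustrates the general techniques and is not how tModelEncoder is implemented; the function names, the choice of MD5 as the hash, the vector size, and the age boundaries are all assumptions made for the example.

```python
import bisect
import hashlib


def hash_features(tokens, num_buckets=8):
    """Map tokens to a fixed-length count vector using a stable hash."""
    vector = [0] * num_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_buckets
        vector[index] += 1  # different tokens may collide in one slot
    return vector


def bucketize(value, boundaries):
    """Return the index of the range a value falls into.

    boundaries must be sorted ascending; a value equal to a boundary
    is placed in the bucket that starts at that boundary.
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical text column turned into a numeric vector of length 8.
print(hash_features(["big", "data", "big", "pipeline"]))

# Hypothetical age column split into four ranges:
# below 18, 18 to 34, 35 to 64, and 65 or above.
AGE_BOUNDARIES = [18, 35, 65]
print([bucketize(age, AGE_BOUNDARIES) for age in (10, 18, 40, 70)])
```

Both transformations turn raw columns into numeric features of a fixed shape, which is the form a downstream model training step expects.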
Talend machine learning regression components include tModelEncoder, tLinearRegressionModel, and tPredict.
Getting Started with Talend Machine Learning
Talend machine learning leverages Apache Spark on Hadoop and Microsoft Azure for improved scale and performance. Spark allows you to use Talend ML components to process and analyze large datasets in real time. You can build a model quickly, then concentrate on the business outcome instead of the development process.