Best Practices for Using Context Variables with Talend: Part 3
Hello and welcome to Part 3 of my best practices guide on context variables! Before I get started, I just want to inform you that this blog builds on concepts discussed in Part 1 and Part 2. Read those before you get started.
Let’s say that we decide that we want our Talend Jobs to be able to run on any environment after they have been compiled. We do not want to have to compile them again. We want to maintain our Context variable values in a database (which at design time we have no idea of its location) and we want to keep the database connections details hidden so that they cannot easily be found by someone who might get access to the servers. How can this be done? Will this make the framework incredibly complicated?
What if I said you can do this and keep the system incredibly dynamic and developer friendly, using nothing more than a couple of Operating System Environment Variables, a flat file, and a relatively simple Talend Routine? All you would need to do is configure the Environment Variables on the servers that jobs will be run on and place a flat file on those servers. After that, the jobs will automatically pick up the Context variable values when the jobs start, regardless of which environment they run on, and without any need to change the individual jobs. Would that be useful?
The Solution
The solution I will give to the above problem is one I have evolved over several years and several projects, where some or all of the requirements above (and sometimes more complicated ones) have had to be met.
The Context variable Table
The first thing you will need to do is to set up a database table to hold your context variable values. The schema that I generally use can be seen below (this was written for SQL Server):
CREATE TABLE [context_variables]( [id] [bigint] NOT NULL, [env] [varchar](255) NULL, [key] [varchar](255) NULL, [value] [varchar](255) NULL, [description] [varchar](255) NULL ) |
This is a bare-bones table. I’ve not added any primary keys (although “id” would be the one I’d use), indexes or any other potentially useful columns. You can do that and configure this as you wish. The important columns used in this example are “key”, “value” and “env”. “key” and “value” MUST be named this way. The Implicit Context Load will need the key column (the column holding the Context variable name) to be called “key” and the value column (the column holding the Context variable value) to be called “value”. Both of these columns are Varchars. The Implicit Context Load will implicitly cast (convert) the values into an object of the correct class. The “env” column I am using to demonstrate how you can have different environment’s Context variables in the same table if you wish. I will get to this later.
The Operating System Environment Variables
In order to enable this solution, every server that Talend jobs might be run on (including development environments) will require 2 Operating System Environment Variables; “FILEPATH” and “ENCRYPTIONKEY”. The “FILEPATH” variable will point to a flat file (.properties file) with the database connection settings in it (encrypted where required) and the “ENCRYPTIONKEY” variable holds the encryption/decryption key. These must be System Environment Variables and they must be set before the Talend components (Studio, Jobservers, etc) are started.
The Properties File
This is a simple flat file which holds the connection details of the database holding the Context variables. It is pointed to by the “FILEPATH” Operating System Environment Variable. The keys (variable names) I am using in this example are pretty generic and are loosely related to the required Implicit Context Load parameters. Since they are referenced elsewhere it makes sense to keep them and just change your values, but if you want to change them, just be sure to make sure you change any names I have hardcoded in the Routine (TalendContextEnvironment has been hardcoded, for example). The file format can be seen below:
TalendContextAdditionalParams=instance=TALEND_DEV TalendContextDbName=context_db TalendContextEnvironment=DEV TalendContextHost=MyDBHost TalendContextPassword=ENC(4mW0zXPwFQJu/S6zJw7MIJtHPnZCMAZB) TalendContextPort=1433 TalendContextUser=TalendUser |
The ImplicitContextUtils Routine
I have written a basic routine which allows the Implicit Context Load variables to be set to values supplied in your properties file (pointed to by the “FILEPATH” Operating System Variable). The routine can be seen below:
package routines; import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.IOException; import java.util.Properties; import org.jasypt.encryption.pbe.StandardPBEStringEncryptor; import org.jasypt.properties.EncryptableProperties; /*
*/ public class ImplicitContextUtils { //A static Properties class used to hold the context variables in memory after having been //read. private static Properties properties;
} |
There are 4 key parts to this routine which need some explaining.
The “getEnvironmentVariable” Method
This method is a public static method and is used solely to retrieve Operating System Environment Variables. We have 2 Operating System Environment Variables configured in this example; “FILEPATH” and “ENCRYPTIONKEY”. This method is used by the “getProperties” method to retrieve those values and use them to point to your database connection settings properties file.
The “getProperties” Method
This method is a private static method (mainly because it would not be expected for this method to be used on its own by something outside of this Routine) and is used to populate the Properties variable with decrypted property values in key/value pairs. It uses the “getEnvironmentVariable” method to retrieve the Operating System Environment Variables we have set up (the names have been hardcoded within this routine, so MUST be called the same if you use this….or the hardcoded names need changing).
You will also notice that I am using some slightly more complicated JASYPT code here. This code uses the FILEPATH to read the properties file into an EncryptableProperties object. This decrypts those parameters which need decrypting and makes them available via the Properties variable.
The “properties” Variable
The “properties” variable is a private static variable used to keep the database connection properties stored in memory so that the file only needs to be read once per job.
The “getImplicitContextParameterValue” Method
This method is a public static method used to retrieve a value by key from the “properties” variable if it is populated or to call the “getProperties” method and then retrieve the value by key from the “properties” variable. This method is added to each of the Implicit Context Load parameter boxes, with the name of the Context variable supplied as the key with the root PID (Process ID) and the Job PID (Process ID).
The reason for the Root PID and the Job PID is that we do not want to retrieve Context variables in child jobs. If we are passing our Context variables from the Parent Job to the Child Jobs, we may want to keep any changes which may have taken place along the way. To allow this, I have put a little “hack” into the code above. The Implicit Context Load functionality has a “Query Condition” parameter.
I used this to filter by our “env” (environment) column in our context_variables table. The property in the properties file for this is called “TalendContextEnvironment”. Again, I have hardcoded this into my Routine since it will be consistent throughout my entire project, but you can change this if you want. Now when the value “TalendContextEnvironment” is passed into the “getImplicitContextParameterValue” method with the same rootPID and jobPID, the method will know that this is the Parent Job (rootPID and jobPID will only be the same for the Parent Job) and that it needs to retrieve the value for the “Query Condition”. In this case, it will supply a String representing a WHERE CLAUSE for the table we built earlier. Something like below:
env='DEV' |
This will allow the Implicit Context Load to query the database against the correct “env” value. But if the rootPID and jobPID are different, this means that we are dealing with a child job. In which case we do not want ANY Context variables returned. In this case, the method would return:
env='DEV' AND 1=0 |
For the “TalendContextEnvironment” property. You will notice the addition of “0=1”. This is ALWAYS false and as such, no Context variables are returned. The Job will therefore accept the Context values from the Parent Job.
Now you may not wish to do this, in which case you can either modify the code or (more easily) just add the same hardcoded String to both the rootPID and jobPID parameters.
Hooking it all together
Now you should have your Operating System Environment Variables, your Properties file, your Talend Routine, and your Context variable database table. If you have all of these set up, you just need to configure your Implicit Context Load. Now you can do this per job, but it makes much more sense to do it for your whole project. To do it for your whole project go to “File” in your Studio and select “Edit Project Properties”. The following screen should pop-up.
I have partially configured this in the screenshot above. Notice that I have ticked the “Implicit tContextLoad” box to reveal “From File” and “From Database”. I have also selected “From Database” and set the “Property Type”, “Db Type” and “Db Version” for SQL Server (set yours to whichever database type you are using). These values cannot be dynamic, unfortunately.
To configure the rest of the parameters you can use the values below (tweaked to your configuration if you have made changes). If you want to force the Implicit Context Load to run for child jobs then replace rootPid and pid with “” and “”.
Note: rootPid and pid are variables used internally by ALL Talend jobs. They must be used exactly as shown. rootPid clearly corresponds to rootPID, but pid is not so clear. This corresponds to jobPID.
Host:
routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextHost",rootPid,pid)
Port:
routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextPort",rootPid,pid)
DB Name:
routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextDbName",rootPid,pid)
Additional Parameters:
routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextAdditionalParams",rootPid,pid)
Password:
routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextPassword",rootPid,pid)
Note: this code is encrypted since it normally represents an unencrypted password. To add this simply click on the password ellipsis button and add the above code without quotes surrounding the text.
Table Name:
”context_variables”
NB: At this point, I noticed that “table name” had not been configured in the Properties File example I put together above. I realized that in my eagerness to get this final blog post of the series out, I had made a mistake. I had hardcoded it in the Implicit Context Load settings. I decided to leave it as hardcoded to show that both Routine methods and hardcoded values can be used side by side in the Implicit Context Load settings. This was not in any way left because I didn’t want to revisit everything I had already written….honestly J. If you wish not to hard code this, simply add a variable name to the Properties file, for which a Context variable exists, which will be used to hold the table name. For example, something with the name of TalendContextTableName. Then all you need to do is replace the hard-coded value above with code like below:
routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextTableName",rootPid,pid)
Query Condition:
routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextEnvironment",rootPid,pid)
Running a Job
Once your Implicit Context Load settings are populated you can run the Job from the Studio, compile (build) the Job and run it from the command line (if the machine has the properties file and Operating System Environment Variables configured) or run it on a JobServer via TAC (again, so long as the properties file and Operating System Environment Variables are configured on the JobServer machine). You simply have to ensure that wherever the Job is run, that the server has the correct environment variables and Properties file on it.
At first glance, this might appear to be quite complicated, but once it is set up, this solution allows you to build your jobs without caring about environments. You will KNOW that your jobs will run against the environment that is configured for the machine that they are running on.
There you have it! That's close to everything you need to know about using Context Variables with Talend. We have one more part to finish the series!
← Part 2 | Part 4 →
Ready to get started with Talend?
More related articles
- What are Data Silos?
- What is Data Extraction? Definition and Examples
- What is Customer Data Integration (CDI)?
- Talend Job Design Patterns and Best Practices: Part 4
- Talend Job Design Patterns and Best Practices: Part 3
- What is Data Migration?
- What is Data Mapping?
- What is Database Integration?
- What is Data Integration?
- Understanding Data Migration: Strategy and Best Practices
- Talend Job Design Patterns and Best Practices: Part 2
- Talend Job Design Patterns and Best Practices: Part 1
- What is change data capture?
- Experience the magic of shuffling columns in Talend Dynamic Schema
- Day-in-the-Life of a Data Integration Developer: How to Build Your First Talend Job
- Overcoming Healthcare’s Data Integration Challenges
- An Informatica PowerCenter Developers’ Guide to Talend: Part 3
- An Informatica PowerCenter Developers’ Guide to Talend: Part 2
- 5 Data Integration Methods and Strategies
- An Informatica PowerCenter Developers' Guide to Talend: Part 1
- Best Practices for Using Context Variables with Talend: Part 2
- Best Practices for Using Context Variables with Talend: Part 4
- Best Practices for Using Context Variables with Talend: Part 1