Context: We are looking to implement test cases for our Talend jobs.
Concern: I noticed in the documentation that once a file is selected to be a part of the test case, it will be saved into a repository.
The data we process is often sensitive in nature, so our tests may often require the use of sensitive data.
Question: Our Talend solution is currently synced with a Git repository, and we avoid putting sensitive information there. Will the sensitive files that we use for test cases be synced with our Git repo?
Talend Product: Data Integration 7.3
You shouldn't be using real data for testing unless it has been anonymised and all sensitive content removed/encrypted. Talend Studio can help in preparing this. The data you use with these tests (which are essentially unit tests) will be stored in your git repo.
Hi @Richard Hall thank you for your response. Follow up question for you:
You mentioned that Talend Studio can assist with preparing test data. Can you elaborate on that?
There are several ways in which this can be done. An easy way, using pre-built components, can be seen below. What I am doing here is encrypting values (String, double, int and boolean), then decrypting them. I show the values at each stage.
Below is the output from each stage:
Initial values
Hello|2.09|234|true
Encrypted values
SzpQfaeOsUK1RnsJMQsJcBrWtNTI9tS0ybYz0rb0AiBuEaqH6Q==|MXvepZAWGQvRHhDuQc+sT/YcxtCxtymqVfekg/4HAd8r0crD|K3VAo7BIJ0k6gp+7vdJpTAh4S97cBdkGvwWVRExnRT15scA=|Tz1WYMAiUkh1MiJdkyF9kaSdMZpieoYHRH473A8Ubf6YzDlC
Decrypted values
Hello|2.09|234|true
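For readers who want to see the same round trip outside of Studio, here is a minimal Java sketch. It is not the implementation behind the Talend encryption components; the class name `FieldCrypto`, the choice of AES-GCM and the key handling are all assumptions made for illustration. Each field is treated as a String, encrypted to Base64 text, and decrypted back.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper, not a Talend class: encrypts one field at a time.
public class FieldCrypto {
    private static final int IV_BYTES = 12;   // nonce size commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates a throwaway 128-bit AES key (assumption: real keys live elsewhere).
    public static SecretKey newKey() {
        try {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(128);
            return gen.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns Base64(iv || ciphertext || tag) for the given plain text.
    public static String encrypt(String plain, SecretKey key) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
            byte[] out = ByteBuffer.allocate(iv.length + ct.length).put(iv).put(ct).array();
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encrypt(): reads the IV from the first 12 bytes, then decrypts the rest.
    public static String decrypt(String encoded, SecretKey key) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] pt = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(pt, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        String[] fields = {"Hello", String.valueOf(2.09), String.valueOf(234), String.valueOf(true)};
        String[] encrypted = new String[fields.length];
        String[] decrypted = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            encrypted[i] = encrypt(fields[i], key);
            decrypted[i] = decrypt(encrypted[i], key);
        }
        System.out.println(String.join("|", fields));
        System.out.println(String.join("|", encrypted));
        System.out.println(String.join("|", decrypted));
    }
}
```

Because a fresh random IV is used for every call, encrypting the same value twice gives different ciphertext. That is good for confidentiality, but it means encrypted key columns will no longer join across files.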
Since you are encrypting data to be tested, you may find that you only want to encrypt identifying data such as Strings. This may not be entirely suitable, but as I said, this is the easiest way of doing this. Another recommendation I have is to use data masking, which can be carried out in Data Preparation if you have it. This is described here:
https://help.talend.com/r/en-US/8.0/data-preparation-user-guide/masking-data
The last example is a bit more complicated to get your head around and will take more time to set up, but it is VERY good. It is called Bijective Data Masking. It is described here:
https://help.talend.com/r/en-US/8.0/data-privacy/data-masking-functions-in-masking-components
As a quick example of what this can do, it masks a given value to the same masked value every time. So, if you mask an "A" to "E", the next time an "A" needs to be masked, it will be masked to "E" again. It works for numbers as well. This means you can keep your keys in your data, so your data will remain consistent in how it is processed.
A quick example can be seen below.
The results can be seen below. I've added notes to show you which tLogRow they are output by (5 or 6) and which rows should be the same (a and b):
Hello|2.873|23|true --> tLogRow5, row type a
Ozthm|2.679|19|true|false --> tLogRow6, row type a
Goodbye|233.232|876|false --> tLogRow5, row type b
Fdkdkpg|186.862|589|false|false --> tLogRow6, row type b
Hello|2.873|23|true --> tLogRow5, row type a
Ozthm|2.679|19|true|false --> tLogRow6, row type a
Goodbye|233.232|876|false --> tLogRow5, row type b
Fdkdkpg|186.862|589|false|false --> tLogRow6, row type b
You will see that every occurrence of row type a produces the same values in tLogRow5 and the same masked values in tLogRow6, as does every occurrence of row type b.
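To make the "same input, same masked output" idea concrete, here is a small Java sketch. It is not the algorithm Talend uses for Bijective Data Masking; it is a simple seeded character substitution, and the class name `ConsistentMasker` and the seed parameter are assumptions for illustration only. Each letter and digit is mapped through a fixed permutation, so the mapping is one-to-one and repeatable, while punctuation is left alone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of consistent, one-to-one masking (not Talend's algorithm).
public class ConsistentMasker {
    private final Map<Character, Character> table = new HashMap<>();

    // The same seed always builds the same substitution table.
    public ConsistentMasker(long seed) {
        Random rnd = new Random(seed);
        addPermutation("abcdefghijklmnopqrstuvwxyz", rnd);
        addPermutation("ABCDEFGHIJKLMNOPQRSTUVWXYZ", rnd);
        addPermutation("0123456789", rnd);
    }

    // Maps every character of the alphabet to a character of a shuffled copy,
    // which keeps letters as letters and digits as digits.
    private void addPermutation(String alphabet, Random rnd) {
        List<Character> shuffled = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            shuffled.add(c);
        }
        Collections.shuffle(shuffled, rnd);
        for (int i = 0; i < alphabet.length(); i++) {
            table.put(alphabet.charAt(i), shuffled.get(i));
        }
    }

    // Masks a value character by character; unknown characters pass through unchanged.
    public String mask(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            char masked = table.getOrDefault(c, c);
            sb.append(masked);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ConsistentMasker masker = new ConsistentMasker(42L); // seed is an arbitrary example
        String[] rows = {"Hello", "Goodbye", "Hello", "Goodbye"};
        for (String row : rows) {
            System.out.println(row + " --> " + masker.mask(row));
        }
    }
}
```

A plain substitution like this is easy to reverse for anyone with enough samples, so it only demonstrates the consistency property. For real sensitive data the masking components linked above are the safer choice.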
That went on a bit longer than I expected, but hopefully it gives you a few ideas on how you can tackle this problem.
Wow thank you @Richard Hall that was extremely thorough and helpful!
So, if I were going to use this method, we could store an encrypted version of the file in the test case repository and then within the test case itself we could use a tDecrypt component and feed that flow to the components needing to be tested? The encryption/decryption key I suppose could be stored as a context variable in the test case?
Well, you could use that method. You would have to be careful with the keys, but it would work. I would advise you to consider recreating a safe dataset or using Bijective Data Masking to create something that holds together (if you have multiple data sources with key data) while not revealing any sensitive data.
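On being careful with the keys: one option is to keep the key out of the repository entirely and hand it to the job at run time. The Java sketch below assumes the key is an AES key supplied as Base64 text. The class name `TestKeyLoader` and the environment variable `TEST_DATA_KEY` are hypothetical; in a Talend job the text could equally come from a context variable whose value is not committed.

```java
import java.util.Base64;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: turns externally supplied Base64 text into an AES key.
public class TestKeyLoader {

    // Decodes the Base64 text and checks that it is a valid AES key size.
    public static SecretKey fromBase64(String keyBase64) {
        byte[] raw = Base64.getDecoder().decode(keyBase64.trim());
        if (raw.length != 16 && raw.length != 24 && raw.length != 32) {
            throw new IllegalArgumentException(
                "AES keys must be 16, 24 or 32 bytes, got " + raw.length);
        }
        return new SecretKeySpec(raw, "AES");
    }

    public static void main(String[] args) {
        // TEST_DATA_KEY is an assumed variable name, set outside the repository.
        String keyText = System.getenv("TEST_DATA_KEY");
        if (keyText == null) {
            System.out.println("TEST_DATA_KEY is not set; no key loaded.");
            return;
        }
        SecretKey key = fromBase64(keyText);
        System.out.println("Loaded AES key of " + (key.getEncoded().length * 8) + " bits.");
    }
}
```

If the key itself ends up stored next to the encrypted test file in git, the encryption adds little protection, which is one reason a masked or synthetic dataset is the simpler route.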