mborlo15
Contributor III

Test Cases and Sensitive Data

Context: We are looking to implement test cases for our Talend jobs.

 

Concern: I noticed in the documentation that once a file is selected to be a part of the test case, it will be saved into a repository.

 

The data we process is often sensitive in nature, so our tests may often require the use of sensitive data.

 

Question: Our Talend solution is currently synced with a Git repository, and we avoid putting sensitive information there. Will the sensitive files that we use for test cases be synced with our Git repo?

Talend Product: Data Integration 7.3



Accepted Solutions
Anonymous
Not applicable

You shouldn't be using real data for testing unless it has been anonymised and all sensitive content removed/encrypted. Talend Studio can help in preparing this. The data you use with these tests (which are essentially unit tests) will be stored in your git repo.


5 Replies
mborlo15
Contributor III
Author

Hi @Richard Hall, thank you for your response. A follow-up question for you:

You mentioned that Talend Studio can assist with preparing test data. Can you elaborate on that?

Anonymous
Not applicable

There are several ways in which this can be done. An easy way, using pre-built components, can be seen below. What I am doing here is encrypting values (String, double, int and boolean), then decrypting them. I show the values at each stage.

 

[Image: screenshot of the job design, with the components numbered 1 to 7]

  1. Here I am simply setting a fixed row of data. This can be seen in the output I will show of the job running. The values are String, double, int and boolean. These were just shown as an example.
  2. This just outputs the initial values.
  3. The tDataEncrypt component encrypts the columns you specify. I have chosen all of them in this example.
  4. I am outputting the result of the encryption.
  5. Here I am decrypting the values. It should be noted that encrypted values will always be output as a String, so when they are decrypted they are also output as a String.
  6. The tConvertType is used to convert them back to their original types.
  7. Again, I am outputting the values. This time they are decrypted and back to their original types.

 

Below is the output from each stage:

 

Initial values

Hello|2.09|234|true

 

Encrypted values

SzpQfaeOsUK1RnsJMQsJcBrWtNTI9tS0ybYz0rb0AiBuEaqH6Q==|MXvepZAWGQvRHhDuQc+sT/YcxtCxtymqVfekg/4HAd8r0crD|K3VAo7BIJ0k6gp+7vdJpTAh4S97cBdkGvwWVRExnRT15scA=|Tz1WYMAiUkh1MiJdkyF9kaSdMZpieoYHRH473A8Ubf6YzDlC

 

Decrypted values

Hello|2.09|234|true
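
For anyone who wants to see the same round trip outside of Studio, here is a minimal Java sketch of the idea (encrypt every column to a String, decrypt, then convert back to the original types). It is only an illustration under my own assumptions: the class name `EncryptRoundTrip`, the use of AES-GCM and the throwaway key are all hypothetical choices, and this is not the code that tDataEncrypt runs internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical illustration only; not the implementation of any Talend component.
final class EncryptRoundTrip {
    private static final int IV_BYTES = 12;   // IV size commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates a throwaway 128-bit AES key for the demo.
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts the text form of any value. The result is always a Base64 String.
    static String encrypt(Object value, SecretKey key) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] cipherText = cipher.doFinal(String.valueOf(value).getBytes(StandardCharsets.UTF_8));
            // Prepend the IV so that decrypt() can recover it.
            byte[] out = ByteBuffer.allocate(iv.length + cipherText.length).put(iv).put(cipherText).array();
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decrypts back to the text form. Type conversion is the caller's job.
    static String decrypt(String encoded, SecretKey key) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        Object[] row = {"Hello", 2.09, 234, true};

        String[] encrypted = new String[row.length];
        for (int i = 0; i < row.length; i++) {
            encrypted[i] = encrypt(row[i], key);
        }
        System.out.println(String.join("|", encrypted));

        // Everything comes back as a String, so convert to the original types
        // (the role tConvertType plays in the job).
        String text = decrypt(encrypted[0], key);
        double number = Double.parseDouble(decrypt(encrypted[1], key));
        int count = Integer.parseInt(decrypt(encrypted[2], key));
        boolean flag = Boolean.parseBoolean(decrypt(encrypted[3], key));
        System.out.println(text + "|" + number + "|" + count + "|" + flag);
    }
}
```

Because a fresh random IV is used for every value, the encrypted Strings differ on each run even when the input is the same.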

 

Since you are encrypting data to be tested, you may find that you only want to encrypt identifying data such as Strings. This may not be entirely suitable, but as I said, this is the easiest way of doing this. Another recommendation I have is to use data masking, which can be carried out in Data Preparation if you have it. This is described here:

 

https://help.talend.com/r/en-US/8.0/data-preparation-user-guide/masking-data

 

The last example is a bit more complicated to get your head around and will take more time to set up, but it is VERY good. It is called Bijective Data Masking. It is described below:

 

https://help.talend.com/r/en-US/8.0/data-privacy/data-masking-functions-in-masking-components

 

As a quick example of what this can do, it ensures that a given input value is always masked to the same output value. So, if an "A" is masked to "E", the next time an "A" needs to be masked, it will again be masked to "E". It works for numbers as well. This means you can keep the keys in your data consistent across rows and sources, so joins and lookups behave the same way on the masked data.

 

A quick example can be seen below:

 

[Image: screenshot of the job design, with the components numbered 1 to 4]

  1. Again, with this example I am outputting a fixed data set. This time I have 4 rows (2 different rows repeated to show the consistency).
  2. This just prints out the initial unmasked rows.
  3. The tDataMasking component does what it says on the tin 🙂 The config of this can be complicated, so this is how I set it up: [Image: tDataMasking configuration screenshot] It doesn't support booleans, so I have not done anything with those. I have just noticed that my column naming is pretty poor in this quick example, so here are the types (newColumn = String; newColumn1 = double; newColumn2 = int; newColumn3 = boolean). newColumn3 is not shown above, but its unchanged value is still passed through. [Image: tDataMasking Advanced tab screenshot] The Advanced tab needs some config for Bijective Data Masking. I have set a password and a "Seed for random generator". You will need to look at the documentation to decide how you set yours up.
  4. The final tLogRow simply outputs the masked values.

 

The results can be seen below. I've added notes to show which tLogRow each line is output by (tLogRow5 = unmasked, tLogRow6 = masked) and which rows should be the same (row type a or b):

 

Hello|2.873|23|true --> tLogRow5, row type a

Ozthm|2.679|19|true|false --> tLogRow6, row type a

Goodbye|233.232|876|false --> tLogRow5, row type b

Fdkdkpg|186.862|589|false|false --> tLogRow6, row type b

Hello|2.873|23|true --> tLogRow5, row type a

Ozthm|2.679|19|true|false --> tLogRow6, row type a

Goodbye|233.232|876|false --> tLogRow5, row type b

Fdkdkpg|186.862|589|false|false --> tLogRow6, row type b

 

You will see that row type a is always masked to the same values (compare its tLogRow5 and tLogRow6 lines each time it appears), as is row type b.
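
To make the consistency property a bit more concrete, here is a toy Java sketch. It is not the algorithm tDataMasking uses, and it is far too weak to protect real data; the class name `ConsistentMasker` and the seeded alphabet shuffle are my own assumptions. All it shows is that a one-to-one (bijective) mapping driven by a fixed seed gives the same masked value for the same input every time, and never maps two different inputs to the same output.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy illustration only: a seeded letter substitution, NOT Talend's masking algorithm
// and not suitable for protecting real data.
final class ConsistentMasker {
    // lower[i] is the replacement for the letter 'a' + i.
    private final char[] lower = new char[26];

    ConsistentMasker(long seed) {
        List<Character> letters = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            letters.add(c);
        }
        // A shuffle is a permutation of the alphabet, so the mapping is one-to-one.
        // The same seed always yields the same permutation.
        Collections.shuffle(letters, new Random(seed));
        for (int i = 0; i < 26; i++) {
            lower[i] = letters.get(i);
        }
    }

    // Replaces each ASCII letter, preserving case; all other characters pass through unchanged.
    String mask(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                out.append(lower[c - 'a']);
            } else if (c >= 'A' && c <= 'Z') {
                out.append(Character.toUpperCase(lower[c - 'A']));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ConsistentMasker masker = new ConsistentMasker(42L);
        // Two different values, each repeated, as in the job above.
        String[] values = {"Hello", "Goodbye", "Hello", "Goodbye"};
        for (String value : values) {
            System.out.println(value + " --> " + masker.mask(value));
        }
    }
}
```

This sketch only handles letters. The real component also handles numbers and is configured with a password and seed, as described in step 3.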

 

That went on a bit longer than I expected, but hopefully it gives you a few ideas on how you can tackle this problem.

 

 

mborlo15
Contributor III
Author

Wow, thank you @Richard Hall, that was extremely thorough and helpful!

 

So, if we were going to use this method, we could store an encrypted version of the file in the test case repository, and then within the test case itself use a tDataDecrypt component and feed that flow to the components that need to be tested? I suppose the encryption/decryption key could be stored as a context variable in the test case?

Anonymous
Not applicable

Well, you could use that method. You would have to be careful with the keys, etc., but it would work. I would advise you to consider recreating a safe dataset or using the Bijective Data Masking to create something that holds together (if you have multiple data sources with key data) while not revealing any sensitive data.
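
On being careful with the keys: one common approach is to keep the key out of the repository entirely and supply it at run time. The Java sketch below shows that idea only. The class name `TestKeyLoader`, the environment variable name `TEST_DATA_KEY` and the Base64-encoded AES key format are all hypothetical, and how you would pass such a value into a Talend context variable depends on your own setup.

```java
import java.util.Base64;
import java.util.Map;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: reads a Base64-encoded AES key from the environment
// instead of from a file that is committed to Git.
final class TestKeyLoader {

    // Looks up `name` in `env` and turns the value into an AES key.
    // Fails fast if the variable is missing or the key has an invalid length.
    static SecretKey load(Map<String, String> env, String name) {
        String encoded = env.get(name);
        if (encoded == null || encoded.trim().isEmpty()) {
            throw new IllegalStateException("Missing key: set the " + name + " environment variable");
        }
        byte[] raw = Base64.getDecoder().decode(encoded.trim());
        if (raw.length != 16 && raw.length != 24 && raw.length != 32) {
            throw new IllegalStateException("AES keys must be 16, 24 or 32 bytes, got " + raw.length);
        }
        return new SecretKeySpec(raw, "AES");
    }

    public static void main(String[] args) {
        try {
            // TEST_DATA_KEY is an assumed name; use whatever your environment provides.
            SecretKey key = load(System.getenv(), "TEST_DATA_KEY");
            System.out.println("Loaded a " + (key.getEncoded().length * 8) + "-bit key");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a `Map` keeps the helper easy to exercise in a unit test without touching the real environment.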