<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>article Talend Data Preparation Best Practices in Official Support Articles</title>
    <link>https://community.qlik.com/t5/Official-Support-Articles/Talend-Data-Preparation-Best-Practices/ta-p/2151657</link>
    <description>&lt;P&gt;Talend Data Preparation is a self-service application that enables you to simplify and expedite the time-consuming process of preparing data for analysis or other data-driven tasks.&lt;/P&gt;
&lt;P&gt;This article explains the best practices that Talend suggests you follow when working with Talend Data Preparation.&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data cataloging&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;While working with large datasets, various inputs, and large teams, it is important to classify datasets and preparations. Talend recommends the following best practices to categorize the artifacts.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Follow naming conventions&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Naming conventions vary by person and organization, but following them consistently makes it significantly easier for those who maintain the system later to understand what it does and how to fix or extend it for new business needs. While working with Data Preparation, the best practice is to follow the agreed naming standards for folders, preparations, datasets, and context variables.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Folders&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Use the following guidelines to name folders for Preparations:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Use camel case&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Separate with underscores&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Do not use whitespace&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use only alphanumeric characters&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid general folder names&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid short forms&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFfqAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122392i3170C8F2EEC8B272/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFfqAAE.jpg" alt="0693p000008uFfqAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
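&lt;P&gt;The rules above that can be checked mechanically can be sketched in code. The following Python example is only an illustration and is not part of Talend Data Preparation: the function name &lt;STRONG&gt;is_valid_folder_name&lt;/STRONG&gt;, the upper camel case interpretation, and the list of generic names are assumptions made for this sketch. Avoiding short forms still requires human review.&lt;/P&gt;
&lt;PRE&gt;
```python
import re

# Assumed pattern: camel-case words made of letters and digits,
# joined by single underscores, with no whitespace or other symbols.
FOLDER_NAME = re.compile(r"[A-Z][A-Za-z0-9]*(?:_[A-Z][A-Za-z0-9]*)*")

# Hypothetical examples of names considered too general to be useful.
GENERIC_NAMES = {"data", "test", "misc", "new", "temp"}


def is_valid_folder_name(name):
    """Return True if the name follows the folder naming guidelines sketched above."""
    if name.lower() in GENERIC_NAMES:
        return False
    return FOLDER_NAME.fullmatch(name) is not None


print(is_valid_folder_name("CreditCard_Defaulters"))
print(is_valid_folder_name("credit card"))
```
&lt;/PRE&gt;
&lt;P&gt;Under these assumptions, a name such as &lt;STRONG&gt;CreditCard_Defaulters&lt;/STRONG&gt; is accepted, while a name containing whitespace is rejected.&lt;/P&gt;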
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Preparations and datasets&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Preparations and datasets are typically local to a project, so you can set their naming conventions either globally at the organization level, or locally at the project level. Ensure that the naming conventions are strictly followed. As a guideline, names can include elements such as:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Extracted source name&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Dataset extraction date, as a prefix or suffix&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Business usage&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Rules applied&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjlAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123089i970B1ADD977C4454/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjlAAE.jpg" alt="0693p000008uFjlAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFcWAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/125189i63414D83BF2328F4/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFcWAAU.jpg" alt="0693p000008uFcWAAU.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Context variables&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Guidelines for using context variables while calling data preparations from Talend Data Integration or Big Data Jobs are:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Create additional contexts for project-specific requirements&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Limit the number of additional contexts you create to fewer than three per project&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Instead, opt for a common context group&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Context variables must be descriptive&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid one-character context variables, for example, &lt;STRONG&gt;a&lt;/STRONG&gt;, &lt;STRONG&gt;b&lt;/STRONG&gt;, &lt;STRONG&gt;c&lt;/STRONG&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid generic names like &lt;STRONG&gt;var1&lt;/STRONG&gt; or &lt;STRONG&gt;var2&lt;/STRONG&gt;&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjqAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122061i8A58CA4E2E2A2B31/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjqAAE.jpg" alt="0693p000008uFjqAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFauAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123114iBEB5033D99DC1AA3/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFauAAE.jpg" alt="0693p000008uFauAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Define and follow folder structures&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Use folder structures to group items of similar categories or behaviors. Because the right structure depends entirely on the individual project, Talend recommends that you define the folder structures in the project's initial phases. Figure 6 is an example of a folder structure followed in a bank, where the folders are divided by business module. Group datasets by:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Business modules&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Sources&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Rules applied&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Intake areas&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjvAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124865i2A12CE717561F7A7/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjvAAE.jpg" alt="0693p000008uFjvAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 6: A bank user following folder structures as business use case (CreditCard_Defaulters)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data discovery and profiling&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Data profiling and data discovery allow you to analyze and identify the relationships between your data. This section explains some of the best practices for discovering and profiling data.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Pick the right data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Picking the right data is about finding the data best suited for a specific purpose. The goal is not only to find the data you need right now, but also to make it easier to find data later, when similar needs arise. Best practices for picking the right data are:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Explore and find the data best suited for a specific purpose&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Avoid data with multiple nulls or same/repeated values&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Select values close to the source - avoid calculated or derived values&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid intermediate values&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Extract data across multiple platforms&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Determine data suitability (for example, discovery, reporting, monitoring, and decision making)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Filter data to select a subset that meets the rules and conditions&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Know the source of the data so that you can source it repeatedly&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Figure 7 shows an example of what to avoid when picking data. This sample dataset of 10,000 employee income records has multiple null values, negative values for defaulters, and repeating names and addresses. Data of this quality is unreliable and should be discarded. Bring in additional sample data to ensure you are picking the right data.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFhWAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123183i65FBE3294A06160D/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFhWAAU.jpg" alt="0693p000008uFhWAAU.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Understand the data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Understanding data is essential in assessing data quality and accuracy. It is also important to check how the data fits with governance rules and policies. Once you understand the data, you can determine the right level of quality for the data. Best practices for understanding the data are:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Learn data, file, and database formats&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use visualization capabilities to examine the current state of the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Spot irregularities and inconsistencies in the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use profiling to generate data quality metrics and statistical analysis of the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Understand the limitations of the data&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;As highlighted below, Talend Data Preparation assists in the process of understanding data.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFk0AAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122312iDAE0F2CC34774D31/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFk0AAE.jpg" alt="0693p000008uFk0AAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 8: Data Preparation showing the different data patterns and the valid and invalid record percentage&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Verify data types and formats&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Data preparation always starts with raw data files, which come in many shapes and sizes. Mainframe data is different from PC data, spreadsheet data is formatted differently from web data, and so forth. In the age of big data, there is a lot of variance in source files.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Make sure that you can read the files in the correct format.&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Ensure that the data types used are accurate. You need to look at what each field contains. For example, check that a field listed as a number actually contains a numeric quantity, not a phone number or postal code. Likewise, a character field should not contain only numeric data.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Data Preparation shows the input that was read successfully, along with the data type of each column, as shown in Figure 9.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFk5AAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/125190i3D5F969A5E48999F/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFk5AAE.jpg" alt="0693p000008uFk5AAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 9: Data Preparation showing the data types for the input data&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: by using a &lt;A href="#data dictionary" target="_self"&gt;data dictionary&lt;/A&gt;, you can set the type needed for every column.&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data integration&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Data integration involves combining data residing in different sources and providing users with a unified view of them. Talend Data Preparation provides a platform where you can integrate data while discovering and profiling. This section explains some of the best practices to keep in mind while integrating data.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Improve the data&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Once you have assessed the data's quality and accuracy, and have determined the right level of quality for the purpose of the data, as a best practice you must improve the data by:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Cleansing the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Noting missing data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Performing identity resolution&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Refining and merging-purging the data&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Data Preparation offers numerous functions for improving the data as shown in Figure 10.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFSvAAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123453i2E2368E4E1795BE3/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFSvAAM.jpg" alt="0693p000008uFSvAAM.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 10: Data Preparation offers numerous functions for improving the data&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Integrate the data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;A powerful feature of Data Preparation is the ability to integrate datasets. This takes data preparation to the next level, as now a business can perform simple joins and lookups while preparing rules. As a best practice, integrate data to suit the following needs:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Validating new sources&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Integrating and blending data with data from other sources&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Restructuring the data according to the needed format for business intelligence, integration, blending, and analysis&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Transposing the data&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The following screenshot is an example of combining two datasets in Data Preparation.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFbiAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122212iA3F0AC451053122B/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFbiAAE.jpg" alt="0693p000008uFbiAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 11: Combining two datasets in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data cleansing, standardizing, and shaping&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;The following best practices describe the techniques to keep in mind while cleansing, standardizing, and shaping data.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Transform the data&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Talend Data Preparation is a powerful tool enabling business users to transform their data. Most of the simple yet important transformations can now be applied with simple clicks. As a best practice, Talend recommends:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Creating generalized rules to transform data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Applying transformation functions to structured and unstructured data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Enriching and completing the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Determining the levels of aggregation needed to answer business questions&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Using filters to tailor data for reports or analysis&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Incorporating formulas for manipulation requirements&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkFAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122202i55F505A968E5C2E2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkFAAU.jpg" alt="0693p000008uFkFAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 12: Data transformations applied to a dataset&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Verify data accuracy&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;While building preparations, ensure that the data is accurate and that it makes sense. This is an important step and requires some knowledge of the subject area that the dataset relates to. There is no single approach to verifying data accuracy.&lt;/P&gt;
&lt;P&gt;The basic idea is to formulate some properties that you think the data should exhibit, and test the data to see if those properties are satisfied. Essentially, you are trying to figure out whether the data really is what you have been told it is. In this example, the ID always has to be an 18-digit number, so there is a preparation to validate the ID length.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFbxAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123327iC50A827551FD0395/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFbxAAE.jpg" alt="0693p000008uFbxAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 13: An example of functions written to verify data accuracy&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
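&lt;P&gt;The same property test can be expressed outside the tool. The following Python sketch is an illustration only and does not reproduce the functions shown in Figure 13: the function name &lt;STRONG&gt;invalid_ids&lt;/STRONG&gt; and the sample values are hypothetical. It states the property (an ID is exactly 18 digits) and returns the values that violate it.&lt;/P&gt;
&lt;PRE&gt;
```python
import re

# The property from the text: a valid ID is exactly 18 digits.
ID_PATTERN = re.compile(r"[0-9]{18}")


def invalid_ids(ids):
    """Return the IDs that are not exactly 18 digits long."""
    return [value for value in ids if ID_PATTERN.fullmatch(value) is None]


# Hypothetical sample: one valid ID, one too short, one with a letter.
sample = ["123456789012345678", "12345", "12345678901234567X"]
print(invalid_ids(sample))
```
&lt;/PRE&gt;
&lt;P&gt;An empty result means every ID satisfies the property. Any value returned needs to be corrected or investigated.&lt;/P&gt;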
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Identify outliers&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Outliers are problematic because they can severely compromise the outcome of an analysis. For example, the mean is supposed to represent the center of the data, but a single outlier can shift it so far that it no longer does. In a sense, this one outlier renders the mean useless.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Outliers are data points that are distant from the rest of the distribution. They are either very large or very small values compared with the rest of the dataset.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;When faced with outliers, the most common strategy is to delete them. However, it depends on the individual project requirements.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Talend Data Preparation makes it easier to identify outliers and to apply functions to them, as shown in Figure 14.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFYFAA2.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124460i451104C40653C24C/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFYFAA2.jpg" alt="0693p000008uFYFAA2.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 14: Quick identification of outliers in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
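&lt;P&gt;The effect of a single outlier on the mean can be checked with a short calculation. The following Python sketch uses hypothetical income values chosen for this illustration and compares the mean and the median with and without the outlier.&lt;/P&gt;
&lt;PRE&gt;
```python
from statistics import mean, median

# Hypothetical monthly incomes; the last value is an outlier.
incomes = [3000, 3200, 3100, 2900, 3300, 250000]
without_outlier = incomes[:-1]

# Compare both measures of center with and without the outlier.
print("mean without outlier:", mean(without_outlier))
print("mean with outlier:", mean(incomes))
print("median without outlier:", median(without_outlier))
print("median with outlier:", median(incomes))
```
&lt;/PRE&gt;
&lt;P&gt;With these values, the mean moves from 3100 to 44250 when the outlier is included, while the median only moves from 3100 to 3150. Comparing the two measures is one simple way to notice that outliers are present.&lt;/P&gt;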
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data enrichment&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Data enrichment is a value-adding process that provides the customer with more information about the data. Use the methods given below to enrich data.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Deal with missing values&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Missing values pose a risk to the data being analyzed. They are probably one of the most common data problems you will encounter. As a best practice, Talend recommends that you resolve the missing values. The method depends on the project, but you can:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Replace the missing values with an appropriate value&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Replace them with a flag to indicate a blank&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Delete the row/record&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkKAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124389i202EA58545095420/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkKAAU.jpg" alt="0693p000008uFkKAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 15: Dealing with missing values&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
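&lt;P&gt;The three options above can be sketched as follows. This Python example is an illustration only, not the Data Preparation implementation: the record layout, the function names, and the &lt;STRONG&gt;MISSING&lt;/STRONG&gt; flag are assumptions made for this sketch, and &lt;STRONG&gt;None&lt;/STRONG&gt; stands for a missing value.&lt;/P&gt;
&lt;PRE&gt;
```python
# Hypothetical records; None stands for a missing income value.
records = [
    {"name": "Ana", "income": 4200},
    {"name": "Ben", "income": None},
    {"name": "Cy", "income": 3900},
]


def replace_missing(rows, column, default):
    """Option 1: replace missing values with an appropriate value."""
    return [
        {**row, column: default if row[column] is None else row[column]}
        for row in rows
    ]


def flag_missing(rows, column, flag="MISSING"):
    """Option 2: replace missing values with a flag that marks the blank."""
    return replace_missing(rows, column, flag)


def drop_missing(rows, column):
    """Option 3: delete the rows where the value is missing."""
    return [row for row in rows if row[column] is not None]


print(replace_missing(records, "income", 0))
print(flag_missing(records, "income"))
print(drop_missing(records, "income"))
```
&lt;/PRE&gt;
&lt;P&gt;Each function returns a new list and leaves the original records unchanged, so the three options can be compared on the same input before one is chosen.&lt;/P&gt;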
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Share and reuse preparations&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Reusability saves a lot of time and effort and makes the whole software development lifecycle easier. With Talend Data Preparation, you can share preparations and datasets with individual users, or with a group of users. Best practices include:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Sharing and reusing data preparations&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Placing the shareable preparation in a shared folder, thereby enabling collaborative work&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFfbAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123635iC4C95E623160BA3F/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFfbAAE.jpg" alt="0693p000008uFfbAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 16: Data Preparation options to share the folder&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Security&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Follow the methods given below to secure data while working with Talend Data Preparation.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Protect data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;As a best practice, masking is an excellent way to protect sensitive data such as names, addresses, credit cards, or social security numbers. To protect the original data while having a functional substitute, you can use the &lt;STRONG&gt;Mask data (obfuscation)&lt;/STRONG&gt; function.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjSAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124714iB5EB5DDB856CA675/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjSAAU.jpg" alt="0693p000008uFjSAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 17: Masking function available in Talend Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
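&lt;P&gt;To illustrate the idea of a functional substitute, the following Python sketch masks a value while keeping its format. It is a generic example and does not reproduce the algorithm behind the &lt;STRONG&gt;Mask data (obfuscation)&lt;/STRONG&gt; function: the function name &lt;STRONG&gt;mask_value&lt;/STRONG&gt;, its parameters, and the choice to keep the last four characters visible are assumptions made for this sketch.&lt;/P&gt;
&lt;PRE&gt;
```python
def mask_value(value, visible=4, mask_char="X"):
    """Mask every letter and digit except the last few, keeping separators in place."""
    total = sum(1 for ch in value if ch.isalnum())
    to_mask = max(total - visible, 0)
    masked = []
    for ch in value:
        if ch.isalnum() and to_mask != 0:
            # Replace the character and count down the remaining ones to mask.
            masked.append(mask_char)
            to_mask -= 1
        else:
            # Separators and the trailing visible characters are kept as-is.
            masked.append(ch)
    return "".join(masked)


# Hypothetical card number used only for illustration.
print(mask_value("4111-1111-1111-1234"))
```
&lt;/PRE&gt;
&lt;P&gt;The masked value keeps the length and separators of the original, so downstream formats still work. Note that in this sketch a value with four or fewer characters is returned unmasked, which a real masking rule would need to handle.&lt;/P&gt;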
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Preparation versioning&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Adding versions to your preparation is an excellent way to see the changes made to it over time. Versions also ensure that Talend Jobs always use the same state of a preparation. Even if the preparation is still being worked on, versions can be used in Data Integration as well as Big Data Jobs.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Capture the state of your preparation by creating a version, as shown in Figure 18.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Preparation versions are propagated when sharing or moving a preparation across your folder structure, but not when you copy it or apply it to a new dataset.&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFSDAA2.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122151i3585E3C1568E3E7E/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFSDAA2.jpg" alt="0693p000008uFSDAA2.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 18: Versioning in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Change where log files are stored&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Talend Data Preparation logs allow you to analyze and debug the activity of Talend Data Preparation. By default, Talend Data Preparation writes logs to two places: the console and a log file. The location of this log file depends on the version of Talend Data Preparation that you are using:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Data_Preparation_Path&lt;/EM&gt;/data/logs/app.log&lt;/STRONG&gt; for Talend Data Preparation&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;AppData/Roaming/Talend/dataprep/logs/app.log&lt;/STRONG&gt; for Talend Data Preparation Free Desktop on Windows&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Library/Application Support/Talend/dataprep/logs/app.log&lt;/STRONG&gt; for Talend Data Preparation Free Desktop on macOS&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;As a best practice, Talend recommends that you change the default location of the log file. To do so, edit the &lt;STRONG&gt;logging.file&lt;/STRONG&gt; property in the &lt;STRONG&gt;application.properties&lt;/STRONG&gt; file.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Understand where your data is stored&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Your data is stored in different locations, depending on the version of Talend Data Preparation you are using.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Talend Data Preparation&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;If you are a subscription user, nothing is saved directly on your computer.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Sample data is cached temporarily on the remote Talend Data Preparation server, to improve the product responsiveness. In addition, CSV and Excel datasets are stored permanently on the remote Talend Data Preparation server.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Talend Data Preparation Free Desktop is meant to work locally on your computer, without needing an internet connection. Therefore, when you use a dataset from a local file such as a CSV or Excel file, the data is copied locally to one of the following folders, depending on your operating system:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Windows: &lt;STRONG&gt;C:\Users\&lt;EM&gt;your_user_name&lt;/EM&gt;\AppData\Roaming\Talend\dataprep\store&lt;/STRONG&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;macOS: &lt;STRONG&gt;/Users/&lt;EM&gt;your_user_name&lt;/EM&gt;/Library/Application Support/Talend/dataprep/store&lt;/STRONG&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Center of excellence&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;A center of excellence is a group or team that leads other employees and the organization as a whole in some particular area of focus such as a technology, skill, or discipline. As a best practice, build a center of excellence as suggested below.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Build knowledge&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;As you deal with raw data, Talend recommends that you build knowledge while you analyze the data. You can:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Discover and learn data relationships within and across sources, and find out how the data fits together&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use analytics to discover patterns&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Define the data by collaborating with other business users to define shared rules, business policies, and ownership&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Build knowledge with a catalog, glossary, or metadata repository&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Gain high-level insights to get the big picture of the data and its context&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Document knowledge&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;While it is important to build and enhance your knowledge, it is equally important to document the gained knowledge. In particular, every project must maintain a document for:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Business terminology&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Source data lineage&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;History of changes applied during cleansing&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Relationships to other data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Data usage recommendations&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Associated data governance policies&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Identified data stewards&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Create a data dictionary&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;As you analyze and understand your data, Talend recommends that you record what you learn in a data dictionary. This helps other users identify the data they are working with and establish the relationships between various data elements.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;A data dictionary is a metadata description of the features included in the dataset&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;In Figure 19, the input file has a language column. When the input is first read, the records with two languages in that column are marked as invalid.&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkUAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123214i9923CC5A8B3C18CB/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkUAAU.jpg" alt="0693p000008uFkUAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 19: Input file with an invalid language column&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;When you use the data dictionary to change the metadata so that more than one language is accepted as valid input, Data Preparation shows these records as valid.&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFVgAAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124620i623A1C56A0B7AC02/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFVgAAM.jpg" alt="0693p000008uFVgAAM.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 20: Data dictionary in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Back up&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Backing up Talend Data Preparation and the Talend Data Dictionary on a regular basis is important so that you can recover from data loss, corruption, or accidental deletion.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Data Preparation&lt;BR /&gt;&lt;BR /&gt;To create a copy of the Talend Data Preparation instance, back up MongoDB, the folders containing your data, the configuration files, and the logs.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Data Dictionary&lt;/P&gt;
Talend Dictionary Service stores all the predefined semantic types used in Talend Data Preparation. It also stores all the custom types created by users, and all the modifications made to existing types.&lt;BR /&gt;&lt;BR /&gt;To back up a Talend Dictionary Service instance, back up MongoDB and the changes made to the predefined semantic types.&lt;/LI&gt;
&lt;/UL&gt;
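&lt;P&gt;The backup steps above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the directory locations, the MongoDB URI, and the function names are hypothetical placeholders, and the actual locations of your data folders, configuration files, and logs depend on your installation. The sketch archives the folders you pass in and builds, without running, a standard mongodump command.&lt;/P&gt;

```python
import tarfile
from pathlib import Path


def archive_directories(directories, archive_path):
    """Bundle the given directories into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for directory in directories:
            directory = Path(directory)
            # Store each directory under its own name inside the archive.
            archive.add(directory, arcname=directory.name)
    return Path(archive_path)


def mongodump_command(uri, output_dir):
    """Build (but do not run) a mongodump invocation for the given MongoDB URI."""
    # The URI and output directory are supplied by the caller; nothing is assumed here.
    return ["mongodump", f"--uri={uri}", f"--out={output_dir}"]
```

&lt;P&gt;In practice you would pass the returned command list to your scheduler or to a process runner, and call the archive helper with the data, configuration, and log folders of your own instance.&lt;/P&gt;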
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Operationalizing&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Talend Data Preparation lets you operationalize the recipes you will use in Talend Studio. This section covers the best practices for operationalizing.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Promote preparations between environments&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;The best practice when using Talend Data Preparation is to set up one instance for each environment of your production chain.&lt;/P&gt;
&lt;P&gt;Talend only supports promoting a preparation between identical product versions. To promote a preparation from one environment to another, export it from the source environment, then import it into the target environment. For the import to work, a dataset with the same name and schema as the one that the export was based on must exist on the target environment.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Hybrid preparation environments&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Sometimes the transformations are either too complex or too bulky to be created in a simple form. To help you in such scenarios, Talend offers a hybrid preparation environment. As a best practice, leverage Studio to create real-time datasets, and use these datasets for preparations.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;You can use either the dedicated Talend preparation service or Talend Jobs to create data preparations&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Leverage the &lt;STRONG&gt;tDatasetOutput&lt;/STRONG&gt; component for output in &lt;STRONG&gt;Create&lt;/STRONG&gt; mode&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;Figure 21 shows the &lt;STRONG&gt;tDatasetOutput&lt;/STRONG&gt; component properties:&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFU3AAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122401i5CA1B381814802AC/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFU3AAM.jpg" alt="0693p000008uFU3AAM.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 21: tDatasetOutput component properties&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Running the Job creates the dataset in Talend Data Preparation as shown below.&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkjAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122734iD65F7C3AFCE791BB/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkjAAE.jpg" alt="0693p000008uFkjAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 22: Run the Job and create the dataset&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Operationalize a recipe in a Talend Job&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;The &lt;STRONG&gt;tDataprepRun&lt;/STRONG&gt; component allows you to reuse an existing preparation, made in Talend Data Preparation, directly in a Data Integration Job. In other words, you can operationalize the process of applying a preparation to input files that have the same model.&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Using a recipe as part of a Data Integration flow, or a Talend Spark Batch or Streaming Job in Talend Studio&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;The figure below shows the usage of a preparation/recipe in a Talend Job.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjTAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124091i6BE409B9E2D7B989/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjTAAU.jpg" alt="0693p000008uFjTAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 23: Using a preparation/recipe in a Talend Job&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;You can select a specific preparation as shown below.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFROAA2.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122163i0D11B01FF995079F/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFROAA2.jpg" alt="0693p000008uFROAA2.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 24: Select a specific Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Alternatively, you can specify a dynamic preparation as shown in Figure 25. By using a dynamic preparation with context variables, you can build a single Job template to use across projects or organizations.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFktAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123915i96B71ABFB6218B3C/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFktAAE.jpg" alt="0693p000008uFktAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 25: Dynamic preparation selection&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: To use the &lt;STRONG&gt;tDataprepRun&lt;/STRONG&gt; component with Talend Data Preparation Cloud, you must have the 6.4.1 version of Talend Studio installed.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Creating a live dataset&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;What if your business needs live data for analysis rather than a static sample? A live dataset is fed by a Job, and because that Job is designed in Talend Studio, you can take advantage of the full palette of components and their Data Quality or Big Data capabilities. Unlike a local file import, where the data is stored on the Talend Data Preparation server for as long as the file exists, a live dataset retrieves its sample data only temporarily.&lt;/P&gt;
&lt;P&gt;It is possible to retrieve the result of Talend Cloud flows that were executed on a Talend Cloud engine, as well as on remote engines.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Use a preparation as part of a Data Integration flow, or a Talend Spark Batch or Streaming Job in Talend Studio.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;The live dataset feature allows you to create a Job in Talend Studio, execute it on demand using Talend Cloud as a flow, and retrieve a dataset with the sample data directly in Talend Data Preparation Cloud.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The screenshots below show an example of a Job creating a live dataset:&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkyAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124533i00BE08981F1B4840/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkyAAE.jpg" alt="0693p000008uFkyAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkaAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123068iC94261667A863AFF/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkaAAE.jpg" alt="0693p000008uFkaAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFR4AAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123955i77A12DD964D7BBD4/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFR4AAM.jpg" alt="0693p000008uFR4AAM.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: To create live datasets, you must have the 6.4.1 version of Talend Studio installed, patched with at least the 0.19.3 version of the Talend Data Preparation components.&lt;/P&gt;</description>
    <pubDate>Tue, 23 Jan 2024 02:35:30 GMT</pubDate>
    <dc:creator>TalendSolutionExpert</dc:creator>
    <dc:date>2024-01-23T02:35:30Z</dc:date>
    <item>
      <title>Talend Data Preparation Best Practices</title>
      <link>https://community.qlik.com/t5/Official-Support-Articles/Talend-Data-Preparation-Best-Practices/ta-p/2151657</link>
      <description>&lt;P&gt;Talend Data Preparation is a self-service application that enables you to simplify and expedite the time-consuming process of preparing data for analysis or other data-driven tasks.&lt;/P&gt;
&lt;P&gt;This article explains the best practices that Talend suggests you follow when working with Talend Data Preparation.&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Content:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data cataloging&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;While working with large datasets, various inputs, and large teams, it is important to classify datasets and preparations. Talend recommends the following best practices to categorize the artifacts.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Follow naming conventions&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Although naming conventions vary by person and organization, following them consistently makes it significantly easier for those who maintain the system later to understand what it is doing and how to fix or extend it for new business needs. While working with Data Preparation, the best practice is to follow the agreed naming standards for folders, preparations, datasets, and context variables.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Folders&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Use the following guidelines to name folders for Preparations:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Use camel case&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Separate with underscores&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Do not use whitespace&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use only alphanumeric characters&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid general folder names&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid short forms&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFfqAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122392i3170C8F2EEC8B272/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFfqAAE.jpg" alt="0693p000008uFfqAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
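&lt;P&gt;The mechanical parts of these guidelines can be checked automatically. The Python sketch below is one possible interpretation, not a Talend feature: the function name, the list of generic names, and the exact pattern are assumptions made for illustration. It covers whitespace, non-alphanumeric characters, and overly general names; camel case and short forms still need human review.&lt;/P&gt;

```python
import re

# Hypothetical list of names considered too general; adapt it to your own standards.
GENERIC_NAMES = {"data", "test", "misc", "new", "temp", "folder"}

# One or more alphanumeric words joined by single underscores.
NAME_PATTERN = re.compile(r"[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*")


def check_folder_name(name):
    """Return a list of guideline violations for a folder name (empty if none)."""
    problems = []
    if any(ch.isspace() for ch in name):
        problems.append("contains whitespace")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("use alphanumeric words separated by underscores")
    if name.lower() in GENERIC_NAMES:
        problems.append("name is too general")
    return problems
```

&lt;P&gt;For example, a name such as &lt;STRONG&gt;CreditCard_Defaulters&lt;/STRONG&gt; passes all three checks, whereas a name containing a space is reported.&lt;/P&gt;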
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Preparations and datasets&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Preparations and datasets are typically local to a project, so you can set their naming conventions either globally at the organization level, or locally at the project level. Ensure that the naming conventions are strictly followed. As a guideline, consider including the following elements in a name:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Extracted source name&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Dataset extraction date, as a prefix or suffix&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Business usage&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Rules applied&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjlAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123089i970B1ADD977C4454/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjlAAE.jpg" alt="0693p000008uFjlAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFcWAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/125189i63414D83BF2328F4/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFcWAAU.jpg" alt="0693p000008uFcWAAU.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Context variables&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Guidelines for using context variables while calling data preparations from Talend Data Integration or Big Data Jobs are:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Create additional contexts for project-specific requirements&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Limit the number of additional contexts you create to fewer than three per project&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Instead, opt for a common context group&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Context variables must be descriptive&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid one-character context variables, for example, &lt;STRONG&gt;a&lt;/STRONG&gt;, &lt;STRONG&gt;b&lt;/STRONG&gt;, &lt;STRONG&gt;c&lt;/STRONG&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid generic names like &lt;STRONG&gt;var1&lt;/STRONG&gt; or &lt;STRONG&gt;var2&lt;/STRONG&gt;&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjqAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122061i8A58CA4E2E2A2B31/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjqAAE.jpg" alt="0693p000008uFjqAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFauAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123114iBEB5033D99DC1AA3/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFauAAE.jpg" alt="0693p000008uFauAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Define and follow folder structures&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Use folder structures to group items of similar categories or behaviors. Because the appropriate structure is specific to each product, Talend recommends that you define the folder structures in the project's initial phases. Figure 6 is an example of a folder structure followed in a bank, where the folders are divided by module. Group datasets by:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Business modules&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Sources&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Rules applied&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Intake areas&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjvAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124865i2A12CE717561F7A7/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjvAAE.jpg" alt="0693p000008uFjvAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 6: A bank user following folder structures as business use case (CreditCard_Defaulters)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data discovery and profiling&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Data profiling and data discovery allow you to analyze and identify the relationships between your data. This section explains some of the best practices for discovering and profiling data.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Pick the right data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Picking the right data is about finding the data best suited for a specific purpose. It is important to note that this should not only be about finding the data you need right now, but it should also make it easier to find data later, when similar needs arise. Best practices for picking the right data are:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Explore and find the data best suited for a specific purpose&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Avoid data with multiple nulls or same/repeated values&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Select values close to the source - avoid calculated or derived values&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Avoid intermediate values&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Extract data across multiple platforms&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Determine data suitability (for example, discovery, reporting, monitoring, and decision making)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Filter data to select a subset that meets the rules and conditions&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Know the source of the data so that you can source it repeatedly&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Figure 7 illustrates what to avoid when picking data. This sample dataset of 10,000 employee income records has multiple null values, negative values for defaulters, and repeating names and addresses. Data of this quality should be discarded. Bring in additional sample data to ensure you are picking the right data.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFhWAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123183i65FBE3294A06160D/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFhWAAU.jpg" alt="0693p000008uFhWAAU.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
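&lt;P&gt;Outside Talend Data Preparation, the same screening for nulls and repeated values can be approximated with a few lines of Python. This is a minimal sketch, not the product's profiling logic: the function name and the returned keys are assumptions, and what counts as too many nulls or repeats is left to you.&lt;/P&gt;

```python
from collections import Counter


def profile_column(values):
    """Return the share of missing values and the share taken by the most common value."""
    total = len(values)
    if total == 0:
        return {"null_ratio": 0.0, "top_value_ratio": 0.0}
    # Treat None and blank strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    missing = total - len(present)
    # Count how often the single most frequent non-missing value occurs.
    top_count = Counter(present).most_common(1)[0][1] if present else 0
    return {"null_ratio": missing / total, "top_value_ratio": top_count / total}
```

&lt;P&gt;A column with a high null ratio, or one dominated by a single repeated value, is a candidate for exclusion or for sourcing additional sample data.&lt;/P&gt;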
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Understand the data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Understanding data is essential in assessing data quality and accuracy. It is also important to check how the data fits with governance rules and policies. Once you understand the data, you can determine the right level of quality for the data. Best practices for understanding the data are:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Learn data, file, and database formats&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use visualization capabilities to examine the current state of the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Spot irregularities and inconsistencies in the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use profiling to generate data quality metrics and statistical analysis of the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Understand the limitations of the data&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;As highlighted below, Talend Data Preparation assists in the process of understanding data.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFk0AAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122312iDAE0F2CC34774D31/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFk0AAE.jpg" alt="0693p000008uFk0AAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 8: Data Preparation showing the different data patterns and the valid and invalid record percentage&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Verify data types and formats&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Data preparation always starts with a raw data file, and raw data files come in many shapes and sizes. Mainframe data is different from PC data, spreadsheet data is formatted differently from web data, and so forth. In the age of big data, there is a lot of variance in source files.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Make sure that you can read the files in the correct format.&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Ensure that the data types used are accurate. You need to look at what each field contains. For example, check that a field listed as a number actually contains a number, not a phone number or postal code. Likewise, a character field should not contain all numeric data.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
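&lt;P&gt;As a rough illustration of the second check, the Python sketch below flags values in a column declared as numeric that look more like identifiers. It is a heuristic under stated assumptions, not how Talend Data Preparation infers types: the function name is hypothetical, and treating leading zeros as a sign of a postal code or similar identifier is a simplification.&lt;/P&gt;

```python
def numeric_column_issues(values):
    """Return values that do not behave like true numbers in a column declared as numeric."""
    suspicious = []
    for raw in values:
        text = str(raw).strip()
        try:
            float(text)
        except ValueError:
            # Not parseable as a number at all, for example a formatted phone number.
            suspicious.append(raw)
            continue
        # Leading zeros usually signal an identifier such as a postal code.
        digits = text.lstrip("+-")
        if len(digits) != 1 and digits.startswith("0") and not digits.startswith("0."):
            suspicious.append(raw)
    return suspicious
```

&lt;P&gt;Values returned by this check are worth reviewing before you keep the column typed as a number.&lt;/P&gt;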
&lt;P&gt;Data Preparation shows the successfully read input along with its data types, as shown in Figure 9.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFk5AAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/125190i3D5F969A5E48999F/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFk5AAE.jpg" alt="0693p000008uFk5AAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 9: Data Preparation showing the data types for the input data&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: by using a &lt;A href="#data dictionary" target="_self"&gt;data dictionary&lt;/A&gt;, you can set the type needed for every column.&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data integration&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Data integration involves combining data residing in different sources and providing users with a unified view of them. Talend Data Preparation provides a platform where you can integrate data while discovering and profiling. This section explains some of the best practices to keep in mind while integrating data.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Improve the data&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Once you have assessed the data's quality and accuracy, and have determined the right level of quality for the purpose of the data, as a best practice you must improve the data by:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Cleansing the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Noting missing data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Performing identity resolution&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Refining and merging-purging the data&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Data Preparation offers numerous functions for improving the data as shown in Figure 10.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFSvAAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123453i2E2368E4E1795BE3/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFSvAAM.jpg" alt="0693p000008uFSvAAM.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 10: Data Preparation offers numerous functions for improving the data&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Integrate the data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;A powerful feature of Data Preparation is the ability to integrate datasets. This takes data preparation to the next level, because business users can now perform simple joins and lookups while preparing rules. As a best practice, integrate data to suit the following needs:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Validating new sources&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Integrating and blending data with data from other sources&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Restructuring the data according to the needed format for business intelligence, integration, blending, and analysis&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Transposing the data&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The following screenshot is an example of combining two datasets in Data Preparation.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFbiAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122212iA3F0AC451053122B/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFbiAAE.jpg" alt="0693p000008uFbiAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 11: Combining two datasets in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data cleansing, standardizing, and shaping&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;The following best practices describe the techniques to keep in mind while cleansing, standardizing, and shaping data.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Transform the data&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Talend Data Preparation is a powerful tool enabling business users to transform their data. Most of the simple yet important transformations can now be applied with simple clicks. As a best practice, Talend recommends:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Creating generalized rules to transform data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Applying transformation functions to structured and unstructured data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Enriching and completing the data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Determining the levels of aggregation needed to answer business questions&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Using filters to tailor data for reports or analysis&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Incorporating formulas for manipulation requirements&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkFAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122202i55F505A968E5C2E2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkFAAU.jpg" alt="0693p000008uFkFAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 12: Data transformations applied to a dataset&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Verify data accuracy&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;While building preparations, ensure that the data is accurate and that it makes sense. This important step requires some knowledge of the subject area that the dataset relates to. There is no single approach to verifying data accuracy.&lt;/P&gt;
&lt;P&gt;The basic idea is to formulate some properties that you think the data should exhibit, and test the data to see if those properties are satisfied. Essentially, you are trying to determine whether the data really is what you have been told it is. In the example in Figure 13, the ID must always be an 18-digit number, so the preparation includes a step that validates the ID length.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFbxAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123327iC50A827551FD0395/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFbxAAE.jpg" alt="0693p000008uFbxAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 13: An example of functions written to verify data accuracy&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
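&lt;P&gt;The same kind of property check can be sketched outside the tool. The Python snippet below is a minimal illustration, not the function used in Data Preparation: it assumes IDs arrive as strings, and the names &lt;STRONG&gt;is_valid_id&lt;/STRONG&gt; and &lt;STRONG&gt;split_by_validity&lt;/STRONG&gt; are hypothetical.&lt;/P&gt;

```python
import re

# Rule taken from the example above: an ID is valid only if it is exactly
# 18 digits long. The pattern name is hypothetical.
ID_PATTERN = re.compile(r"[0-9]{18}")


def is_valid_id(value: str) -> bool:
    """Return True when value consists of exactly 18 ASCII digits."""
    return ID_PATTERN.fullmatch(value) is not None


def split_by_validity(ids: list[str]) -> tuple[list[str], list[str]]:
    """Separate IDs into (valid, invalid) so invalid ones can be reviewed."""
    valid = [i for i in ids if is_valid_id(i)]
    invalid = [i for i in ids if not is_valid_id(i)]
    return valid, invalid
```

&lt;P&gt;Keeping the invalid values in their own list, rather than discarding them, makes it possible to review why they failed before deciding how to fix them.&lt;/P&gt;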
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Identify outliers&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Outliers are problematic because they can severely distort the results of an analysis. For example, a single outlier can shift the value of the mean significantly. Because the mean is supposed to represent the center of the data, that one outlier can make the mean misleading.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Outliers are data points that are distant from the rest of the distribution. They are either very large or very small values compared with the rest of the dataset.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;When faced with outliers, a common strategy is to delete them. However, the right approach depends on the requirements of the individual project.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Talend Data Preparation helps you identify outliers by making the functions shown in Figure 14 easy to apply.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFYFAA2.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124460i451104C40653C24C/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFYFAA2.jpg" alt="0693p000008uFYFAA2.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 14: Quick identification of outliers in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
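&lt;P&gt;To make the effect on the mean concrete, the Python sketch below compares the mean and the median of a small sample and flags outliers with the interquartile range (IQR) rule. The IQR rule is a general statistical technique chosen here for illustration; it is not necessarily what Data Preparation uses internally, and the sample values and the function name &lt;STRONG&gt;iqr_outliers&lt;/STRONG&gt; are hypothetical.&lt;/P&gt;

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v > high or low > v]


# Hypothetical sample: 310 stands in for a data-entry error.
ages = [31, 29, 35, 33, 30, 32, 34, 310]

print(statistics.mean(ages))    # pulled far away from the typical value
print(statistics.median(ages))  # barely affected by the single outlier
print(iqr_outliers(ages))       # the values flagged for review
```

&lt;P&gt;Whether the flagged values are deleted, corrected, or kept remains a project decision, as noted above.&lt;/P&gt;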
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Data enrichment&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Data enrichment is a value-adding process that provides the customer with more information about the data. Use the methods below to enrich data.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Deal with missing values&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Missing values pose a risk to any analysis of the data, and they are probably one of the most common data problems you will encounter. As a best practice, Talend recommends that you resolve the missing values. The method depends on the project, but you can:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Replace the missing values with an appropriate value&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Replace them with a flag to indicate a blank&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Delete the row/record&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkKAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124389i202EA58545095420/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkKAAU.jpg" alt="0693p000008uFkKAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 15: Dealing with missing values&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
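&lt;P&gt;The three options above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Data Preparation implementation: rows are modeled as dictionaries, a value counts as missing when it is absent, None, or blank, and the helper names are hypothetical.&lt;/P&gt;

```python
from typing import Optional

# Hypothetical row model: column name mapped to an optional string value.
Row = dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """Treat None and blank strings as missing."""
    return value is None or value.strip() == ""


def fill_missing(rows: list[Row], column: str, replacement: str) -> list[Row]:
    """Return copies of the rows with missing values in column replaced.

    Pass a meaningful default (option 1) or a flag such as "N/A" (option 2).
    """
    return [
        {**row, column: replacement} if is_missing(row.get(column)) else dict(row)
        for row in rows
    ]


def drop_missing(rows: list[Row], column: str) -> list[Row]:
    """Return copies of only the rows that have a value in column (option 3)."""
    return [dict(row) for row in rows if not is_missing(row.get(column))]
```

&lt;P&gt;Both helpers return new lists and leave the input untouched, so the original data stays available for comparison.&lt;/P&gt;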
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Share and reuse preparations&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Reusability is one of the biggest rewards in software development. It saves time and effort and makes the whole development lifecycle easier. With Talend Data Preparation, you can share preparations and datasets with individual users or with a group of users. Best practices include:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Sharing and reusing data preparations&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Placing the shareable preparation in a shared folder, thereby enabling collaborative work&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFfbAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123635iC4C95E623160BA3F/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFfbAAE.jpg" alt="0693p000008uFfbAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 16: Data Preparation options to share the folder&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Security&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;Follow the methods given below to secure data while working with Talend Data Preparation.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Protect data&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Masking is an excellent way to protect sensitive data such as names, addresses, credit card numbers, or social security numbers. To protect the original data while keeping a functional substitute, use the &lt;STRONG&gt;Mask data (obfuscation)&lt;/STRONG&gt; function.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjSAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124714iB5EB5DDB856CA675/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjSAAU.jpg" alt="0693p000008uFjSAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 17: Masking function available in Talend Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
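&lt;P&gt;As a rough illustration of the idea, the Python sketch below hides all but the last few characters of a value. It is a simple stand-in and does not reproduce the algorithm behind the &lt;STRONG&gt;Mask data (obfuscation)&lt;/STRONG&gt; function; the function name and its parameters are hypothetical.&lt;/P&gt;

```python
def mask_keep_last(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Replace every character except the last `visible` ones with mask_char.

    Values no longer than `visible` are masked completely, so short values
    are never exposed in full.
    """
    if 0 > visible:
        raise ValueError("visible must not be negative")
    if visible >= len(value):
        return mask_char * len(value)
    hidden = len(value) - visible
    return mask_char * hidden + value[hidden:]
```

&lt;P&gt;For example, a 16-digit card number keeps its length and its last four digits, which is often enough for support staff to identify a record without seeing the full number.&lt;/P&gt;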
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Preparation versioning&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Adding versions to your preparation is an excellent way to see the changes made to the preparation over time. Versions also ensure that Talend Jobs always use the same state of a preparation. Even while the preparation is still being worked on, its versions can be used in Data Integration Jobs as well as Big Data Jobs.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Capture the state of your preparation by creating a version, as shown in Figure 18.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Preparation versions are propagated when sharing or moving a preparation across your folder structure, but not when you copy it or apply it to a new dataset.&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFSDAA2.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122151i3585E3C1568E3E7E/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFSDAA2.jpg" alt="0693p000008uFSDAA2.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 18: Versioning in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Change where log files are stored&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Talend Data Preparation logs allow you to analyze and debug the activity of Talend Data Preparation. By default, Talend Data Preparation writes logs to two places: the console and a log file. The location of this log file depends on the version of Talend Data Preparation that you are using:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Data_Preparation_Path&lt;/EM&gt;/data/logs/app.log&lt;/STRONG&gt; for Talend Data Preparation&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;AppData/Roaming/Talend/dataprep/logs/app.log&lt;/STRONG&gt; for Talend Data Preparation Free Desktop on Windows&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Library/Application Support/Talend/dataprep/logs/app.log&lt;/STRONG&gt; for Talend Data Preparation Free Desktop on macOS&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;As a best practice, Talend recommends that you change the default location of the log file. To do so, edit the &lt;STRONG&gt;logging.file&lt;/STRONG&gt; property in the &lt;STRONG&gt;application.properties&lt;/STRONG&gt; file.&lt;/P&gt;
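&lt;P&gt;As a minimal sketch, the entry in &lt;STRONG&gt;application.properties&lt;/STRONG&gt; might look like the following. The path is a hypothetical example; choose a location that suits your environment and that the Data Preparation process is allowed to write to.&lt;/P&gt;

```properties
# Hypothetical target path for the Data Preparation log file.
logging.file=/var/log/talend/dataprep/app.log
```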
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Understand where your data is stored&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Your data is stored in different locations, depending on the version of Talend Data Preparation you are using.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Talend Data Preparation&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;If you are a subscription user, nothing is saved directly on your computer.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Sample data is cached temporarily on the remote Talend Data Preparation server, to improve the product responsiveness. In addition, CSV and Excel datasets are stored permanently on the remote Talend Data Preparation server.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Talend Data Preparation Free Desktop is meant to work locally on your computer, without requiring an internet connection. Therefore, when you use a dataset from a local file such as a CSV or Excel file, the data is copied locally to one of the following folders, depending on your operating system:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Windows: &lt;STRONG&gt;C:\Users\&lt;EM&gt;your_user_name&lt;/EM&gt;\AppData\Roaming\Talend\dataprep\store&lt;/STRONG&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;macOS: &lt;STRONG&gt;/Users/&lt;EM&gt;your_user_name&lt;/EM&gt;/Library/Application Support/Talend/dataprep/store&lt;/STRONG&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Center of excellence&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;
&lt;P&gt;A center of excellence is a group or team that leads other employees and the organization as a whole in some particular area of focus such as a technology, skill, or discipline. As a best practice, build a center of excellence as suggested below.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Build knowledge&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;As you deal with raw data, Talend recommends that you build knowledge while you analyze the data. You can:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Discover and learn data relationships within and across sources, and find out how the data fits together&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Use analytics to discover patterns&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Define the data by collaborating with other business users to define shared rules, business policies, and ownership&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Build knowledge with a catalog, glossary, or metadata repository&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Gain high-level insights to get the big picture of the data and its context&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Document knowledge&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;While it is important to build and enhance your knowledge, it is equally important to document the gained knowledge. In particular, every project must maintain a document for:&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Business terminology&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Source data lineage&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;History of changes applied during cleansing&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Relationships to other data&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Data usage recommendations&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Associated data governance policies&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Identified data stewards&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Create a data dictionary&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;As you analyze and understand your data, Talend recommends that you store that knowledge in a data dictionary. This helps other users identify the data they are working with and establish the relationships between different data.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;A data dictionary is a metadata description of the features included in the dataset&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;In Figure 19, the input file has a language column. When the input is first read, the records with two languages are marked as invalid.&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkUAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123214i9923CC5A8B3C18CB/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkUAAU.jpg" alt="0693p000008uFkUAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 19: Input file with an invalid language column&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;When you use the data dictionary to change the metadata so that more than one language is accepted as valid input, Data Preparation shows the record as valid.&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFVgAAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124620i623A1C56A0B7AC02/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFVgAAM.jpg" alt="0693p000008uFVgAAM.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 20: Data dictionary in Data Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Back up&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Backing up Talend Data Preparation and the Talend Data Dictionary on a regular basis is important to ensure that you can recover from data loss, or from any other cause of data corruption or deletion.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Data Preparation&lt;BR /&gt;&lt;BR /&gt;To create a copy of the Talend Data Preparation instance, back up MongoDB, the folders containing your data, the configuration files, and the logs.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Data Dictionary&lt;BR /&gt;&lt;BR /&gt;Talend Dictionary Service stores all the predefined semantic types used in Talend Data Preparation. It also stores all the custom types created by users, and all the modifications made to existing types.&lt;BR /&gt;&lt;BR /&gt;To back up a Talend Dictionary Service instance, back up MongoDB and the changes made to the predefined semantic types.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
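&lt;P&gt;The backup steps above could be scripted along the following lines. This Python sketch is only an outline under stated assumptions: it relies on the standard MongoDB &lt;STRONG&gt;mongodump&lt;/STRONG&gt; tool being installed and on its path, and the connection string, folder locations, and function names are hypothetical placeholders rather than values from a Talend installation.&lt;/P&gt;

```python
import shutil
import subprocess
from pathlib import Path


def mongodump_command(uri: str, out_dir: Path) -> list[str]:
    """Build the mongodump call that dumps the database into out_dir."""
    return ["mongodump", f"--uri={uri}", f"--out={out_dir}"]


def archive_folder(folder: Path, backup_dir: Path) -> Path:
    """Zip the contents of one folder into backup_dir; return the archive path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(str(backup_dir / folder.name), "zip", root_dir=folder)
    return Path(archive)


def back_up(uri: str, folders: list[Path], backup_dir: Path) -> list[Path]:
    """Dump MongoDB, then archive the data, configuration, and log folders."""
    subprocess.run(mongodump_command(uri, backup_dir / "mongodb"), check=True)
    return [archive_folder(folder, backup_dir) for folder in folders]
```

&lt;P&gt;Passing the data, configuration, and log folders as a list keeps the script independent of where a given installation stores them.&lt;/P&gt;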
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Operationalizing&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;Talend Data Preparation lets you operationalize the recipes you will use in Talend Studio. This section covers the best practices for operationalizing.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Promote preparations between environments&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;The best practice when using Talend Data Preparation is to set up one instance for each environment of your production chain.&lt;/P&gt;
&lt;P&gt;Talend only supports promoting a preparation between identical product versions. To promote a preparation from one environment to another, export it from the source environment, then import it into the target environment. For the import to work, a dataset with the same name and schema as the one that the export was based on must exist in the target environment.&lt;/P&gt;
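&lt;P&gt;The two preconditions for a promotion can be expressed as a small pre-flight check. The Python sketch below is illustrative only and does not call any Talend API: the &lt;STRONG&gt;Dataset&lt;/STRONG&gt; class, the representation of a schema as an ordered tuple of column names, and the version strings are all assumptions.&lt;/P&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical dataset description: a name plus ordered column names."""
    name: str
    columns: tuple[str, ...]


def can_import(
    source: Dataset,
    target_datasets: list[Dataset],
    source_version: str,
    target_version: str,
) -> bool:
    """Check both preconditions for promoting a preparation.

    1. Source and target environments run the identical product version.
    2. The target holds a dataset with the same name and schema as the one
       the exported preparation was based on.
    """
    if source_version != target_version:
        return False
    return any(
        d.name == source.name and d.columns == source.columns
        for d in target_datasets
    )
```

&lt;P&gt;Running a check like this before the export avoids discovering a missing or mismatched dataset only when the import fails.&lt;/P&gt;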
&lt;H4&gt;&lt;STRONG&gt;&lt;FONT color="#339966"&gt;Hybrid preparation environments&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Sometimes the transformations are either too complex or too bulky to be created in a simple form. To help you in such scenarios, Talend offers a hybrid preparation environment. As a best practice, leverage Studio to create real-time datasets, and use these datasets for preparations.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;You can use either the dedicated Talend preparation service or Talend Jobs to create data preparations&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Leverage the &lt;STRONG&gt;tDatasetOutput&lt;/STRONG&gt; component for output in &lt;STRONG&gt;Create&lt;/STRONG&gt; mode&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;Figure 21 shows the &lt;STRONG&gt;tDatasetOutput&lt;/STRONG&gt; component properties:&lt;/P&gt;
&amp;nbsp;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFU3AAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122401i5CA1B381814802AC/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFU3AAM.jpg" alt="0693p000008uFU3AAM.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 21: tDatasetOutput component properties&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Running the Job creates the dataset in Talend Data Preparation as shown below.&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkjAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122734iD65F7C3AFCE791BB/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkjAAE.jpg" alt="0693p000008uFkjAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 22: Run the Job and create the dataset&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Operationalize a recipe in a Talend Job&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;The &lt;STRONG&gt;tDataprepRun&lt;/STRONG&gt; component allows you to reuse an existing preparation, made in Talend Data Preparation, directly in a Data Integration Job. In other words, you can operationalize the process of applying a preparation to input files that have the same model.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Using a recipe as part of a Data Integration flow, or a Talend Spark Batch or Streaming Job in Talend Studio&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;Figure 23 shows a preparation (recipe) used in a Talend Job.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFjTAAU.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124091i6BE409B9E2D7B989/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFjTAAU.jpg" alt="0693p000008uFjTAAU.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 23: Using a preparation/recipe in a Talend Job&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;You can select a specific preparation, as shown in Figure 24.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFROAA2.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/122163i0D11B01FF995079F/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFROAA2.jpg" alt="0693p000008uFROAA2.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 24: Select a specific Preparation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Alternatively, you can specify a dynamic preparation, as shown in Figure 25. By using a dynamic preparation with context variables, you can build a single Job template to use across projects or organizations.&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFktAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123915i96B71ABFB6218B3C/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFktAAE.jpg" alt="0693p000008uFktAAE.jpg" /&gt;&lt;/span&gt;&lt;SPAN class="lia-inline-image-caption"&gt;Figure 25: Dynamic preparation selection&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: To use the &lt;STRONG&gt;tDataprepRun&lt;/STRONG&gt; component with Talend Data Preparation Cloud, you must have version 6.4.1 of Talend Studio installed.&lt;/P&gt;
&lt;H4&gt;&lt;FONT color="#339966"&gt;&lt;STRONG&gt;Creating a live dataset&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;What if your business does not need sampling, but needs live data for analysis? Because the Job is designed in Talend Studio, you can take advantage of the full palette of components and their Data Quality or Big Data capabilities. Unlike a local file import, where the data is stored on the Talend Data Preparation server for as long as the file exists, a live dataset retrieves its sample data only temporarily.&lt;/P&gt;
&lt;P&gt;It is possible to retrieve the result of Talend Cloud flows that were executed on a Talend Cloud engine, as well as on remote engines.&lt;/P&gt;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;
&lt;P&gt;Use a preparation as part of a Data Integration flow, or a Talend Spark Batch or Streaming Job in Talend Studio.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;The live dataset feature allows you to create a Job in Talend Studio, execute it on demand using Talend Cloud as a flow, and retrieve a dataset with the sample data directly in Talend Data Preparation Cloud.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The screenshots below show an example of a Job creating a live dataset:&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkyAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/124533i00BE08981F1B4840/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkyAAE.jpg" alt="0693p000008uFkyAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFkaAAE.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123068iC94261667A863AFF/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFkaAAE.jpg" alt="0693p000008uFkaAAE.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="0693p000008uFR4AAM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/123955i77A12DD964D7BBD4/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693p000008uFR4AAM.jpg" alt="0693p000008uFR4AAM.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: To create live datasets, you must have version 6.4.1 of Talend Studio installed and patched with at least version 0.19.3 of the Talend Data Preparation components.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jan 2024 02:35:30 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Official-Support-Articles/Talend-Data-Preparation-Best-Practices/ta-p/2151657</guid>
      <dc:creator>TalendSolutionExpert</dc:creator>
      <dc:date>2024-01-23T02:35:30Z</dc:date>
    </item>
  </channel>
</rss>

