Talend Data Preparation is a self-service application that enables you to simplify and expedite the time-consuming process of preparing data for analysis or other data-driven tasks.
This article explains the best practices that Talend suggests you follow when working with Talend Data Preparation.
While working with large datasets, various inputs, and large teams, it is important to classify datasets and preparations. Talend recommends the following best practices to categorize the artifacts.
Although naming conventions vary by person or organization, following an agreed convention makes it significantly easier for future team members to understand what the system is doing and how to fix or extend it for new business needs. While working with Data Preparation, the best practice is to follow the agreed naming standards for folders, preparations, datasets, and context variables.
Use the following guidelines to name folders for Preparations:
Use camel case
Separate with underscores
Do not use whitespace
Use only alphanumeric characters
Avoid general folder names
Avoid short forms
Preparations and datasets are typically local to a project, so you can set their naming conventions either globally at the organization level, or locally at the project level. Ensure that the naming conventions are strictly followed. Some guidelines are:
Extracted source name
Prefix or suffix dataset extracted date
Business usage
Rules applied
Guidelines for using context variables when calling data preparations from Talend Data Integration or Big Data Jobs (see the naming sketch after this list) are:
Create additional contexts for project-specific requirements
Limit the number of additional contexts you create to fewer than three per project; instead, opt for a common context group
Context variables must be descriptive
Avoid one-character context variables, for example, a, b, c
Avoid generic names like var1 or var2
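For illustration, here is a minimal sketch of the naming style these guidelines describe. The variable names and values are hypothetical examples, not Talend defaults:

```java
import java.util.Map;

// Illustrative sketch only: the variable names below are hypothetical,
// but show the descriptive naming style recommended above.
public class ContextNamingExample {
    public static void main(String[] args) {
        // Descriptive, self-documenting context variables (good)
        Map<String, String> context = Map.of(
                "dataprep_preparation_path", "/shared/CreditCard_Defaulters/clean_ids",
                "dataprep_dataset_name", "creditcard_defaulters_extract",
                "dataprep_server_url", "http://dataprep.example.com:9999"
        );

        // Avoid names like "a", "b", "var1", "var2" -- they force readers
        // to trace the whole Job to learn what each variable holds.
        context.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```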
Use folder structures to group items of similar categories or behaviors. As folder structures are specific to individual projects, Talend recommends that you define them in the project's initial phases. Figure 6 is an example of a folder structure used at a bank, with folders divided by business module. Group datasets by:
Business modules
Sources
Rules applied
Intake areas
Figure 6: A bank user's folder structure for a business use case (CreditCard_Defaulters)
Data profiling and data discovery allow you to analyze and identify the relationships between your data. This section explains some of the best practices for discovering and profiling data.
Picking the right data is about finding the data best suited for a specific purpose. It is important to note that this should not only be about finding the data you need right now, but it should also make it easier to find data later, when similar needs arise. Best practices for picking the right data are:
Explore and find the data best suited for a specific purpose
Avoid data with many null values or repeated values
Select values close to the source - avoid calculated or derived values
Avoid intermediate values
Extract data across multiple platforms
Determine data suitability (for example, discovery, reporting, monitoring, and decision making)
Filter data to select a subset that meets the rules and conditions
Know the source of the data so that you can source it repeatedly
Figure 7 shows some guidelines for what to avoid when picking the right data. This sample dataset of 10,000 employee income records has multiple null values, negative values for defaulters, and repeating names and addresses. Data like this is a poor candidate and should be discarded. Bring in additional sample data to ensure you are picking the right data.
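To make the null and repeated-value checks concrete, here is a minimal sketch of how such a column screen might look. The thresholds and sample data are assumptions for illustration, not Talend's rules:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Minimal sketch: flag a column as suspect when it has many nulls or
// mostly repeated values. Thresholds and sample data are hypothetical.
public class ColumnQualityCheck {
    static boolean isSuspect(List<String> column) {
        long nulls = column.stream().filter(Objects::isNull).count();
        Set<String> distinct = new HashSet<>(column);
        distinct.remove(null);
        double nullRatio = (double) nulls / column.size();
        double distinctRatio = (double) distinct.size() / column.size();
        // More than 30% nulls, or fewer than 5% distinct values -> suspect
        return nullRatio > 0.30 || distinctRatio < 0.05;
    }

    public static void main(String[] args) {
        List<String> income = Arrays.asList("52000", null, "47500", null, "-12000", null);
        System.out.println("Suspect column: " + isSuspect(income)); // true: 50% nulls
    }
}
```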
Understanding data is essential in assessing data quality and accuracy. It is also important to check how the data fits with governance rules and policies. Once you understand the data, you can determine the right level of quality for the data. Best practices for understanding the data are:
Learn data, file, and database formats
Use visualization capabilities to examine the current state of the data
Spot irregularities and inconsistencies in the data
Use profiling to generate data quality metrics and statistical analysis of the data
Understand the limitation of the data
As highlighted below, Talend Data Preparation assists in the process of understanding data.
Figure 8: Data Preparation showing the different data patterns and the valid and invalid record percentage
Data preparation always starts with a raw data file, which comes in many shapes and sizes. Mainframe data is different than PC data, spreadsheet data is formatted differently than web data, and so forth. In the age of big data, there is a lot of variance in source files.
Ensure that the data types used are accurate. You need to look at what each field actually contains. For example, if a field is declared as a number, check that it holds a true numeric value, not a phone number or postal code. Likewise, a character field should not contain all numeric data.
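A minimal sketch of this kind of type sanity check follows; the patterns and labels are illustrative assumptions, not Data Preparation's built-in rules:

```java
import java.util.regex.Pattern;

// Sketch of a type sanity check: a column declared "integer" should hold
// plain numbers, not phone numbers or postal codes. Patterns are illustrative.
public class TypeCheck {
    private static final Pattern INTEGER = Pattern.compile("-?\\d+");
    private static final Pattern PHONE_LIKE = Pattern.compile("\\+?\\d[\\d\\s().-]{6,}");

    static String classify(String value) {
        if (INTEGER.matcher(value).matches()) return "integer";
        if (PHONE_LIKE.matcher(value).matches()) return "phone-like (not a plain number)";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify("42"));          // integer
        System.out.println(classify("+1 555-0100")); // phone-like
        System.out.println(classify("SW1A 1AA"));    // other (postal code)
    }
}
```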
Data Preparation shows the successfully read input along with its data types, as shown in Figure 9.
Figure 9: Data Preparation showing the data types for the input data
Note: By using a data dictionary, you can set the required type for every column.
Data integration involves combining data residing in different sources and providing users with a unified view of them. Talend Data Preparation provides a platform where you can integrate data while discovering and profiling. This section explains some of the best practices to keep in mind while integrating data.
Once you have assessed the data's quality and accuracy, and have determined the right level of quality for the purpose of the data, as a best practice you must improve the data by:
Cleansing the data
Noting missing data
Performing identity resolution
Refining and merging-purging the data
Data Preparation offers numerous functions for improving the data as shown in Figure 10.
Figure 10: Data Preparation offers numerous functions for improving the data
A powerful feature of Data Preparation is the ability to integrate datasets. This takes data preparation to the next level: a business user can perform simple joins and lookups while building preparations. As a best practice, integrate data to suit the following needs:
Validating new sources
Integrating and blending data with data from other sources
Restructuring the data according to the needed format for business intelligence, integration, blending, and analysis
Transposing the data
The following screenshot is an example of combining two datasets in Data Preparation.
Figure 11: Combining two datasets in Data Preparation
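Conceptually, the combination shown above behaves like a lookup join. The following sketch illustrates the idea outside of Data Preparation; the column names and the left-join default are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of the lookup Data Preparation performs when two
// datasets are combined on a shared key. Column names are hypothetical.
public class LookupJoin {
    public static void main(String[] args) {
        // Lookup dataset: customer_id -> risk segment
        Map<String, String> riskByCustomer = new HashMap<>();
        riskByCustomer.put("C001", "low");
        riskByCustomer.put("C002", "high");

        // Main dataset: one record per transaction
        List<String[]> transactions = List.of(
                new String[]{"C001", "1200.00"},
                new String[]{"C003", "80.50"});

        for (String[] tx : transactions) {
            // Left-join semantics: keep the row even when no match is found
            String risk = riskByCustomer.getOrDefault(tx[0], "unknown");
            System.out.println(tx[0] + "," + tx[1] + "," + risk);
        }
    }
}
```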
The following best practices describe the techniques to keep in mind while cleansing, standardizing, and shaping data.
Talend Data Preparation is a powerful tool enabling business users to transform their data. Most of the simple yet important transformations can now be applied with simple clicks. As a best practice, Talend recommends:
Creating generalized rules to transform data
Applying transformation functions to structured and unstructured data
Enriching and completing the data
Determining the levels of aggregation needed to answer business questions
Using filters to tailor data for reports or analysis
Incorporating formulas for manipulation requirements
Figure 12: Data transformations applied to a dataset
While making preparations, ensure that the data is accurate and that it makes sense. This is an important step and requires some knowledge of the subject area the dataset relates to. There is no single approach to verifying data accuracy.
The basic idea is to formulate some properties that you think the data should exhibit, and test the data to see if those properties are satisfied. Essentially, you are trying to figure out whether the data really is what you have been told it is. In this example, the ID always has to be an 18-digit number, so there is a preparation to validate the ID length.
Figure 13: An example of functions written to verify data accuracy
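A minimal sketch of the 18-digit rule described above, assuming the ID must contain digits only:

```java
// Minimal sketch of the accuracy rule described above: an ID is valid
// only when it is exactly 18 digits. The sample values are illustrative.
public class IdLengthCheck {
    static boolean isValidId(String id) {
        return id != null && id.matches("\\d{18}");
    }

    public static void main(String[] args) {
        System.out.println(isValidId("123456789012345678")); // true  (18 digits)
        System.out.println(isValidId("12345"));              // false (too short)
        System.out.println(isValidId("1234567890123456XY")); // false (not numeric)
    }
}
```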
Outliers are data points that are distant from the rest of the distribution: values that are either very large or very small compared with the rest of the dataset.
Outliers are problematic because they can severely compromise the outcome of an analysis. For example, a single outlier can have a significant impact on the mean, because the mean is supposed to represent the center of the data; one extreme value can render it useless.
When faced with outliers, the most common strategy is to delete them, but the right treatment depends on the individual project's requirements.
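The following sketch illustrates the effect described above: a single extreme value drags the mean away from the center of the data. The values and the cutoff rule are illustrative assumptions:

```java
import java.util.Arrays;

// Sketch of the effect described above: one extreme value drags the mean
// away from the center of the data. Values and cutoff are illustrative.
public class OutlierEffect {
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] salaries = {48_000, 52_000, 50_000, 49_500, 51_000};
        double[] withOutlier = {48_000, 52_000, 50_000, 49_500, 51_000, 1_000_000};

        System.out.printf("Mean without outlier: %.0f%n", mean(salaries));    // ~50100
        System.out.printf("Mean with outlier:    %.0f%n", mean(withOutlier)); // ~208417

        // A crude cutoff: flag values more than 3x the median as outliers.
        double median = 50_000; // middle of the sorted clean sample
        for (double v : withOutlier) {
            if (v > 3 * median) System.out.println("Flagged as outlier: " + v);
        }
    }
}
```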
Talend Data Preparation identifies outliers, making it easier to apply the appropriate functions, as shown in Figure 14.
Figure 14: Quick identification of outliers in Data Preparation
Data enrichment is a value-adding process that provides more information about the data to the customer. Use the methods given below to enrich data.
Missing values pose a potential risk to the data being analyzed, and they are probably one of the most common data problems you will encounter. As a best practice, Talend recommends that you resolve missing values. The right method depends on the project, but as sketched below, you can:
Replace the missing values with an appropriate value
Replace them with a flag to indicate a blank
Delete the row/record
Figure 15: Dealing with missing values
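In code, the three strategies might look like the following sketch. The replacement value and the flag are hypothetical choices:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the three strategies above applied to a nullable column.
// The replacement value and the flag are hypothetical choices.
public class MissingValues {
    public static void main(String[] args) {
        List<String> cities = Arrays.asList("Paris", null, "Nantes", null);

        // 1. Replace with an appropriate default value
        List<String> replaced = new ArrayList<>();
        for (String c : cities) replaced.add(c == null ? "Paris" : c);

        // 2. Replace with an explicit flag so blanks stay visible downstream
        List<String> flagged = new ArrayList<>();
        for (String c : cities) flagged.add(c == null ? "<MISSING>" : c);

        // 3. Delete the row/record entirely
        List<String> dropped = new ArrayList<>();
        for (String c : cities) if (c != null) dropped.add(c);

        System.out.println(replaced); // [Paris, Paris, Nantes, Paris]
        System.out.println(flagged);  // [Paris, <MISSING>, Nantes, <MISSING>]
        System.out.println(dropped);  // [Paris, Nantes]
    }
}
```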
Reusability is the best reward in the coding world. It saves a lot of time and effort and makes the whole software development lifecycle easier. With Talend Data Preparation, you can share the preparations and datasets with individual users, or with a group of users. Best practices include:
Sharing and reusing data preparations
Placing the shareable preparation in a shared folder, thereby enabling collaborative work
Figure 16: Data Preparation options to share the folder
Follow the methods given below to secure data while working with Talend Data Preparation.
As a best practice, masking is an excellent way to protect sensitive data such as names, addresses, credit cards, or social security numbers. To protect the original data while having a functional substitute, you can use the Mask data (obfuscation) function.
Figure 17: Masking function available in Talend Data Preparation
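To illustrate the general idea behind masking (not Data Preparation's actual algorithm), here is a minimal sketch that keeps only the last four digits of a card number visible:

```java
// Illustrative sketch of the idea behind masking: replace sensitive digits
// with a placeholder while keeping enough structure to stay functional.
// This is not Data Preparation's actual algorithm, just the general pattern.
public class MaskExample {
    static String maskCard(String cardNumber) {
        // Replace every digit that is followed by at least four more digits
        return cardNumber.replaceAll("\\d(?=\\d{4})", "*");
    }

    public static void main(String[] args) {
        System.out.println(maskCard("4111111111111111")); // ************1111
    }
}
```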
Adding versions to your preparation is an excellent way to see how the preparation has changed over time, and versions also ensure that the same state of a preparation is always used in Talend Jobs. Even while a preparation is still being worked on, its versions can be used in Data Integration as well as Big Data Jobs.
Capture the state of your preparation by creating a version, as shown in Figure 18.
Preparation versions are propagated when sharing or moving a preparation across your folder structure, but not when you copy it or apply it to a new dataset.
Figure 18: Versioning in Data Preparation
Talend Data Preparation logs allow you to analyze and debug the activity of Talend Data Preparation. By default, Talend Data Preparation logs to two places: the console and a log file. The location of the log file depends on the version of Talend Data Preparation that you are using:
Data_Preparation_Path/data/logs/app.log for Talend Data Preparation
AppData/Roaming/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on Windows
Library/Application Support/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on MacOS
As a best practice, Talend recommends that you change the default location of the log file by editing the logging.file property of the application.properties file, for example: logging.file=/path/to/logs/app.log (the path shown is a placeholder).
Your data is stored in different locations, depending on the version of Talend Data Preparation you are using.
Talend Data Preparation
If you are a subscription user, nothing is saved directly on your computer.
Sample data is cached temporarily on the remote Talend Data Preparation server, to improve the product responsiveness. In addition, CSV and Excel datasets are stored permanently on the remote Talend Data Preparation server.
Talend Data Preparation Free Desktop is meant to work locally on your computer, without the need for an internet connection. Therefore, when using a dataset from a local file such as a CSV or Excel file, the data is copied locally to one of the following folders, depending on your operating system:
Windows: C:\Users\your_user_name\AppData\Roaming\Talend\dataprep\store
OS X: /Users/your_user_name/Library/Application Support/Talend/dataprep/store
A center of excellence is a group or team that leads other employees and the organization as a whole in some particular area of focus such as a technology, skill, or discipline. As a best practice, build a center of excellence as suggested below.
As you deal with raw data, Talend recommends that you build knowledge while you analyze the data. You can:
Discover and learn data relationships within and across sources, and find out how the data fits together
Use analytics to discover patterns
Define the data by collaborating with other business users to define shared rules, business policies, and ownership
Build knowledge with a catalog, glossary, or metadata repository
Gain high-level insights to get the big picture of the data and its context
While it is important to build and enhance your knowledge, it is equally important to document the gained knowledge. In particular, every project must maintain a document for:
Business terminology
Source data lineage
History of changes applied during cleansing
Relationships to other data
Data usage recommendations
Associated data governance policies
Identified data stewards
As you analyze and understand your data, Talend recommends that you store it in a data dictionary. This helps other users identify the data they are working with, and establish the relationships between various data.
A data dictionary is a metadata description of the features included in the dataset.
In Figure 19, the input file has a language column. When the input is first read, records containing two languages are marked as invalid.
Figure 19: Input file with an invalid language column
Using the data dictionary, you can change the metadata to accept more than one language as valid input; Data Preparation then shows the record as valid.
Figure 20: Data dictionary in Data Preparation
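The behavior in Figures 19 and 20 amounts to dictionary-driven validation: validity is defined by the semantic type's value list rather than hard-coded. A minimal sketch, with hypothetical values:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of dictionary-driven validation: validity is defined by the
// semantic type's value list, not hard-coded. Values are illustrative.
public class LanguageDictionary {
    static boolean isValid(String value, Set<String> dictionary) {
        return dictionary.contains(value);
    }

    public static void main(String[] args) {
        Set<String> validLanguages = new HashSet<>(Set.of("English", "French", "German"));

        String record = "English;French"; // two languages in one field
        System.out.println(isValid(record, validLanguages)); // false initially

        // Extending the dictionary (as in Figure 20) makes the record valid
        validLanguages.add("English;French");
        System.out.println(isValid(record, validLanguages)); // true
    }
}
```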
Backing up Talend Data Preparation and the Talend Data Dictionary on a regular basis is important to ensure you can recover from a data loss scenario, or any other causes of data corruption or deletion.
Data Preparation
To create a copy of the Talend Data Preparation instance, back up MongoDB, the folders containing your data, the configuration files, and the logs.
Data Dictionary
Talend Dictionary Service stores all the predefined semantic types used in Talend Data Preparation. It also stores all the custom types created by users, and all the modifications done on existing types.
Talend Data Preparation lets you operationalize the recipes you will use in Talend Studio. This section covers the best practices for operationalizing.
The best practice when using Talend Data Preparation is to set up one instance for each environment of your production chain.
Talend only supports promoting a preparation between identical product versions. To promote a preparation from one environment to the other, you have to export it from the source environment, then import it back to your target environment. For the import to work, a dataset with the same name and schema as the one that the export was based on must exist on the target environment.
Sometimes transformations are too complex or too bulky to be created in a simple form. To help in such scenarios, Talend offers a hybrid preparation environment. As a best practice, leverage Talend Studio to create real-time datasets, and use these datasets in your preparations.
Leverage the tDatasetOutput component for output in Create mode
Figure 21 shows the tDatasetOutput component properties:
Figure 21: tDatasetOutput component properties
Running the Job creates the dataset in Talend Data Preparation as shown below.
Figure 22: Run the Job and create the dataset
The tDataprepRun component allows you to reuse an existing preparation, made in Talend Data Preparation, directly in a Data Integration Job. In other words, you can operationalize the process of applying a preparation to input files that have the same model.
The figure below shows the usage of a preparation/recipe in a Talend Job.
Figure 23: Using a preparation/recipe in a Talend Job
You can select a specific preparation as shown below.
Figure 24: Select a specific Preparation
Or you can specify a dynamic preparation, as shown in Figure 25. By using a dynamic preparation with context variables, you can build a single Job template to use across projects and organizations.
Figure 25: Dynamic preparation selection
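The following sketch illustrates the Job-template idea: the preparation to run is resolved from an external parameter instead of being hard-coded. In a real Job this would be a Talend context variable consumed by tDataprepRun; here a system property stands in for it, and the property name is hypothetical:

```java
// Sketch of the Job-template idea: the preparation to run is resolved from
// an external parameter instead of being hard-coded. In a real Job this
// would be a Talend context variable consumed by tDataprepRun; here a
// system property stands in for it. The property name is hypothetical.
public class DynamicPreparation {
    public static void main(String[] args) {
        String preparationPath =
                System.getProperty("preparation.path", "/shared/default_cleanup");
        System.out.println("Running preparation: " + preparationPath);
        // ... hand preparationPath to the component that applies the recipe
    }
}
```

Run, for example, with java -Dpreparation.path=/shared/CreditCard_Defaulters/clean_ids DynamicPreparation to switch preparations without changing the Job.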
Note: To use the tDataprepRun component with Talend Data Preparation Cloud, you must have the 6.4.1 version of Talend Studio installed.
What if your business does not need sampling, but needs real live data for analysis? Because the Job is designed in Talend Studio, you can take advantage of the full palette of components and their Data Quality or Big Data capabilities. Unlike a local file import, where the data is stored on the Talend Data Preparation server for as long as the file exists, a live dataset retrieves the sample data only temporarily.
It is possible to retrieve the result of Talend Cloud flows that were executed on a Talend Cloud engine, as well as on remote engines.
Use a preparation as part of a Data Integration flow, or a Talend Spark Batch or Streaming Job in Talend Studio.
The live dataset feature allows you to create a Job in Talend Studio, execute it on demand using Talend Cloud as a flow, and retrieve a dataset with the sample data directly in Talend Data Preparation Cloud.
The screenshots below show an example of a Job creating a live dataset:
Note: To create live datasets, you must have the 6.4.1 version of Talend Studio installed, patched with at least the 0.19.3 version of the Talend Data Preparation components.