Using Data Preparation as part of a Data Ingestion Pipeline

TalendSolutionExpert · Apr 1, 2021 6:01:00 AM

Introduction

This article describes how to use Talend Data Preparation (Data Prep) as part of a data ingestion pipeline. This is a common business use case in which data from various disparate systems is required to be translated to a common format, standardized, reformatted, validated, and then passed in that standard format to other systems.

This example shows how you can build a Talend Job to ingest data from a database, convert it to a common format, run a Talend Data preparation against that data, and then pass it on to other systems, in this case exporting as both a Comma Separated Value (CSV) and an MS Excel file. Any of these systems can, of course, be replaced by any number of things, but what will be common will be translating the data to a common format and running a data preparation.

Building the Talend Job

The Job you will build is shown below:

Each component will be dealt with separately.

Input the Data

In this example, you will input data from a MySQL database. You set up a database table in MySQL using some example data; a section of the data is shown below. This shows the first part of your use case, getting data from a system in a specific format. You could use any number of different systems, but this example uses a database.

To represent that data in Talend you need to define a schema, as shown below:

You will then use the tMysqlInput component to get the data from this table. The component configuration is shown below:

Transform the Data

You now need to transform the data into the common format you are using in your use case. This common format is shown below, and contains the following columns:

This is your common format. In this use case, you want to translate ALL incoming data into this format. To do so, use a tMap component and map the fields as shown below:

Run a Data Preparation

The next step is to use a Data Preparation to run against the data. To do so, use the tDataprepRun component. On the component configuration tab, you can either use an existing preparation, or create a new one. In this example, you will create a new one.

On the Component tab, click Create a new one, then select the Edit preparation in your browser icon.
The Data Preparation home page window opens in your web browser and the Data Preparation you just created is displayed:
Select the Data Preparation to edit it. You can choose any number of things to do to the data; this example's use case is shown below:

This example shows selecting a few simple functions for a small number of columns in the data. You may choose to follow this example or choose your own.

Manage the Output

In this example, you will send the data to two different places and formats. You will use the tReplicate component to send multiple copies of the output to one comma separated value file and one MS Excel spreadsheet. This is shown below:

Configure the components like this:

Replicate the input/output schema in the tReplicate Component:
Configure the final output components like this:
- For the tFileOutputExcel component:
- For the tFileOutputDelimited component:

Run the Job

The Job is now built and configured. The next step is to run it, as shown below:

Once run, you should get output similar to this:

Examine the Output

In this example, you have a dataset containing 6040 records. You can now examine the output files, either by opening them or using the Talend Data Viewer:

Comparing this to the input data, you can see that the data has indeed been transformed by running through a Data Preparation.

MoeE · ‎2024-09-04

Hi,

Can this also be used in a Talend Cloud Pipeline?

Regards,

Mohammed

Xiaodi_Shi · ‎2024-09-24

Hello @MoeE

As tDataprepRun component in Talend studio, is talend studio in your use case?

Here are data prep export options we have, so far.

https://help.qlik.com/talend/en-US/data-preparation-user-guide/Cloud/list-of-supported-connectors

In addition to that, we have "cross-dataset" function today and it is the lookup function, which matches data from the current preparation with its counterpart in a reference dataset, allowing you to add this reference data to your preparation.

https://help.qlik.com/talend/en-US/data-preparation-user-guide/Cloud/use-lookup

Best regards

Sabrina

Using Data Preparation as part of a Data Ingestion Pipeline