Avoid multiple header rows?

Anonymous · 2017-05-11

I have a couple of CSV files that I load into Data Prep all at once: in "Add Dataset" I specify only a directory, not individual files. So far, so good.

All files have the same structure, and the first line of each file is the header.

Is there a way to globally set the first row as header for all files? I know there is this "Row" -> "Make as header..." feature, but what happens in my case is:

file1.csv:

Firstname;Lastname;Age
Felix;Kjellberg;23
Julian;Ilett;43

file2.csv:

Firstname;Lastname;Age
Ben;Heck;58
Dave;Jones;48

The result in Data Prep is:

Firstname|Lastname|Age
Ben|Heck|58
Dave|Jones|48
Firstname|Lastname|Age
Felix|Kjellberg|23
Julian|Ilett|43

So even if I set the first row of the result (the blue line) as header, the second "Firstname / Lastname / Age" row (the green line) stays in the data. Is there a way to avoid this?

Anonymous · 2017-05-12

Hi,

Out of curiosity, can you confirm the following?

You are using Data Prep 2.0.
The CSV files are on HDFS.

To answer your question: there is no dedicated dataset parameter or function to remove subsequent occurrences of the header, but you can do it in a single preparation step. Set a filter on the first column with the column header as the filter value (so filter on "Firstname" in your example), then apply the function "delete filtered rows".
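For anyone who wants to check the logic outside Data Prep, here is a minimal Python sketch of the same idea. It is not Data Prep code: the function name `drop_repeated_headers` and the in-memory sample are only assumptions for illustration. It keeps the first row as the header and deletes every later row whose first column equals the header's first column.

```python
import csv
import io


def drop_repeated_headers(rows):
    """Return (header, data), dropping later rows whose first column
    matches the header's first column (hypothetical helper)."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return [], []
    # Same test as the filter step: first column equals the header value.
    data = [row for row in rows if row[:1] != header[:1]]
    return header, data


# The two example files from the question, in the order they appear
# in the combined result.
combined = (
    "Firstname;Lastname;Age\n"
    "Ben;Heck;58\n"
    "Dave;Jones;48\n"
    "Firstname;Lastname;Age\n"
    "Felix;Kjellberg;23\n"
    "Julian;Ilett;43\n"
)

reader = csv.reader(io.StringIO(combined), delimiter=";")
header, data = drop_repeated_headers(reader)

print(header)
for row in data:
    print(row)
```

Note that, just like the filter, this would also delete a genuine data row whose first column happens to contain "Firstname". Comparing the whole row against the header would be a stricter variant.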

Regards,

Gwendal
