Avoid multiple header rows?

Anonymous — Sat, 16 Nov 2024 09:47:42 GMT

I have a couple of CSV files that I load into Data Prep all at once: in "Add Dataset" I specify only a directory, not individual files. So far, so good.

All files have the same structure, and the first line of each is the header.

Is there a way to globally set the first row as header for all files? I know there is this "Row" -> "Make as header..." feature, but what happens in my case is:

file1.csv:

Firstname;Lastname;Age

Felix;Kjellberg;23

Julian;Ilett;43

file2.csv:

Firstname;Lastname;Age

Ben;Heck;58

Dave;Jones;48

The result in Data Prep is:

Firstname Lastname Age

Ben Heck 58

Dave Jones 48

Firstname Lastname Age

Felix Kjellberg 23

Julian Ilett 43

So even if I set the first line (blue) as the header, the repeated header line from the other file (green) stays in the data. Is there a way to avoid this?

Re: Avoid multiple header rows?

Anonymous — Fri, 12 May 2017 12:47:44 GMT

Hi,

Out of curiosity, can you confirm the following?

You are using Data Prep 2.0.
The CSV files are on HDFS.

To answer your question: there is no dedicated data set parameter or function to remove subsequent occurrences of the header, but you can do it in a single preparation step. Set a filter on the first column with the column header as the filter value (so filter on "Firstname" in your example above), then use the function "delete filtered rows".
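If it helps to see the logic outside Data Prep, here is a minimal Python sketch of the same idea: treat the first row as the header, then drop every later row whose first column equals the header name. The helper name merge_without_repeated_headers is hypothetical and not part of Data Prep; the sample data is taken from your post, and the semicolon delimiter is assumed from it.

```python
import csv
import io


def merge_without_repeated_headers(sources, delimiter=";"):
    """Concatenate CSV inputs, keeping only the first header row.

    `sources` is an iterable of file-like objects. This is an
    illustrative helper, not a Data Prep function.
    """
    header = None
    rows = []
    for source in sources:
        for row in csv.reader(source, delimiter=delimiter):
            if not row:
                continue  # skip blank lines
            if header is None:
                header = row  # first non-empty row becomes the header
                continue
            # Same effect as filtering the first column on its header
            # value ("Firstname") and deleting the filtered rows.
            if row[0] == header[0]:
                continue
            rows.append(row)
    return header, rows


# Sample data from the question, loaded in the order Data Prep showed it.
file2 = io.StringIO("Firstname;Lastname;Age\nBen;Heck;58\nDave;Jones;48\n")
file1 = io.StringIO("Firstname;Lastname;Age\nFelix;Kjellberg;23\nJulian;Ilett;43\n")

header, rows = merge_without_repeated_headers([file2, file1])
print(header)
for row in rows:
    print(row)
```

Note that matching on the first column alone would also drop a genuine record whose first name is literally "Firstname". If that is a concern, comparing the whole row against the header (row == header) is the safer test.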

Regards,

Gwendal
