Anonymous
Not applicable

Generic job with dynamic schema without header row

Hi,
I am trying to build a generic job which processes several hundred tables. It reads a flat txt file with CSV formatting into a dynamic schema with a tFileInputDelimited component and, after processing, writes another CSV file. Between input and output the data is processed with a tJavaFlex component which iterates through all columns and processes the data according to the type and length provided in the schema file. The problem I have is that the input file has no header row naming the columns, but the output file requires the correct header row. The corresponding schema of the table is provided in a separate xlsx interface specification file which looks more or less like the one attached. It is read by another subjob prior to the data file.
I managed to transpose the ColumnName values into one semicolon-separated row/String, but after several attempts I have no idea how to get this 'schema' row in front of the input data stream so that it is recognized by the dynamic schema, or how to change the metadata column names in a Java component after reading the file.
The simplest solution would be to write a new temporary data file, insert the header row and append the data from the original file, but this is definitely not a practical solution: the input files can reach several GB in size, and shuffling all that data around just to add one row does not seem very efficient to me.
Anonymous
Not applicable
Author

The images don't really give much information as to what you are doing, so this is based on some assumptions I am making. I don't understand why creating a file to put the header information in first and then appending the rest of the data to that file is inefficient. Are you not writing that data to a file anyway? If so, taking a step to first add the header info would not cause a great deal of latency. I am assuming that most of the files have header information in them, but one or two do not. You can use boolean logic based on your metadata to decide whether to first add the header data and then add the rest as you would do with every other file. Have I missed something here?
Anonymous
Not applicable
Author

Hi,
First, thanks for the quick response; I'll try to make the difficulties I'm facing a bit clearer.
So there is NO header row in any input file. The only way to get the header information is to read the right Excel spreadsheet according to the table name (which is the same as the input file name and is handed to the job via a context variable) and extract the column names from the "ColumnName" column. But as far as I have tested and read in the documentation, reading data into a dynamic schema behaves a bit "special". Using a dynamic schema with tFileInputDelimited will always read a header row "whether the Header value is set to 0 or to 1". So while reading the file, the first data row is misinterpreted as the header row.
I tried two approaches:
1. If I change the metadata.columnNames in a Java component, I lose the information of the first row (see the sketch after this list). So I have to insert the header before reading the file, which then means creating a new temporary file, adding the CSV-formatted header, appending the data and then reading this file in another subjob with the dynamic schema. This works, but writing an 8 GB temporary file just to add one row at the beginning and then writing another 8 GB output file is not very efficient.
2. Creating the output file with the header row first and then appending all data processed by the processing subjob. At first glance this looks quite OK, but since the first row is treated as the header row rather than as data, it is not processed and transformed in the tJavaFlex component which is transforming the data.
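For reference, the column renaming from the first approach in a tJavaRow looks roughly like this (just a sketch; "dynCol" and the "headerRow" globalMap entry are placeholder names I use, and the Dynamic API may differ between Talend versions):

// tJavaRow: rename the columns of the dynamic field "dynCol" using
// the semicolon-separated header String built by the schema subjob
String[] names = ((String) globalMap.get("headerRow")).split(";");
routines.system.Dynamic dyn = input_row.dynCol;
for (int i = 0; i < dyn.getColumnCount() && i < names.length; i++) {
    dyn.getColumnMetadata(i).setName(names[i]);
}
output_row.dynCol = dyn;

This renames the columns fine, but as described above, by the time the flow gets here the first data row has already been consumed as the header.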
So I guess I have two options. One is somehow adding the header row to the data before reading and processing it with Talend's dynamic schema, without writing a whole temporary file; at the moment I read the file in as shown in the picture below.
The other is choosing a predefined schema (a file delimited schema, a generic schema, or whatever is appropriate) dynamically at runtime. But as far as I have dug into this "dynamic schema" topic, I don't think it is possible to select a schema via a variable (see picture).
I'm quite new to Talend and thought it wouldn't be a big deal to read the schema of an input file at runtime, but I'm slowly running out of ideas.
Thanks,
Thomas
Anonymous
Not applicable
Author

Maybe an "out of the box" solution is what you need. How about concatenating the file with a header row file before reading it in? If you are using Linux or Unix to run your jobs you can use....

cat file1.txt file2.txt > new.txt


....using a tSystem component.
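Since the > redirection needs a shell, the tSystem command would have to be wrapped in sh -c, something along these lines (the paths are placeholders for your header and data files):

sh -c "cat /data/header.csv /data/table.csv > /data/merged.csv"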

On Windows you can use the copy command, though it might require a batch file to get it working the way you need.
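For example, something like this should do it (the /b switch stops copy from appending an end-of-file character; paths are placeholders again):

copy /b header.csv+data.csv merged.csv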
This should be significantly quicker than reading in all of the rows and writing the file out one row at a time.
The beauty of Talend is that you can make use of practically any technology already on your system and any library functionality; you just need to know how 🙂
Anonymous
Not applicable
Author

Hi,
Thanks, that could be a solution; I'll give it a try and test how long it takes. But I guess it doesn't help me avoid all the hard-drive I/O of writing a new physical file to disk. I would really like to avoid that if possible, because it generates gigabytes of disk traffic just to add a single row.
The dev environment runs on Windows; for the execution server I'm not sure.
Thanks,
Thomas
Anonymous
Not applicable
Author

I'm afraid you can't prepend rows to the beginning of a file; the filesystem simply doesn't allow it, so you need to create a new file to do this. You can overwrite bytes at the beginning of a file (I have done this before), but that would overwrite data you probably want to keep.
The problem here appears to be the dynamic schema. Do you, by any chance, know the maximum number of columns you may be working with? For example, can you be sure that none of your files will have more than, say, 40 columns? If so, there may be a workaround that avoids the dynamic schema.
I've written a tutorial for dealing with dynamic schemas for people not using the Enterprise Edition of Talend. This is a very basic example, but you sound like you would quite easily be able to extrapolate from this. The tutorial is here.
Anonymous
Not applicable
Author

Sadly a fixed number of columns is not really a practical option. The tables were exported from a legacy host system and vary between just a few and hundreds of columns, across several hundred tables. In contrast, the processing itself is relatively simple to determine from the type and length of each column.
Right now there is a job for every table with a fixed schema, and deploying a small change to all of those jobs takes way too long. That's why a generic job with dynamic schema handling would massively improve maintenance.
Anyway, thanks for the tutorial and the help so far.
Thomas
Anonymous
Not applicable
Author

Please keep in mind that a fixed schema does not need to match the columns exactly. By that I mean that if you have a fixed schema of 100 Strings (for example), it can be used with CSV files having anywhere from 1 to 100 columns. Once the data is imported in String format, you can use your tJavaFlex or tJavaRow to cycle through the 100 columns to see whether there is any data present and carry out checks on data type, etc.
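As a rough illustration, cycling through such a padded schema in a tJavaRow could look like this (a sketch only; the reflection-based field access and the idea that all columns are Strings named col0..col99 are my assumptions):

// tJavaRow: walk every column of a wide fixed String schema
// (e.g. col0..col99) and skip the ones carrying no data
for (java.lang.reflect.Field f : input_row.getClass().getFields()) {
    try {
        Object v = f.get(input_row); // Talend row struct fields are public
        if (v instanceof String && !((String) v).isEmpty()) {
            // data type/length checks against your spec file go here
        }
    } catch (IllegalAccessException e) {
        // cannot happen for public fields, but Java requires the catch
    }
}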
The problem you face here is that there appears to be some assumption that this sort of variance can be handled with a one-size-fits-all solution. Unfortunately that is rarely the case without at least a few complications.

I have one more solution that might suit you IF you can read/write Java or have someone who can. All of Talend's components are supplied with their source freely viewable, and you can also create your own components for Talend. You might be able to get round this issue (of the first row being forced to be treated as the header row) by taking the source code of the component you are using and tweaking it to do what you want. You will need Java and JET skills; JET is basically a tool for writing Java that writes Java. It's not terribly difficult once you get to grips with it. I think that might actually be your more elegant solution.
Anonymous
Not applicable
Author

I tested a bit with piping the header and the data together into a temporary file, and that works so far. But the problem with this is that the data has to be physically moved on the hard drive, which is not acceptable for big files, as writing that much data to disk is slow.
In the tFileInputDelimited documentation there is a note:
If you want to use another row as the header row, set the Header value accordingly.

Is it in any way possible to read the header row from the end of the file, but use the data from the beginning?
I tested the job by appending the header to the end of the file and setting the Header field of the tFileInputDelimited to
(Integer)globalMap.get("tFileRowCount_1_COUNT")-1

but, as I expected, the Header value does not just pick that line as the header row; it also skips all the data rows before it, so nothing is left to read. Accordingly the output file is empty, with a nice header.
Anonymous
Not applicable
Author

I think you will have to choose between recreating the file with a header in the first row, dissecting the tFile components to find a way of modifying one to retrieve the schema from the last row (tricky, but certainly possible), or moving away from the dynamic schema idea.
The first option is the easiest, and with today's technology there is really no need to minimise HD I/O. I can understand the wish to be as efficient as possible, but Talend does not support your requirement without a serious amount of bespoke coding.
The second option will require a serious amount of learning (of how the components are built) and coding. This will be possible, but the component creation is not documented very well at all. I have written a couple of components and it has the advantage of teaching you how Talend works under the bonnet, but it is the most frustrating process I have come across with Talend (apart from using the MDM tool, but that is another story).
The third option is probably the easiest if you really do not want to recreate the file. As I said before, you can use a fixed schema of 100 or so String columns to read your files. Since you already keep data on the columns in another file, would it be too much trouble to store the column order as well? With the column order and the file name/type you can identify the columns and then process them accordingly in a tMap.
The problem you have is that the Dynamic Schema functionality is driven by column name. Since you do not have the column names in the original file, you either have to find another way of processing the file or rewrite the file with the column names. Personally I would go with creating a new version of the file, as that requires the least amount of work.