Hi guys,
I need your help, if there is any way to handle this.
The situation is as follows:
I have a csv file as data source which gets updated multiple times a day. The csv file is semicolon-separated and always contains a fixed set of 20 columns. One column is a description. About twice a week it happens that in 2-5 rows the description contains a semicolon within the text.
As I want to automate the whole data loading process, you see the problem: the affected rows get displaced because of the extra semicolon in the text. Is there any way to avoid errors when this happens? My first idea was to have the source export the csv file with a different separator, but unfortunately that is not possible.
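To illustrate with made-up columns (not the real file), a normal row and a broken one could look like this:
ID;Name;Description;Amount
1;Foo;some harmless text;100
2;Bar;text with; a stray semicolon;200
The second data row has five fields instead of four, so everything after the description shifts one column to the right.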
Is there any way to overcome this?
Thanks guys!
Greetings,
Seb
Hi guys,
I just found out that none of your scripts loads my input file completely; I noticed it when I tried Stefan's script.
The source file has 1.3 million rows. With Marco's script I get just 13k rows and with Stefan's script only around 8k.
If I use smaller sample files like 10 rows both scripts load 10 rows.
What is the reason for that? The rows that are loaded look correct, but there are far too few of them.
I do not have a sample file with this number of rows that I could share with you.
Best regards,
Seb
Sebastian,
if you get issues when loading the data, I would stick to Marco's solution, since mine has an extra aggregation step that might cause additional issues with large data sets.
Marco's solution also included an ID field. I would first check whether this ID field shows sequential integers, and what the last number is. This should be the line where the load got stuck.
I would also suggest adding a line to the script after the first table load, so we can look at each loading stage one after the other:
Exit Script;
Look at your source file. Is there anything special in the line where the QV LOAD got stuck? There are sometimes issues with special characters, like an EOF character.
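Just as a sketch of what I mean with the Exit Script line (table and file names are only placeholders, use the ones from your script):
table1:
LOAD [@1:n] as LINE
FROM [YourSourceFile.csv] (fix, codepage is 1252);

Exit Script; // stop here and check how many rows made it into table1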
Hi Stefan,
I checked it and found this in the row after the last ID (in the field description):
[Screenshots: the file opened in VIM and in Notepad++]
I think QlikView recognizes it as EOF. If I just load the whole thing without the check and with the "no eof" parameter, it loads everything.
Normally the character is an em dash, but QlikView recognizes it as EOF.
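The "no eof" option just goes into the format spec, roughly like this (file name and the other options are placeholders for my real ones):
LOAD *
FROM [MySourceFile.csv]
(txt, codepage is 1252, embedded labels, delimiter is ';', msq, no eof);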
Stefan, why is your script not good for large data sets, and where is the aggregation step? Your script is insanely fast and it would be great to get it working with a large data set, because I am afraid that Marco's script will take pretty long with 1.3 million rows. For smaller data sets it is fast enough, but it takes about 1 minute per 10k rows.
Sebastian,
how does this version perform? Does it load all records? Is it fast enough?
You may need to carry over your EOF-filtering changes in the format spec section, as you did in the last version.
edit:
If not all records are loaded, you can play with this variable
Let vNumRowsAgg = 5000;
and lower the number. This variable determines the number of lines from the original file that are fed into one aggregated LINE.
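Roughly, the idea behind the variable looks like this (just the principle, not the attached script itself; field names, the delimiter and the file handling are placeholders):

Let vNumRowsAgg = 5000;

// every vNumRowsAgg raw source lines get the same BlockID and are concatenated into one LINE value
tmpAgg:
LOAD BlockID,
Concat(LineText, chr(10), LineNo) as LINE
Group By BlockID;
LOAD RecNo() as LineNo,
Ceil(RecNo()/$(vNumRowsAgg)) as BlockID,
[@1:n] as LineText
FROM [QlikCommunity_Thread_202456_v3.csv] (fix, codepage is 1252)
Where RecNo()>1;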
One lesson learned from this story: avoid data exchange via .csv at all costs if it proves unreliable in the slightest way. Just when you think your whole setup handles every case encountered so far and your document is published to end users, the data exchange may unexpectedly break down again and send you looking for the next anomaly. A nice one is newline characters in the middle of text fields.
Better methods of data exchange are binary files or direct DB connections. At least they will handle weird text strings without any unexpected problems.
Peter
Hi Stefan,
thanks for this one. It performs better but does not load all records yet.
With 5000 it's super fast, about 3 minutes, but loads only 800k rows. With 1000 I got 1.22 million rows in 6 minutes (also fast enough). With 500 I got 1.28 million rows in around 15 minutes. I am testing right now with 10 to see whether I get the whole 1.38 million rows, but that will probably be too slow.
How does the aggregation work? Does it take 5000 rows at once and put them into one row of LINE? And why is there such a difference in the number of loaded rows?
I got one idea yesterday when I noticed that there are quite a few duplicate rows. I checked it with Power Query in Excel: I loaded the whole file unseparated, so 1 complete line = 1 value, and then let Power Query delete the duplicates; of the 1.3 million rows only about 560k were left. Is this something we can use to optimize the script? I mean, we could load distinct to get fewer rows to check and load, right?
Thanks again for your help!
Fully agree with you, Peter. I would also prefer a direct DB connection, but in this case the csv is unfortunately the only source I can get this data from. So I have to handle it somehow.
Exactly: if you set vNumRowsAgg to 5000, you concatenate 5000 lines of your source file into one record in the QV data model.
The FROM_FIELD LOAD is not only looking at a single field, it also performs a table load per record. That is what slows down Marco's approach: the overhead of 1.3 million table loads, each producing a single line and concatenating it to the previously loaded table.
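Just to show the shape of it, a FROM_FIELD load reads the content of a field as if it were a file (table and field names are placeholders, the format spec depends on your data):
result:
LOAD *
FROM_FIELD (tmpAgg, LINE)
(txt, codepage is 1252, no labels, delimiter is ';', msq);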
I must admit I have no clear idea how many lines you can concatenate before you get issues; it probably also depends on the length of the lines. Are the lines similar in character length or do they differ a lot?
I would also have assumed that there should not be such a gradual increase in the number of lines read successfully, rather a certain threshold that either works or does not. But that's just my gut feeling.
I also thought about reading your file in using Marco's approach (table1), then storing the whole thing into a text file and reading that in using standard means and embedded labels.
It should work - as soon as I manage to tell QV not to add quotes, which currently makes it really hard to read the new file back in flawlessly. Since this feature request (no quoting on STORE) has been open for years, I don't think we can expect it soon.
Regarding your idea: If you can live with a DISTINCT table load, just give it a try.
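Something along these lines, assuming you load the whole line as one value anyway (file name as in the sample file from this thread):
table1:
LOAD Distinct [@1:n] as LINE
FROM [QlikCommunity_Thread_202456_v3.csv] (fix, codepage is 1252)
Where RecNo()>1;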
Maybe it's an issue with the table1 size / symbol table size.
Remember that you are loading more than 1 million lines; if we merge these lines into fewer LINE records, QV probably cannot make use of the distinctness of the values, i.e. QV needs to store the full amount of characters in the symbol table.
BTW: Do you get an error message when the load stops below the 1.3 million records limit? Or does it just silently fail?
In the attached v7 of the load script, I load the original text file only partially, vNumRowsAgg lines per iteration.
Then I do the FROM_FIELD LOAD, then drop table1, and loop.
Maybe this has a positive effect on the stability of loading your complete text file?
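Very roughly, the structure looks like this (a simplified sketch, not the attached v7 itself; the FROM_FIELD format spec and the header handling stand in for whatever the real script does):

Let vNumRowsAgg = 5000;
Let vChunk = 0;
Do
    // load the next block of vNumRowsAgg raw lines (RecNo 1 is the header row)
    table1:
    LOAD [@1:n] as LINE
    FROM [QlikCommunity_Thread_202456_v3.csv] (fix, codepage is 1252)
    Where RecNo() > 1 + $(vChunk) * $(vNumRowsAgg)
      and RecNo() <= 1 + ($(vChunk) + 1) * $(vNumRowsAgg);

    Let vRowsInBlock = Alt(NoOfRows('table1'), 0);

    If $(vRowsInBlock) > 0 then
        // parse this block; identical field names auto-concatenate into one result table
        result:
        LOAD *
        FROM_FIELD (table1, LINE)
        (txt, codepage is 1252, no labels, delimiter is ';', msq);
        DROP Table table1;
    End If

    Let vChunk = $(vChunk) + 1;
Loop while $(vRowsInBlock) > 0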
Hi,
another approach might be:
// defining the number of the column that might contain semicolons
LET vColumnWithSemicolon = 5;
LET vColumns = 20;
// load each semicolon-separated subfield and concatenate those that belong to the same field
table1:
Generic
LOAD ID,
'@'&FieldNum,
Concat(SubString,';',Seq)
Group By ID, FieldNum;
LOAD RecNo() as ID,
IterNo() as Seq,
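// map extra subfields (caused by semicolons inside the text) back onto the semicolon column and shift the following columns back accordingly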
IterNo()-RangeMin(RangeMax(IterNo()-$(vColumnWithSemicolon),0),Fields-$(vColumns)) as FieldNum,
SubField(line,';',IterNo()) as SubString
While IterNo()<=Fields;
LOAD [@1:n] as line,
SubStringCount([@1:n],';')+1 as Fields
FROM [QlikCommunity_Thread_202456_v3.csv] (fix, codepage is 1252)
Where RecNo()>1;
// create temporary table having columns for old and new field names (header row of csv file + ID field)
tabFieldNames:
CrossTable (oldFieldName,newFieldName)
LOAD 1,*
FROM [QlikCommunity_Thread_202456_v3.csv] (txt, codepage is 1252, no labels, delimiter is ';', msq)
Where RecNo()=1;
// create mapping table to translate default field names (@1 to @n) to column headers of csv file
mapFieldNames:
Mapping LOAD oldFieldName, newFieldName
Resident tabFieldNames;
// drop temporary tables
DROP Tables tabFieldNames;
// rename fields using the previously generated mapping table
RENAME Fields using mapFieldNames;
hope this helps
regards
Marco