I have about 85,177,324 records in a Qlik in-memory table and need to assign a row number to each record. How can I do this quickly? The data set is concatenated into one table from various QVDs, and now each record needs a row number. RowNo() is taking a lot of time, and the extract takes about an hour to load.
Is there an optimised way to do this?
Data as of now:
Name | Address | pincode |
abc | aaa | 123 |
xyz | sss | 3432 |
pqr | fff | 4324 |
. | . | . |
. | . | . |
. | . | . |
Output
Rowno | Name | Address | pincode |
1 | abc | aaa | 123 |
2 | xyz | sss | 3432 |
3 | pqr | fff | 4324 |
. | . | . | . |
. | . | . | . |
85,177,324 | . | . | . |
If I use RowNo() while reading the data from each QVD, it makes the load unoptimised. Is there an optimised way to do this, considering the large data set?
I can't do this while creating the QVDs, because there are various QVDs from various sources; I will be reading the data accordingly and then appending RowNo() or RecNo(). So it has to happen here, so that the numbering runs in sequence across the different QVDs, e.g.:
1 to 100 for QVD 1
101 to 200 for QVD 2, and so on.
We can't do this when creating the QVDs because we won't know in advance how many records will be in the 1st and 2nd files.
So the plan is to assign a record or row number at the stage where we aggregate all 5 QVDs.
In this case especially, won't your intended approach fail to give you sensible tracking of the records? You will get a sequential number, but you couldn't identify from it where a record comes from - at least not from this information alone.
This means you need more and other information, such as a hard-coded 'xyz' as SOURCE, and/or FileBaseName() as FILE to track the file name, and/or SubField(FileName(), '\', -2) as FOLDER to fetch more hierarchical detail, and/or a RecNo()/RowNo() - everything you might need should be added when creating the QVDs.
- Marcus
Source detail information is already included in each respective QVD, so we can track that way to identify where the data is coming from.
@rwunderlich Could you please post your code that loads millions of records in seconds? Am I missing something here?
My Code..
Data2:
NoConcatenate
LOAD *,RecNo() AS Sequence Resident Data;
Drop Table Data;
Rename table Data2 to Data;
STORE Data INTO D:\Qlikview\Output\Data.QVD (QVD);
DROP TABLE Data;
Call RecordLoadTime('Data');
This loads only thousands of records per second: 12 minutes for each block, and at the end the entire extract takes approx. 60 minutes to load data from 5 different sources.
Honestly it will be a great help for us 🙂
Here is my test code for 10M rows.
data:
LOAD
'A' & div(RecNo(),2) as A
,'A' & div(RecNo(),5) as B
,'A' & div(RecNo(),10) as C
,'A' & div(RecNo(),100) as D
,'A' & div(RecNo(),500) as E
,'A' & div(RecNo(),1000) as F
AutoGenerate 1e7
;
Store data into data.qvd (qvd);
Drop Table data;
data:
Load * From data.qvd (qvd);
data2:
NoConcatenate
LOAD *, RecNo() as RecId Resident data;
Drop Table data;
Rename table data2 to data;
On a 4 core 32GB laptop here's the timing.
1st Load: 9 seconds
2nd Load: 2 seconds
3rd Load (recno): 5 seconds
I'm curious to hear results for this same script on your system.
-Rob
If you can already track the records, what is the purpose of having another record ID?
Besides this, I can confirm the load times from Rob's example in a similar environment of 4 cores and 16 GB RAM, with each statement taking about 1 second longer (a single trial, without watching other processes/resources).
- Marcus
For millions of rows and a lot of columns, a resident load is not always a good option, because it needs much more memory.
Take two QVDs, each with 48.7M rows and 157 columns.
Example 1:
A:
Load
*
from "QVD 1.qvd" (qvd);
Concatenate(A)
Load
*
from "QVD 2.qvd" (qvd);
B:
load
RecNo() as RowNo,
*
Resident A;
drop table A;
Runtime: 8:33
Peak working set: 44.7GB
Example 2:
A:
Load
RecNo() as RowNo,
*
from "QVD 1.qvd" (qvd);
let nor=NoOfRows('A');
Concatenate(A)
Load
$(nor)+RecNo() as RowNo,
*
from "QVD 2.qvd" (qvd);
Runtime: 6:44
Peak working set: 26.4GB
So it depends on how much memory is available and how fast that memory is.
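Example 2 generalizes to any number of QVDs with a loop. A minimal sketch, assuming the files share the same structure and sit in one folder (the wildcard path and variable names are placeholders):

// Sketch: sequential RowNo across N QVDs without a resident reload.
Set vOffset = 0;
For Each vFile in FileList('D:\QVDs\Source*.qvd') // hypothetical folder
A:
LOAD
$(vOffset) + RecNo() as RowNo,
*
From [$(vFile)] (qvd);
Let vOffset = NoOfRows('A');
Next vFile

After the first iteration the identical field lists make the loads auto-concatenate into A, and NoOfRows('A') supplies the running offset for the next file. Note that the calculated RowNo field still makes each QVD load unoptimised, as discussed above.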
Hey,
Scenario 1:
I tried your code in a new application and it completed within a few seconds. This was awesome.
But when I replicate a similar pattern of code in my own application, it is time-consuming.
What can the issue be?
Scenario 2:
I have also observed that if I read my final QVD in your code, it works fine for me, but in my application it does not work in an optimised fashion.
😞
Scenario 3:
If I create a new tab, add your code to it, and read data from my final QVD instead of the data QVD, then again it takes time to append RecNo().
Server configuration:
20 cores, 512 GB RAM
Only if you load a single table within a script do you load just THIS table. If there are more loads, they may affect each other.
This is related to the way the data are stored: each field's distinct values are kept in their own symbol table, and the data table holds bit-stuffed pointers into those symbol tables. Each further load therefore checks whether a field already exists; if it does, the load runs through that symbol table to check whether each value already exists. If not, it makes a new entry there and adds the pointer to the data table; if the value exists, it reuses the existing pointer.
Therefore the order in which the loads are executed is not meaningless. In most scenarios it won't be relevant or even noticed, but if you hit a time or resource threshold, it may be worth looking at the matter from this point of view.
Also related to this data handling are measures like loading DISTINCT, or setting null variables or system variables like Verbatim, which may force extra processing of the data.
So I suggest reviewing your code with this in mind, and also considering whether it could be divided into multiple tasks/applications. Of course that last measure increases overhead, but smaller tasks are easier to develop and maintain, and they can generally run in parallel and/or in other time frames.
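For reference, the conditions that keep a QVD load optimised are quite narrow. A sketch of the typical cases (the file name is a placeholder):

// Optimised: a plain field list (or *), optionally with field renames,
// or a single Where Exists(Field) on one field.
Data:
LOAD * From [data.qvd] (qvd);

// Any of the following switch the load to unoptimised mode:
// - a calculated field, e.g. RecNo() as RowId or Upper(Name) as Name
// - any Where clause other than a single-field Where Exists(Field)
// - LOAD Distinct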
Besides all this, I recommend again considering whether this RecNo() is really expedient - a quite good explanation of what it leads to can be found here:
The Importance Of Being Distinct - Qlik Community - 1466796
In short, you should have a good purpose before adding such fields to your data.
- Marcus
This is very interesting. My first guess is that your application is using a lot of RAM during the reload and you are paging during the resident load. Although your server has a lot of RAM, can you observe how much the load is using, either by watching Task Manager or checking the logs? I'm guessing that server is also used for QVS (QlikView users). What is the total memory usage on the server like?
Can you post a screenshot of the data model in your app?
-Rob
The server is dedicated to QlikView reload tasks/QMC. There are many extracts which run on a daily basis.
Usage is on the higher side.
I have managed to use your code and it has worked for me.
What I have done:
In my extract I load data from the various sources in optimised mode, without generating the RecNo() as RecId column.
(This follows the comment by @marcus_sommer: "Only if you load a single table within a script you will load THIS table. If there are more loadings the loads may effect on each other.")
Then, in the dashboard that follows this extract, I perform a binary load and add the block of code you suggested, deriving the RecId column as advised.
This block runs only once per dashboard reload. It's pretty quick and completes in a few minutes, though not in seconds.
One more question: is there any chance of missing records if we use RecNo() instead of RowNo()? I observed that RowNo() is very slow.
Does RecNo() guarantee that RecId contains sequential integers?
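On the RecNo()/RowNo() question: RecNo() numbers the records read from the source, while RowNo() numbers the rows written to the output table. In a plain resident load with no Where clause the two match, so no records are missed; with a Where clause, RecNo() leaves gaps while RowNo() stays sequential. A minimal sketch (table and field names are illustrative):

Src:
LOAD RecNo() as N AutoGenerate 5;

T:
NoConcatenate
LOAD
RecNo() as Rec, // input record number: 1, 2, 4, 5 (gap at 3)
RowNo() as Row // output row number: 1, 2, 3, 4
Resident Src
Where N <> 3;
Drop Table Src;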
My Code -
Binary [Report.qvw];
Data2:
NoConcatenate
LOAD *, RecNo() as Sequence Resident Data; // Data table is from binary load
Drop Table Data;
Rename table Data2 to Data;
What do you think @rwunderlich