I'm rewriting a process and need to compare the results of the new process with those of the current one. The process outputs a table with tens of millions of rows and 100 fields. Most of the fields are text, with a few timestamps and some numbers. The primary key is a concatenation of 5 fields.
I need to identify:
1 - rows that are in the original table and not in the new process table
2 - rows that are in the new process table and not the original table
3 - rows with the same key but different contents in the non-key fields, and what the differences are.
Has anyone done something similar and would be willing to share an efficient methodology and code?
Thanks,
Barry
Well, hopefully, since the flags were defined using a boolean expression, they'll take very little memory. But looking at the .mem file, it seems they take 4 bytes each. So what if we use -1/null instead of -1/0? Now we're talking. Assuming differences are sparse, you won't have nearly as many values established, and they look like one byte each if I'm reading the .mem file correctly. Example updated.
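For anyone reading along without the attachment, the shape of it is roughly this (table and field names are made up here; the real thing is in the example):

// Sparse difference flag: -1 when the values differ, Null() otherwise.
// Nulls are never stored as field values, so with sparse differences
// hardly any flag values get established.
Combined:
LOAD * INLINE [
Key, OldValue, NewValue
1, A, A
2, B, C
3, D, D
];

Left Join (Combined)
LOAD Key,
     If(OldValue <> NewValue, -1, Null()) as DiffFlag
RESIDENT Combined;

In a chart you can then test DiffFlag directly instead of re-comparing the text fields.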
Wow, that will be really heavy. If your table is that wide and has that many text fields, there is just no getting around that. So if you can drop some of the fields out, you will improve performance. Also, the shorter you make your key, the better.
Given how large the data is, it would be a bad idea to try to do this in the front end. I would recommend using the exists() function in the script to set flags and then using set analysis in the front end to check the flags.
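Roughly along these lines (table and field names are just placeholders for your real ones):

// Keys of the original table, loaded first.
OldKeys:
LOAD * INLINE [
OldKey
1
2
3
];

// Flag the new-table rows whose key is missing from the original;
// swap the roles of the two loads to flag keys that exist only in the original.
NewKeys:
LOAD NewKey,
     If(Exists(OldKey, NewKey), 0, -1) as OnlyInNew;
LOAD * INLINE [
NewKey
2
3
4
];

In the front end, a set analysis expression such as Count({<OnlyInNew={-1}>} NewKey) then picks out the new-only rows without any row-by-row comparison at chart time.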
Regards,
Hi Vlad,
Thanks for your quick reply. Heavy processing indeed. I made the key using autonumber, so that should make it as small as possible. It's easy on the back end to determine if a key exists in one data set and not the other, but it's much tougher to compare each of the fields to see if they are equal.
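For reference, the key is built more or less like this (field names invented):

// AutoNumber turns the concatenated 5-field key into a compact integer;
// the '|' separator keeps adjacent field values from colliding.
Keys:
LOAD *,
     AutoNumber(F1 & '|' & F2 & '|' & F3 & '|' & F4 & '|' & F5) as [%Key];
LOAD * INLINE [
F1, F2, F3, F4, F5
A, B, C, D, E
];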
Basically, I have two tables, an old and a new, joined by a common key. Besides the key, the field names are all qualified with the table name. In my comparison chart I have an expression like this for each field, and in the chart I can sort by the value to show which keys have a discrepancy.
if(old.field <> new.field, 1, 0)
Do you think I can do this in the back end, and then just put the results in the chart?
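What I'm imagining is something along these lines (invented names, and only two of the 100 fields shown):

// Old and new versions sit side by side per key; the per-field comparisons
// are done once in the script instead of in chart expressions.
Combined:
LOAD * INLINE [
%Key, old.Field1, new.Field1, old.Field2, new.Field2
1, A, A, X, X
2, B, C, X, Y
];

Left Join (Combined)
LOAD [%Key],
     If([old.Field1] <> [new.Field1], 1, 0) as [Diff.Field1],
     If([old.Field2] <> [new.Field2], 1, 0) as [Diff.Field2]
RESIDENT Combined;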
Thanks,
Barry
I think the attached may help you, though I don't guarantee high performance. In the example, we're comparing the "same" records across three different months, but it should be easy to see how this could be the "same" records across two different tables. When the record is in one month but not another, the key field (OrderItem) and the month are highlighted. When the record is in all of the months but there are differences between the records, the fields that are different are highlighted.
And here's an example that may be closer to what you're asking for, but I'm posting it second because I think the other one is a little more elegant and extensible. But this one is specifically comparing two different tables. It is also, for performance purposes, throwing out any rows that are the same in both tables.
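For anyone without the attachment, the gist is something like this (names invented; the real key is 5 fields and there are ~100 columns, so the strung-together row gets very long):

// Signature of every full row in the original table; only the signature is kept.
OldSigs:
LOAD Key & '|' & Field1 & '|' & Field2 as RowSig;
LOAD * INLINE [
Key, Field1, Field2
1, A, X
2, B, Y
];

// Keep only the new-table rows whose full-row signature never occurs in the
// original, i.e. rows that are new or changed. Run the same load the other
// way around to catch rows that disappeared.
NewDiffs:
LOAD * INLINE [
Key, Field1, Field2
1, A, X
2, B, Z
]
WHERE Not Exists(RowSig, Key & '|' & Field1 & '|' & Field2);

DROP TABLE OldSigs;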
Hmmm, yeah, Vlad's right. For tables that big, you want to do as much processing in the script as possible. Here's a version that tries to do that on top of tossing out records that match.
Hi John,
Thanks for your help. I agree that the first one is more elegant, but the second one does have the advantage of throwing away the identical rows, which I expect will be over 98%, reducing the size of the QlikView. I've been resisting concatenating all the fields together because the string of 100 fields is so huge, but I just may have to do it.
Thanks again,
Barry
Barry,
Yes, something like that could work. I also like John's last posted example. Keep in mind that there are always tradeoffs: you will improve front-end performance significantly by doing your flagging in the script, but you will increase load time by a lot, because there will be tens of millions of IF statements checking text fields, and that's always heavy. You might want to try the equivalent with a WHERE clause, which should improve speed because not all records will be loaded.
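As a rough sketch of what I mean (placeholder names, and only one field shown):

Combined:
LOAD * INLINE [
%Key, old.Field1, new.Field1
1, A, A
2, B, C
];

// The WHERE clause keeps only the rows with a difference, so matching rows
// are never stored at all instead of being flagged one by one with If().
Differences:
NoConcatenate
LOAD *
RESIDENT Combined
WHERE [old.Field1] <> [new.Field1];

DROP TABLE Combined;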
Regards,
Well, in that particular case, the strung-together field is only temporary. But it could still cause serious memory problems DURING the load. You could move that part down into the inner join, so that the concatenation is built and then discarded row by row instead of being held for the entire table at once. That's just a better solution, period. So here's the new example.
Edit: Wait, there's no reason for the left join at the end. You can save a lot of load time by setting and checking the flags during the inner join, which incidentally keeps you from needing to string together all the fields at all. Example updated.
Hi John,
I also thought of building indicators by field for where the differences are. I'll try both approaches over the weekend to see which is faster for the user, because there will be double the number of fields even though the underlying expression is simpler. I'll let you know which works better in this situation.
You guys are super.
Thanks for your help.
Barry