Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
drew61199
Contributor
Contributor

Compare two fields for "valid" or "invalid"

I'm building a utilization report and have two diff data sources. One is my dates/valid vehicles....the second contains the actual utilization, but since users manually type in veh numbers, we can have invalid entries (000 for example).

I cartesian join the valid vehicle numbers and dates from Source1 so that when I left join the utilization data, I can properly calculate non-utilization (also a reason for left join b/c I can't have invalid vehicles counting towards utilization).

I need to count the number of invalid entries though. I believe my only solution is a new table where I load both tables worth of vehicles and compare them via an IF statement?! I have a feeling this may not be the most efficient (or possibly totally incorrect method) b/c QLIK times out on my load. Keep in mind the valid/dates table has 5.5M lines and the utilization is just under 3M. I thought loading distinct may help (should cut first table down to 220k or so as the bulk is due to number of days loaded), but I do need unique entries on utilization so I can showcase which facility and days need to be corrected. Hope all of this makes sense

[VEHICLE_CHECK]:
LOAD
DISTINCT
facility,
eventenddate,
vehiclenumber
Resident xTABLE2;

JOIN
LOAD
DISTINCT
vehicle
Resident xTABLE1;

LOAD
If(vehiclenumber = vehicle, 'Valid','Not Valid') as Vehicle_Check;

 

Labels (1)
3 Replies
Digvijay_Singh

Not completely understood the data model, which table has valid vehicle? Which are the fact table and dimension table?

I feel WHERE EXISTS or NOT EXISTS can help you to identify wrong entries but still not sure the right place to use it.

drew61199
Contributor
Contributor
Author

Two SQL queries from two separate sources. 

"Table1" is valid vehicle and valid date

"Table2" is fact / dimension table of utilization (which can include a myriad of errors). This only includes entries, so if a vehicle isn't used on 8/15, then no entry exists. 

My load statement is basically a cartesian join of valid vehicles and dates. I then left join utilization. Doing this allows me to count the non-utilization, but it also throws out invalid entries. I'd like to quantify those too. I assume another table is needed, but also trying to keep the script efficient as we're discussing millions of rows / month

Digvijay_Singh

Just an idea if its feasible - 

Create a mapping table having two fields Vehicle_Key, Vehicle Number

vehicle_map:

Mapping load distinct

Vehicle_Key

Vehicle Number

Resident ValidVehicletable;

 

and then while loading fact table use applymap to create a flag

[VEHICLE_CHECK]:
LOAD
DISTINCT
facility,
eventenddate,
vehiclenumber,

if(applymap('vehicle_map', vehiclenumber,'Invalid')='Invalid',0,1)  AS validIndicator
Resident xTABLE2

Use indicator in expressions to discard invalid vehicles in Utilization or count invalid vehicles

Thanks,