Solved: Visualize disparate Data from SQL Sources - Qlik Community

Report Inappropriate Content · ‎2015-11-27

We need to compare back-end SQL data from different sources.

Let's say we have an ETL or Operational process.

It processes and modifies data (in mlns).

99.9% of the data is the same.

We need to identify the "Delta" (missing, new, redundant and modified data).

Data is coming from different sources (different Servers etc).

That is why we can not use full outer join and where not exists in SQL.

We have to do similar functionality programmatically in QLV Load.

We have one SOURCE of data (=CORRECT) data.

Lets call it: "SOURCE".

We have a second source (or ETL target) = SUSPECT data.

Lets call it: "STORE SOURCE_STRANGE_DATA".

See data scaffolding:

in SQL Server 2008R2 DB AdventureWorks2008R2

New view created: [Person].[vStateProvinceCountryRegion_STRANGE_DATA]

See Att.1.

It reproduces all rows from the underlying view: vStateProvinceCountryRegion

correctly except the following 4 rows ("STORE SOURCE_STRANGE_DATA"):

1) Misspelt existing row: '<MISSPELT>Ontario'

2) Existing (correct) row duplicated: Alberta duplicated

3) Missing row: Quebec is missing

4) NEW row: <NEWPROVINCE> (StateProvinceID=999, i.e. new key identication)

See CORRECT list of provinces:

AB	Alberta	CA
BC	British Columbia	CA
NB	Brunswick	CA
LB	Labrador	CA
MB	Manitoba	CA
NF	Newfoundland	CA
NT	Northwest Territories	CA
NS	Nova Scotia	CA
ON	Ontario	CA
PE	Prince Edward Island	CA
QC	Quebec	CA
SK	Saskatchewan	CA
YT	Yukon Territory	CA

See Suspect List of Provinces ("STORE SOURCE_STRANGE_DATA".):

(Quebec is missing from the list):

ON	<MISSPELT>Ontario	CA
NN	<NEWPROVINCE>	CA
AB	Alberta	CA
AB	Alberta	CA
BC	British Columbia	CA
NB	Brunswick	CA
LB	Labrador	CA
MB	Manitoba	CA
NF	Newfoundland	CA
NT	Northwest Territories	CA
NS	Nova Scotia	CA
PE	Prince Edward Island	CA
SK	Saskatchewan	CA
YT	Yukon Territory	CA

Functional purpose:

Identify all 4 scenarios, listed above.

1) Keep both sources intact (putting them in the separate tab).

2) Programmatically identify "Delta" (any row that it New/Missing/Changed/Duplicate).

List of columns (used for comparison) should be easily modified, assume not reliable composite key.

I have come close to the solution, but still it is pretty bulky.

This solution should be short and robust and may be (auto-generated) code for each of the hundreds of the table (being compared).

See the script: (it uses separate DSNs for sources, assuming different servers/connections):

ODBC CONNECT TO UserDSN;

SOURCE:

LOAD

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

StateProvinceID

,StateProvinceCode

,IsOnlyStateProvinceFlag

,StateProvinceName

,TerritoryID

,CountryRegionCode

,CountryRegionName;

SQL SELECT *

// correct data:

from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion];

STORE SOURCE INTO C:\temp\SOURCE.QVD (qvd);

// ----------------------------------------------------------------------------------------------------------------

QUALIFY *;

UNQUALIFY Autonumbered_Hash;

ODBC CONNECT TO UserDSN_STRANGE_DATA;

SOURCE_STRANGE_DATA:

// the only way to see(load) new, redundant, chnaged values is right keep (but MISSING values DISSAPEAR from source!)

RIGHT KEEP

LOAD AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

// want to keep it separately for dirty flag detection:

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,

StateProvinceID

,StateProvinceCode

,IsOnlyStateProvinceFlag

,StateProvinceName

,TerritoryID

,CountryRegionCode

,CountryRegionName;

SQL SELECT *

// data for comparison:

from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];

STORE SOURCE_STRANGE_DATA INTO C:\temp\SOURCE_STRANGE_DATA.QVD (qvd);

QUALIFY *;

UNQUALIFY Autonumbered_Hash;

DELTA:

RIGHT KEEP

LOAD

(-1)* (IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE'))) as Delta_FLAG

,*

FROM C:\temp\SOURCE_STRANGE_DATA.QVD (qvd)

WHERE 1=1

//and IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE')) <> 0

;

// end script

Lets see the results:

DELTA.Delta_FLAG	DELTA.SOURCE_STRANGE_DATA.StateProvinceCode	DELTA.SOURCE_STRANGE_DATA.StateProvinceName	DELTA.SOURCE_STRANGE_DATA.CountryRegionName
1	NN	<NEWPROVINCE>	Canada
1	ON	<MISSPELT>Ontario	Canada
0	AB	Alberta	Canada
0	BC	British Columbia	Canada
0	LB	Labrador	Canada
0	MB	Manitoba	Canada
0	NB	Brunswick	Canada
0	NF	Newfoundland	Canada
0	NS	Nova Scotia	Canada
0	NT	Northwest Territories	Canada
0	PE	Prince Edward Island	Canada
0	SK	Saskatchewan	Canada
0	YT	Yukon Territory	Canada

By exposing "Delta_Flag" as a control in QLV, we can identify the modified and new data.

What is incorrect:

1) Quebec (missing row) is NOT shown and not identified.

Moreover, the right keep removed it from the source.

We need the functional analogue of the FULL OUTER KEEP.

How to reach it?

2) Duplicate row not shown (Alberta).

This is easy however, with a separate QVW list box control (expression showing non-distinct count).

Alberta	2
British Columbia	1
Brunswick	1
Labrador	1
Manitoba	1
Newfoundland	1
Northwest Territories	1
Nova Scotia	1
Prince Edward Island	1
Saskatchewan	1
Yukon Territory	1

How to make it (duplicate row) a programmatic flag in script?

3) How to make the script smaller and robust (business want to visualize both sources intact separately, and delta separately though).

4) Sorry, I do not know how to use LOAD ... KEEP so NOT to affect the previously uploaded table at all (like full outer keep) in any way, see on example Quebec dissapeared from the source above (using right keep).

Left keep would discard the new rows in second (suspect) source though.

We need to keep both.

Source: (Quebec was filtered out by right keep):

Alberta

British Columbia

Brunswick

Labrador

Manitoba

Newfoundland

Northwest Territories

Nova Scotia

Prince Edward Island

Saskatchewan

Yukon Territory

Can anybody show in detail how to combine left and right keep so not to affect the previously loaded table?

thank you in advance,

swuehl · ‎2015-11-28

This might give you an idea how to start (or a possible start):

Re: How to compare 2 tables and put non-matching results?

This should handle records existing only in one table (i.e. missing / new in second table) and changed / modified values.

I am not sure how it handles duplicate rows, but as you said, it should be easy finding these with a list box.

View solution in original post

swuehl · ‎2015-11-28

This might give you an idea how to start (or a possible start):

Re: How to compare 2 tables and put non-matching results?

This should handle records existing only in one table (i.e. missing / new in second table) and changed / modified values.

I am not sure how it handles duplicate rows, but as you said, it should be easy finding these with a list box.

Report Inappropriate Content · ‎2015-12-14

Thank you so much for the reference/code.

I have used this code for comparison of sources.

See Script1 attached for reference.

First we load source (1.1) (trusted data) and suspect data (1.2)

There is NO joins between them:

QUALIFY *;

After that we allow the join:

UNQUALIFY *;

and formulate (1.3) the Delta (difference) between sources programmatically:

We allow QVW engine to join by the composite key (i.e. turning the composite key into a surrogate key):

KEY_JOINED

See this part of the script:

//(1.3) load Source Formulate the difference between the sources programmatically:

DELTA:

Concatenate (SOURCE)

LOAD

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,

AutoNumberHash256 (StateProvinceID, TerritoryID) as KEY_JOINED,

StateProvinceID

,StateProvinceCode

,IsOnlyStateProvinceFlag

,StateProvinceName

,TerritoryID

,CountryRegionCode

,CountryRegionName

,'DELTA' as Table;

SQL SELECT *

from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];

INNER JOIN (SOURCE)

LOAD *

WHERE

"Only in One Table?"

OR "Hash Different?"

OR "Duplicate or Not Match?"

;

LOAD

KEY_JOINED

// (1.4) formulating the 3 data difference scenarios:

,if(count(KEY_JOINED)=1,-1, 0) as "Only in One Table?"

,if(maxstring(Autonumbered_Hash)<>minstring(Autonumbered_Hash),-1,0) as "Hash Different?"

,if(count(KEY_JOINED) <> 2, -1, 0 ) as "Duplicate or Not Match?"

RESIDENT SOURCE

GROUP BY KEY_JOINED

;

See the resulting log:

Connecting to UserDSN

Connected

SOURCE << [vStateProvinceCountryRegion] 181 lines fetched

Connecting to UserDSN_STRANGE_DATA

Connected

SOURCE_STRANGE_DATA << [vStateProvinceCountryRegion_STRANGE_DATA] 182 lines fetched

SOURCE << [vStateProvinceCountryRegion_STRANGE_DATA] 363 lines fetched

DELTA << SOURCE 4 lines fetched

See that only 4 rows populate the Delta recordset.

See that we exposed the source of each data as a column:

,'SOURCE' as Table; ...

,'DELTA' as Table;

See the end result:

there are 3 analytic flags, we can filter by them:

We have reached the goal:

identifying all the main scenarios:

1) Missing data (Quebec missing in suspect data)

2) new rows (NewProvince was missing in the source/correct data)

3) Misspelt rows (see Ontario misspelt, actually number of columns might be in hundreds, so it is very simplified example)

4) duplicate rows (see Alberta below)

The first column: "table" shows us the source of the data.

The only functional requirement that I did not reach was keeping both sources intact (because I had to concatenate two sources, source data now reduced only to delta/modified data).

Is there any way to make the script simpler?

Table	Only in One Table?	Hash Different?	Duplicate or Not Match?	CountryRegionCode	CountryRegionName	StateProvinceName
SOURCE	0	0	-1	CA	Canada	Alberta
DELTA	0	0	-1	CA	Canada	Alberta
SOURCE	0	-1	0	CA	Canada	Ontario
DELTA	0	-1	0	CA	Canada	<MISSPELT>Ontario
SOURCE	-1	0	-1	CA	Canada	Quebec
DELTA	-1	0	-1	CA	Canada	<NEWPROVINCE>