Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
We need to compare back-end SQL data from different sources.
Let's say we have an ETL or Operational process.
It processes and modifies data (in mlns).
99.9% of the data is the same.
We need to identify the "Delta" (missing, new, redundant and modified data).
Data is coming from different sources (different Servers etc).
That is why we can not use full outer join and where not exists in SQL.
We have to do similar functionality programmatically in QLV Load.
We have one SOURCE of data (=CORRECT) data.
Lets call it: "SOURCE".
We have a second source (or ETL target) = SUSPECT data.
Lets call it: "STORE SOURCE_STRANGE_DATA".
See data scaffolding:
in SQL Server 2008R2 DB AdventureWorks2008R2
New view created: [Person].[vStateProvinceCountryRegion_STRANGE_DATA]
See Att.1.
It reproduces all rows from the underlying view: vStateProvinceCountryRegion
correctly except the following 4 rows ("STORE SOURCE_STRANGE_DATA"):
1) Misspelt existing row: '<MISSPELT>Ontario'
2) Existing (correct) row duplicated: Alberta duplicated
3) Missing row: Quebec is missing
4) NEW row: <NEWPROVINCE> (StateProvinceID=999, i.e. new key identication)
See CORRECT list of provinces:
AB | Alberta | CA |
BC | British Columbia | CA |
NB | Brunswick | CA |
LB | Labrador | CA |
MB | Manitoba | CA |
NF | Newfoundland | CA |
NT | Northwest Territories | CA |
NS | Nova Scotia | CA |
ON | Ontario | CA |
PE | Prince Edward Island | CA |
QC | Quebec | CA |
SK | Saskatchewan | CA |
YT | Yukon Territory | CA |
See Suspect List of Provinces ("STORE SOURCE_STRANGE_DATA".):
(Quebec is missing from the list):
ON | <MISSPELT>Ontario | CA |
NN | <NEWPROVINCE> | CA |
AB | Alberta | CA |
AB | Alberta | CA |
BC | British Columbia | CA |
NB | Brunswick | CA |
LB | Labrador | CA |
MB | Manitoba | CA |
NF | Newfoundland | CA |
NT | Northwest Territories | CA |
NS | Nova Scotia | CA |
PE | Prince Edward Island | CA |
SK | Saskatchewan | CA |
YT | Yukon Territory | CA |
Functional purpose:
Identify all 4 scenarios, listed above.
1) Keep both sources intact (putting them in the separate tab).
2) Programmatically identify "Delta" (any row that it New/Missing/Changed/Duplicate).
List of columns (used for comparison) should be easily modified, assume not reliable composite key.
I have come close to the solution, but still it is pretty bulky.
This solution should be short and robust and may be (auto-generated) code for each of the hundreds of the table (being compared).
See the script: (it uses separate DSNs for sources, assuming different servers/connections):
ODBC CONNECT TO UserDSN;
SOURCE:
LOAD
AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,
StateProvinceID
,StateProvinceCode
,IsOnlyStateProvinceFlag
,StateProvinceName
,TerritoryID
,CountryRegionCode
,CountryRegionName;
SQL SELECT *
// correct data:
from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion];
STORE SOURCE INTO C:\temp\SOURCE.QVD (qvd);
// ----------------------------------------------------------------------------------------------------------------
QUALIFY *;
UNQUALIFY Autonumbered_Hash;
ODBC CONNECT TO UserDSN_STRANGE_DATA;
SOURCE_STRANGE_DATA:
// the only way to see(load) new, redundant, chnaged values is right keep (but MISSING values DISSAPEAR from source!)
RIGHT KEEP
LOAD AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,
// want to keep it separately for dirty flag detection:
AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,
StateProvinceID
,StateProvinceCode
,IsOnlyStateProvinceFlag
,StateProvinceName
,TerritoryID
,CountryRegionCode
,CountryRegionName;
SQL SELECT *
// data for comparison:
from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];
STORE SOURCE_STRANGE_DATA INTO C:\temp\SOURCE_STRANGE_DATA.QVD (qvd);
QUALIFY *;
UNQUALIFY Autonumbered_Hash;
DELTA:
RIGHT KEEP
LOAD
(-1)* (IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE'))) as Delta_FLAG
,*
FROM C:\temp\SOURCE_STRANGE_DATA.QVD (qvd)
WHERE 1=1
//and IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE')) <> 0
;
// end script
Lets see the results:
DELTA.Delta_FLAG | DELTA.SOURCE_STRANGE_DATA.StateProvinceCode | DELTA.SOURCE_STRANGE_DATA.StateProvinceName | DELTA.SOURCE_STRANGE_DATA.CountryRegionName |
1 | NN | <NEWPROVINCE> | Canada |
1 | ON | <MISSPELT>Ontario | Canada |
0 | AB | Alberta | Canada |
0 | BC | British Columbia | Canada |
0 | LB | Labrador | Canada |
0 | MB | Manitoba | Canada |
0 | NB | Brunswick | Canada |
0 | NF | Newfoundland | Canada |
0 | NS | Nova Scotia | Canada |
0 | NT | Northwest Territories | Canada |
0 | PE | Prince Edward Island | Canada |
0 | SK | Saskatchewan | Canada |
0 | YT | Yukon Territory | Canada |
By exposing "Delta_Flag" as a control in QLV, we can identify the modified and new data.
What is incorrect:
1) Quebec (missing row) is NOT shown and not identified.
Moreover, the right keep removed it from the source.
We need the functional analogue of the FULL OUTER KEEP.
How to reach it?
2) Duplicate row not shown (Alberta).
This is easy however, with a separate QVW list box control (expression showing non-distinct count).
Alberta | 2 |
British Columbia | 1 |
Brunswick | 1 |
Labrador | 1 |
Manitoba | 1 |
Newfoundland | 1 |
Northwest Territories | 1 |
Nova Scotia | 1 |
Prince Edward Island | 1 |
Saskatchewan | 1 |
Yukon Territory | 1 |
How to make it (duplicate row) a programmatic flag in script?
3) How to make the script smaller and robust (business want to visualize both sources intact separately, and delta separately though).
4) Sorry, I do not know how to use LOAD ... KEEP so NOT to affect the previously uploaded table at all (like full outer keep) in any way, see on example Quebec dissapeared from the source above (using right keep).
Left keep would discard the new rows in second (suspect) source though.
We need to keep both.
Source: (Quebec was filtered out by right keep):
Alberta |
British Columbia |
Brunswick |
Labrador |
Manitoba |
Newfoundland |
Northwest Territories |
Nova Scotia |
Prince Edward Island |
Saskatchewan |
Yukon Territory |
Can anybody show in detail how to combine left and right keep so not to affect the previously loaded table?
thank you in advance,
This might give you an idea how to start (or a possible start):
Re: How to compare 2 tables and put non-matching results?
This should handle records existing only in one table (i.e. missing / new in second table) and changed / modified values.
I am not sure how it handles duplicate rows, but as you said, it should be easy finding these with a list box.
This might give you an idea how to start (or a possible start):
Re: How to compare 2 tables and put non-matching results?
This should handle records existing only in one table (i.e. missing / new in second table) and changed / modified values.
I am not sure how it handles duplicate rows, but as you said, it should be easy finding these with a list box.
Thank you so much for the reference/code.
I have used this code for comparison of sources.
See Script1 attached for reference.
First we load source (1.1) (trusted data) and suspect data (1.2)
There is NO joins between them:
QUALIFY *;
After that we allow the join:
UNQUALIFY *;
and formulate (1.3) the Delta (difference) between sources programmatically:
We allow QVW engine to join by the composite key (i.e. turning the composite key into a surrogate key):
KEY_JOINED
See this part of the script:
//(1.3) load Source Formulate the difference between the sources programmatically:
DELTA:
Concatenate (SOURCE)
LOAD
AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,
AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,
AutoNumberHash256 (StateProvinceID, TerritoryID) as KEY_JOINED,
StateProvinceID
,StateProvinceCode
,IsOnlyStateProvinceFlag
,StateProvinceName
,TerritoryID
,CountryRegionCode
,CountryRegionName
,'DELTA' as Table;
SQL SELECT *
from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];
INNER JOIN (SOURCE)
LOAD *
WHERE
"Only in One Table?"
OR "Hash Different?"
OR "Duplicate or Not Match?"
;
LOAD
KEY_JOINED
// (1.4) formulating the 3 data difference scenarios:
,if(count(KEY_JOINED)=1,-1, 0) as "Only in One Table?"
,if(maxstring(Autonumbered_Hash)<>minstring(Autonumbered_Hash),-1,0) as "Hash Different?"
,if(count(KEY_JOINED) <> 2, -1, 0 ) as "Duplicate or Not Match?"
RESIDENT SOURCE
GROUP BY KEY_JOINED
;
See the resulting log:
Connecting to UserDSN
Connected
SOURCE << [vStateProvinceCountryRegion] 181 lines fetched
Connecting to UserDSN_STRANGE_DATA
Connected
SOURCE_STRANGE_DATA << [vStateProvinceCountryRegion_STRANGE_DATA] 182 lines fetched
SOURCE << [vStateProvinceCountryRegion_STRANGE_DATA] 363 lines fetched
DELTA << SOURCE 4 lines fetched
See that only 4 rows populate the Delta recordset.
See that we exposed the source of each data as a column:
,'SOURCE' as Table; ...
,'DELTA' as Table;
See the end result:
there are 3 analytic flags, we can filter by them:
We have reached the goal:
identifying all the main scenarios:
1) Missing data (Quebec missing in suspect data)
2) new rows (NewProvince was missing in the source/correct data)
3) Misspelt rows (see Ontario misspelt, actually number of columns might be in hundreds, so it is very simplified example)
4) duplicate rows (see Alberta below)
The first column: "table" shows us the source of the data.
The only functional requirement that I did not reach was keeping both sources intact (because I had to concatenate two sources, source data now reduced only to delta/modified data).
Is there any way to make the script simpler?
Table | Only in One Table? | Hash Different? | Duplicate or Not Match? | CountryRegionCode | CountryRegionName | StateProvinceName |
SOURCE | 0 | 0 | -1 | CA | Canada | Alberta |
DELTA | 0 | 0 | -1 | CA | Canada | Alberta |
SOURCE | 0 | -1 | 0 | CA | Canada | Ontario |
DELTA | 0 | -1 | 0 | CA | Canada | <MISSPELT>Ontario |
SOURCE | -1 | 0 | -1 | CA | Canada | Quebec |
DELTA | -1 | 0 | -1 | CA | Canada | <NEWPROVINCE> |