Skip to main content
Announcements
Global Transformation Awards! Applications are now open. Submit Entry
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Visualize disparate Data from SQL Sources

We need to compare back-end SQL data from different sources.

Let's say we have an ETL or Operational process.

It processes and modifies  data (in mlns).

99.9% of the data is the same.

We need to identify the "Delta" (missing, new, redundant and modified data).

Data is coming from different sources (different Servers etc).

That is why we can not use full outer join and where not exists in SQL.

We have to do similar functionality programmatically in QLV Load.

We have one SOURCE of data (=CORRECT) data.

Lets call it: "SOURCE".

We have a second source (or ETL target) = SUSPECT data.

Lets call it: "STORE SOURCE_STRANGE_DATA".

See data scaffolding:

in SQL Server 2008R2 DB AdventureWorks2008R2

New view created: [Person].[vStateProvinceCountryRegion_STRANGE_DATA]

See Att.1.

It reproduces all rows from the underlying view: vStateProvinceCountryRegion

correctly except the following 4 rows ("STORE SOURCE_STRANGE_DATA"):

1) Misspelt existing row: '<MISSPELT>Ontario'

2) Existing (correct) row duplicated: Alberta duplicated

3) Missing row: Quebec is missing

4) NEW row: <NEWPROVINCE> (StateProvinceID=999, i.e. new key identication)

See CORRECT list of provinces:

ABAlbertaCA
BCBritish ColumbiaCA
NBBrunswickCA
LBLabradorCA
MBManitobaCA
NFNewfoundlandCA
NTNorthwest TerritoriesCA
NSNova ScotiaCA
ONOntarioCA
PEPrince Edward IslandCA
QCQuebecCA
SKSaskatchewanCA
YTYukon TerritoryCA

See Suspect List of Provinces ("STORE SOURCE_STRANGE_DATA".):

(Quebec is missing from the list):

  

ON <MISSPELT>OntarioCA
NN<NEWPROVINCE>CA
AB AlbertaCA
AB AlbertaCA
BCBritish ColumbiaCA
NBBrunswickCA
LBLabradorCA
MBManitobaCA
NFNewfoundlandCA
NTNorthwest TerritoriesCA
NSNova ScotiaCA
PEPrince Edward IslandCA
SKSaskatchewanCA
YTYukon Territory

CA

Functional purpose:

Identify all 4 scenarios, listed above.

1) Keep both sources intact (putting them in the separate tab).

2) Programmatically identify "Delta" (any row that it New/Missing/Changed/Duplicate).

List of columns (used for comparison) should be easily modified, assume not reliable composite key.

I have come close to the solution, but still it is pretty bulky.

This solution should be short and robust and may be (auto-generated) code for each of the hundreds of the table  (being compared).

See the script: (it uses separate DSNs for sources, assuming different servers/connections):

ODBC CONNECT TO UserDSN;

SOURCE:

LOAD

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

  StateProvinceID

  ,StateProvinceCode

  ,IsOnlyStateProvinceFlag

  ,StateProvinceName

  ,TerritoryID

  ,CountryRegionCode

  ,CountryRegionName;

SQL SELECT *

// correct data:

from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion];

STORE SOURCE INTO C:\temp\SOURCE.QVD (qvd);

// ----------------------------------------------------------------------------------------------------------------

QUALIFY *;

UNQUALIFY Autonumbered_Hash;

ODBC CONNECT TO UserDSN_STRANGE_DATA;

SOURCE_STRANGE_DATA:

// the only way to see(load) new, redundant, chnaged values is right keep (but MISSING values DISSAPEAR from source!)

RIGHT KEEP

LOAD AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

// want to keep it separately for dirty flag detection:

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,

  StateProvinceID

  ,StateProvinceCode

  ,IsOnlyStateProvinceFlag

  ,StateProvinceName

  ,TerritoryID

  ,CountryRegionCode

  ,CountryRegionName; 

SQL SELECT *

// data for comparison:

from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];

STORE SOURCE_STRANGE_DATA INTO C:\temp\SOURCE_STRANGE_DATA.QVD (qvd);

QUALIFY *;

UNQUALIFY Autonumbered_Hash;

DELTA:

RIGHT KEEP

LOAD

(-1)* (IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE'))) as Delta_FLAG

,*

FROM C:\temp\SOURCE_STRANGE_DATA.QVD (qvd)

WHERE 1=1

//and IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE')) <> 0

;

// end script

Lets see the results:

  

DELTA.Delta_FLAGDELTA.SOURCE_STRANGE_DATA.StateProvinceCodeDELTA.SOURCE_STRANGE_DATA.StateProvinceNameDELTA.SOURCE_STRANGE_DATA.CountryRegionName
1NN<NEWPROVINCE>Canada
1ON<MISSPELT>OntarioCanada
0ABAlbertaCanada
0BCBritish ColumbiaCanada
0LBLabradorCanada
0MBManitobaCanada
0NBBrunswickCanada
0NFNewfoundlandCanada
0NSNova ScotiaCanada
0NTNorthwest TerritoriesCanada
0PEPrince Edward IslandCanada
0SKSaskatchewanCanada
0YTYukon TerritoryCanada

By exposing "Delta_Flag" as a control in QLV, we can identify the modified and new data.

What is incorrect:

1) Quebec (missing row) is NOT shown and not identified.

Moreover, the right keep removed it from the source.

We need the functional analogue of the FULL OUTER KEEP.

How to reach it?

2) Duplicate row not shown (Alberta).

This is easy however, with a separate QVW list box control (expression showing non-distinct count).

Alberta2
British Columbia1
Brunswick1
Labrador1
Manitoba1
Newfoundland1
Northwest Territories1
Nova Scotia1
Prince Edward Island1
Saskatchewan1
Yukon Territory1

How to make it (duplicate row) a programmatic flag in script?

3)  How to make the script smaller and robust (business want to visualize both sources intact separately, and delta separately though).

4) Sorry, I do not know how to use LOAD ... KEEP so NOT to affect the previously uploaded table at all (like full outer keep) in any way, see on example Quebec dissapeared from the source above (using right keep).

Left keep would discard the new rows in second (suspect) source though.

We need to keep both.

Source: (Quebec was filtered out by right keep):

Alberta
British Columbia
Brunswick
Labrador
Manitoba
Newfoundland
Northwest Territories
Nova Scotia
Prince Edward Island
Saskatchewan

Yukon Territory

Can anybody show in detail how to combine left and right keep so not to affect the previously loaded table?

thank you in advance,

1 Solution

Accepted Solutions
swuehl
MVP
MVP

This might give you an idea how to start (or a possible start):

Re: How to compare 2 tables and put non-matching results?

This should handle records existing only in one table (i.e. missing / new in second table) and changed / modified values.

I am not sure how it handles duplicate rows, but as you said, it should be easy finding these with a list box.

View solution in original post

2 Replies
swuehl
MVP
MVP

This might give you an idea how to start (or a possible start):

Re: How to compare 2 tables and put non-matching results?

This should handle records existing only in one table (i.e. missing / new in second table) and changed / modified values.

I am not sure how it handles duplicate rows, but as you said, it should be easy finding these with a list box.

Not applicable
Author

Thank you so much for the reference/code.

I have used this code for comparison of sources.

See Script1 attached for reference.

First we load source (1.1)  (trusted data) and suspect data (1.2)

There is NO joins between them:

QUALIFY *;

After that we allow the join:

UNQUALIFY *;

and formulate (1.3) the Delta (difference) between sources programmatically:

We allow QVW engine to join by the composite key (i.e. turning the composite key into a surrogate key):

KEY_JOINED

See this part of the script:

//(1.3) load Source Formulate the difference between the sources programmatically:

DELTA:

Concatenate (SOURCE)

LOAD

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,

AutoNumberHash256 (StateProvinceID, TerritoryID) as KEY_JOINED,

  StateProvinceID

   ,StateProvinceCode

   ,IsOnlyStateProvinceFlag

   ,StateProvinceName

   ,TerritoryID

   ,CountryRegionCode

   ,CountryRegionName

,'DELTA' as Table;

SQL SELECT *

from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];

INNER JOIN (SOURCE)

LOAD *

WHERE

"Only in One Table?"

OR  "Hash Different?"

OR "Duplicate or Not Match?"

;

LOAD

KEY_JOINED

// (1.4) formulating the 3 data difference scenarios:

,if(count(KEY_JOINED)=1,-1, 0) as "Only in One Table?"

,if(maxstring(Autonumbered_Hash)<>minstring(Autonumbered_Hash),-1,0) as "Hash Different?"

,if(count(KEY_JOINED) <> 2, -1, 0 ) as "Duplicate or Not Match?"

RESIDENT SOURCE

GROUP BY KEY_JOINED

;

See the resulting log:

Connecting to UserDSN

Connected

SOURCE << [vStateProvinceCountryRegion] 181 lines fetched

Connecting to UserDSN_STRANGE_DATA

Connected

SOURCE_STRANGE_DATA << [vStateProvinceCountryRegion_STRANGE_DATA] 182 lines fetched

SOURCE << [vStateProvinceCountryRegion_STRANGE_DATA] 363 lines fetched

DELTA << SOURCE 4 lines fetched


See that only 4 rows populate the Delta recordset.


See that we exposed the source of each data as a column:

,'SOURCE' as Table; ...

,'DELTA' as Table;


See the end result:

there are 3 analytic flags, we can filter by them:

        

We have reached the goal:

identifying all the main scenarios:

1) Missing data (Quebec missing in suspect data)

2) new rows (NewProvince was missing in the source/correct data)

3) Misspelt rows (see Ontario misspelt, actually number of columns might be in hundreds, so it is very simplified example)

4) duplicate rows (see Alberta below)

The first column: "table" shows us the source of the data.

The only functional requirement that I did not reach was keeping both sources intact (because I had to concatenate two sources, source data now reduced only to delta/modified data).

Is there any way to make the script simpler?

       

TableOnly in One Table?Hash Different?Duplicate or Not Match?CountryRegionCodeCountryRegionNameStateProvinceName
SOURCE00-1CACanadaAlberta
DELTA00-1CACanadaAlberta
SOURCE0-10CACanadaOntario
DELTA0-10CACanada<MISSPELT>Ontario
SOURCE-10-1CACanadaQuebec
DELTA-10-1CACanada<NEWPROVINCE>