2 Replies Latest reply: Dec 14, 2015 5:38 PM by Alexei Akimov RSS

    Visualize disparate Data from SQL Sources

    Alexei Akimov

      We need to compare back-end SQL data from different sources.

      Let's say we have an ETL or Operational process.

      It processes and modifies  data (in mlns).

      99.9% of the data is the same.

      We need to identify the "Delta" (missing, new, redundant and modified data).

       

       

      Data is coming from different sources (different Servers etc).

      That is why we can not use full outer join and where not exists in SQL.

      We have to do similar functionality programmatically in QLV Load.

       

       

      We have one SOURCE of data (=CORRECT) data.

      Lets call it: "SOURCE".

       

       

      We have a second source (or ETL target) = SUSPECT data.

      Lets call it: "STORE SOURCE_STRANGE_DATA".

       

       

      See data scaffolding:

      in SQL Server 2008R2 DB AdventureWorks2008R2

      New view created: [Person].[vStateProvinceCountryRegion_STRANGE_DATA]

      See Att.1.

       

      It reproduces all rows from the underlying view: vStateProvinceCountryRegion

      correctly except the following 4 rows ("STORE SOURCE_STRANGE_DATA"):

       

      1) Misspelt existing row: '<MISSPELT>Ontario'

      2) Existing (correct) row duplicated: Alberta duplicated

      3) Missing row: Quebec is missing

      4) NEW row: <NEWPROVINCE> (StateProvinceID=999, i.e. new key identication)

       

      See CORRECT list of provinces:

       

      ABAlbertaCA
      BCBritish ColumbiaCA
      NBBrunswickCA
      LBLabradorCA
      MBManitobaCA
      NFNewfoundlandCA
      NTNorthwest TerritoriesCA
      NSNova ScotiaCA
      ONOntarioCA
      PEPrince Edward IslandCA
      QCQuebecCA
      SKSaskatchewanCA
      YTYukon TerritoryCA

       

      See Suspect List of Provinces ("STORE SOURCE_STRANGE_DATA".):

      (Quebec is missing from the list):

        

      ON <MISSPELT>OntarioCA
      NN<NEWPROVINCE>CA
      AB AlbertaCA
      AB AlbertaCA
      BCBritish ColumbiaCA
      NBBrunswickCA
      LBLabradorCA
      MBManitobaCA
      NFNewfoundlandCA
      NTNorthwest TerritoriesCA
      NSNova ScotiaCA
      PEPrince Edward IslandCA
      SKSaskatchewanCA
      YTYukon Territory

      CA

       

      Functional purpose:

      Identify all 4 scenarios, listed above.

      1) Keep both sources intact (putting them in the separate tab).

      2) Programmatically identify "Delta" (any row that it New/Missing/Changed/Duplicate).

       

      List of columns (used for comparison) should be easily modified, assume not reliable composite key.

       

      I have come close to the solution, but still it is pretty bulky.

      This solution should be short and robust and may be (auto-generated) code for each of the hundreds of the table  (being compared).

       

      See the script: (it uses separate DSNs for sources, assuming different servers/connections):

       

       

      ODBC CONNECT TO UserDSN;

      SOURCE:

      LOAD

      AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

        StateProvinceID

        ,StateProvinceCode

        ,IsOnlyStateProvinceFlag

        ,StateProvinceName

        ,TerritoryID

        ,CountryRegionCode

        ,CountryRegionName;

       

      SQL SELECT *

      // correct data:

      from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion];

       

      STORE SOURCE INTO C:\temp\SOURCE.QVD (qvd);

      // ----------------------------------------------------------------------------------------------------------------

       

      QUALIFY *;

      UNQUALIFY Autonumbered_Hash;

      ODBC CONNECT TO UserDSN_STRANGE_DATA;

       

      SOURCE_STRANGE_DATA:

      // the only way to see(load) new, redundant, chnaged values is right keep (but MISSING values DISSAPEAR from source!)

      RIGHT KEEP

      LOAD AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

      // want to keep it separately for dirty flag detection:

      AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,

        StateProvinceID

        ,StateProvinceCode

        ,IsOnlyStateProvinceFlag

        ,StateProvinceName

        ,TerritoryID

        ,CountryRegionCode

        ,CountryRegionName; 

      SQL SELECT *

      // data for comparison:

      from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];

       

      STORE SOURCE_STRANGE_DATA INTO C:\temp\SOURCE_STRANGE_DATA.QVD (qvd);

       

      QUALIFY *;

      UNQUALIFY Autonumbered_Hash;

       

      DELTA:

      RIGHT KEEP

      LOAD

      (-1)* (IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE'))) as Delta_FLAG

      ,*

      FROM C:\temp\SOURCE_STRANGE_DATA.QVD (qvd)

      WHERE 1=1

      //and IsNull(Lookup ('Autonumbered_Hash', 'Autonumbered_Hash', Autonumbered_Hash, 'SOURCE')) <> 0

      ;

      // end script

       

      Lets see the results:

       

        

      DELTA.Delta_FLAGDELTA.SOURCE_STRANGE_DATA.StateProvinceCodeDELTA.SOURCE_STRANGE_DATA.StateProvinceNameDELTA.SOURCE_STRANGE_DATA.CountryRegionName
      1NN<NEWPROVINCE>Canada
      1ON<MISSPELT>OntarioCanada
      0ABAlbertaCanada
      0BCBritish ColumbiaCanada
      0LBLabradorCanada
      0MBManitobaCanada
      0NBBrunswickCanada
      0NFNewfoundlandCanada
      0NSNova ScotiaCanada
      0NTNorthwest TerritoriesCanada
      0PEPrince Edward IslandCanada
      0SKSaskatchewanCanada
      0YTYukon TerritoryCanada

       

      By exposing "Delta_Flag" as a control in QLV, we can identify the modified and new data.

       

      What is incorrect:

      1) Quebec (missing row) is NOT shown and not identified.

      Moreover, the right keep removed it from the source.

      We need the functional analogue of the FULL OUTER KEEP.

      How to reach it?

       

      2) Duplicate row not shown (Alberta).

      This is easy however, with a separate QVW list box control (expression showing non-distinct count).

       

      Alberta2
      British Columbia1
      Brunswick1
      Labrador1
      Manitoba1
      Newfoundland1
      Northwest Territories1
      Nova Scotia1
      Prince Edward Island1
      Saskatchewan1
      Yukon Territory1

      How to make it (duplicate row) a programmatic flag in script?

       

      3)  How to make the script smaller and robust (business want to visualize both sources intact separately, and delta separately though).

       

      4) Sorry, I do not know how to use LOAD ... KEEP so NOT to affect the previously uploaded table at all (like full outer keep) in any way, see on example Quebec dissapeared from the source above (using right keep).

       

      Left keep would discard the new rows in second (suspect) source though.

      We need to keep both.

       

      Source: (Quebec was filtered out by right keep):

      Alberta
      British Columbia
      Brunswick
      Labrador
      Manitoba
      Newfoundland
      Northwest Territories
      Nova Scotia
      Prince Edward Island
      Saskatchewan

      Yukon Territory

       

      Can anybody show in detail how to combine left and right keep so not to affect the previously loaded table?

       

      thank you in advance,

        • Re: Visualize disparate Data from SQL Sources
          Stefan Wühl

          This might give you an idea how to start (or a possible start):

          Re: How to compare 2 tables and put non-matching results?

           

          This should handle records existing only in one table (i.e. missing / new in second table) and changed / modified values.

           

          I am not sure how it handles duplicate rows, but as you said, it should be easy finding these with a list box.

            • Re: Visualize disparate Data from SQL Sources
              Alexei Akimov

              Thank you so much for the reference/code.

              I have used this code for comparison of sources.

              See Script1 attached for reference.

               

              First we load source (1.1)  (trusted data) and suspect data (1.2)

               

              There is NO joins between them:

              QUALIFY *;

               

               

              After that we allow the join:

              UNQUALIFY *;

              and formulate (1.3) the Delta (difference) between sources programmatically:

              We allow QVW engine to join by the composite key (i.e. turning the composite key into a surrogate key):

              KEY_JOINED

               

               

               

              See this part of the script:

               

              //(1.3) load Source Formulate the difference between the sources programmatically:

              DELTA:

              Concatenate (SOURCE)

              LOAD

              AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as Autonumbered_Hash,

              AutoNumberHash256 (StateProvinceID, TerritoryID,StateProvinceName,CountryRegionName) as SSD_Autonumbered_Hash,

              AutoNumberHash256 (StateProvinceID, TerritoryID) as KEY_JOINED,

                StateProvinceID

                 ,StateProvinceCode

                 ,IsOnlyStateProvinceFlag

                 ,StateProvinceName

                 ,TerritoryID

                 ,CountryRegionCode

                 ,CountryRegionName

              ,'DELTA' as Table;

              SQL SELECT *

              from [AdventureWorks2008R2].[Person].[vStateProvinceCountryRegion_STRANGE_DATA];

               

              INNER JOIN (SOURCE)

              LOAD *

              WHERE

              "Only in One Table?"

              OR  "Hash Different?"

              OR "Duplicate or Not Match?"

              ;

              LOAD

              KEY_JOINED

              // (1.4) formulating the 3 data difference scenarios:

              ,if(count(KEY_JOINED)=1,-1, 0) as "Only in One Table?"

              ,if(maxstring(Autonumbered_Hash)<>minstring(Autonumbered_Hash),-1,0) as "Hash Different?"

              ,if(count(KEY_JOINED) <> 2, -1, 0 ) as "Duplicate or Not Match?"

              RESIDENT SOURCE

              GROUP BY KEY_JOINED

              ;

               

               

              See the resulting log:

               

              Connecting to UserDSN

              Connected

              SOURCE << [vStateProvinceCountryRegion] 181 lines fetched

              Connecting to UserDSN_STRANGE_DATA

              Connected

              SOURCE_STRANGE_DATA << [vStateProvinceCountryRegion_STRANGE_DATA] 182 lines fetched

              SOURCE << [vStateProvinceCountryRegion_STRANGE_DATA] 363 lines fetched

              DELTA << SOURCE 4 lines fetched


              See that only 4 rows populate the Delta recordset.


              See that we exposed the source of each data as a column:

              ,'SOURCE' as Table; ...

              ,'DELTA' as Table;


               

              See the end result:

              there are 3 analytic flags, we can filter by them:

                      

              We have reached the goal:

              identifying all the main scenarios:

              1) Missing data (Quebec missing in suspect data)

              2) new rows (NewProvince was missing in the source/correct data)

              3) Misspelt rows (see Ontario misspelt, actually number of columns might be in hundreds, so it is very simplified example)

              4) duplicate rows (see Alberta below)

               

              The first column: "table" shows us the source of the data.

               

              The only functional requirement that I did not reach was keeping both sources intact (because I had to concatenate two sources, source data now reduced only to delta/modified data).

               

              Is there any way to make the script simpler?

               

                     

              TableOnly in One Table?Hash Different?Duplicate or Not Match?CountryRegionCodeCountryRegionNameStateProvinceName
              SOURCE00-1CACanadaAlberta
              DELTA00-1CACanadaAlberta
              SOURCE0-10CACanadaOntario
              DELTA0-10CACanada<MISSPELT>Ontario
              SOURCE-10-1CACanadaQuebec
              DELTA-10-1CACanada<NEWPROVINCE>