Skip to main content
Announcements
SYSTEM MAINTENANCE: Thurs., Sept. 19, 1 AM ET, Platform will be unavailable for approx. 60 minutes.
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Normalization of Data

Hi

we are loading the data from multiple sources. how to avoid the duplicates while loading the raw data??

Below is the sample data:

ID         Name

0111    IBM

0111    VIBM

0111    IBM CORPORATION

0422    V-SS&C TECHNOLOGIES INC

0422    V-SSC TECHNOLOGIES INC

0171     V-V-RECORDS MANAGEMENT INC

3674     V-EMC CORPORATION

3199     V- AMERICA INTERNATIONAL CORPORATION

3789    V-AIRTEL INC

3789    V-V-AIRTEL INC

7 Replies
sunny_talwar

How would you decide which Name to load. For example When ID = 0111 which Name would go? IBM, VIBM or IBM CROPORATION?

swuehl
MVP
MVP

Have a look at

Data Cleansing

In your case, create a table with customer agreed Names for each ID, then e.g. MAP the values when loading your source data.

Not applicable
Author

I have 13656 rows data. I  have to create manually????

Not applicable
Author

In this case IBM is the root data for loading.

sunny_talwar

We need to have consistency. If it is always going to be the first name then you can use FirstValue() function. Something like this:

LOAD ID,

          FirstValue(Name) as Name

FROM Source

Group By ID;

swuehl
MVP
MVP

Kishore Kumar Karakavalasa wrote:

I have 13656 rows data. I  have to create manually????

Hopefully, there is a table with agreed Names for each ID somewhere in your systems (call it a master dimension table or leading data table for that dimension), so you can just load this table as dimension table in QV.

Peter_Cammaert
Partner - Champion III
Partner - Champion III

Or, to keep it simple: decide which one of your sources is "leading". That will probably be the data source that:

  • contains the most ID-Name combinations (consider adding missing ones)
  • is easy to maintain
  • has the fewest incorrect Name values (consider correcting the ones that are wrong)

Solutions to merge master data tables from different systems rely either on

  • one system being "boss" and being maintained as such
  • or some kind of mapping table like Stefan suggested
  • or a ruleset to guide the reload engine when decisions have to be made about which Name to retain.

Peter