I struggle with joining two tables containing wildcards. Here is sample table:
ClientID
SubID
Name
Category
Stream
CombinedID
2346
String1
Name1
Category1
Stream1
2346String1
2346
String2
Name1
Category2
Stream2
2346String2
2378
String3
Name2
Category2
Stream2
2378String3
2378
String4
Name2
Category3
Stream3
2378String4
2470
*
Name3
Category2
Stream3
2470*
2646
*
TBD
Category3
Stream3
2646*
*
*
TBD
Other
Other
**
What I need basically:
- If there is defined ClientID and SubID, use this as unique ID when joining towards final data table
- If there is Defined ClientID only, join only based on ID (There are multiple SubIDs that e.g. 2470 can get, but it will always be linked to same Name, Category and Stream)
- For other, not defined matches in the table (No hard linked Client ID and SubID) use Values "TBD; Other; Other".
I need this to implement into the load script.
The sample table is taken from Excel, so it can be dynamically changed in case some new relationship needs to be defined. Full dataset is growing steadily and currently has over 160k rows with at least 30 fields. Each month there is approximately 2000 more rows. This dataset contains the ClientID and SubID (which combination I've wanted to use as a primary key shown in the table as CombinedID). With this approach I got stuck because of the wildcard.
I've also tried pure joining (with "*" or with empty cells), but that ended in a lot of empty values.
I've also tried to create Cartesian product for each combination of ClientID and SubID, but that ended in over 18 million rows.
I've also tried Rob Wunderlich's approach, but that is pretty limited for my data set (as expression length can be only over 8k characters).
Thank you for any help or pointing towards direction