Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
cristian_av
Creator III
Creator III

Join-Qualify, vs Join-Concatenate vs Join-Load from QVD

Hi

I have the same pair of tables that have to be joined, from multiple databases (About 20).

I need to create one table with all of this tables, but I don't k now which is the best way to do this. I've tried the following:

In the real world, the pair are about 6 tables with attributed to be joined for each database.

1° Try:

-Connect to First Database

-Do a Join on each pair of tables

-Then Connect to Second Database

Concatenate and Join to First tables.

-Not working because the second Join is not working.

-I think it's the best way because I need to connect only one time to each database, and the databases will be separated in the load script. (The bad part it that it's not working )

-I think I will have to use qualify, but I don't now how to use here.

qvjoinnotworking.png

2° Try:

Load The Customers Table from All the Databases

Load the Income table from All the Databases

Join the Tables

-This works, but I need to connect many times to each database, and if for some reason I had troubles with one of these, I will need to edit the whole script, not only one "Tab".

QVMissedData.png

3° Try

-Connect to First Database

-Do a Join on each pair of tables

-Store on a QVD.

-Drop Table.

-Connect to Second Database

-Do a Join on each pair of tables

-Join with first QVD..


-I need to connect only one time to each database, and the databases will be separated in the load script.

-The bad part it that I need to create one QVD for each table, and load to memory twice(when connecting to DB and when loading from QVD), making the script slower.


QVMissedData.png

Finally, for some reason, with 2° and 3° try, I lost an ID. I think it's another trouble, but I'm not sure why I list that ID.

Hope somebody can explain 

a) What is the best approach to join multiple tables without having to connect multiple times to same server and without loading multiple times the same data to memory.

b) Why one key is lost with approach 2 and 3.

Thanks!

1 Solution

Accepted Solutions
marcus_sommer

Personally I wouldn't worry about the number of connections else about the number of records which are to pull and how much workload it caused. If this is significant in any way I would suggest to consider incremental load-appoaches and onyl pull new/changed data.

Quite common in such scenarios is the use of multiple stages in the ETL process, for example a 3-tier or 4-tier data-architecture with several generator-qvw's which pull the data (incremental) from the database and store them into qvd's. The second step would be to use several generator-qvw's which merge the qvd-data per join/mapping or concatenating into bigger data-files - maybe also with incremental approaches.

The third/fourth step would be to create from these and other data-files a (or several) datamodels and reports.

It sounds like much work but by bigger and/or complex tasks it's very useful to split them into multiple small parts which could be more easily developed/maintained and could be of course parallelized by doing this and by a probably automated execution.

- Marcus

View solution in original post

6 Replies
marcus_sommer

Personally I wouldn't worry about the number of connections else about the number of records which are to pull and how much workload it caused. If this is significant in any way I would suggest to consider incremental load-appoaches and onyl pull new/changed data.

Quite common in such scenarios is the use of multiple stages in the ETL process, for example a 3-tier or 4-tier data-architecture with several generator-qvw's which pull the data (incremental) from the database and store them into qvd's. The second step would be to use several generator-qvw's which merge the qvd-data per join/mapping or concatenating into bigger data-files - maybe also with incremental approaches.

The third/fourth step would be to create from these and other data-files a (or several) datamodels and reports.

It sounds like much work but by bigger and/or complex tasks it's very useful to split them into multiple small parts which could be more easily developed/maintained and could be of course parallelized by doing this and by a probably automated execution.

- Marcus

swuehl
MVP
MVP

a) The reason why your first script does not work as expected: After your first DB access & table join, you have a resident table with fields ID, Year and salary. In your second DB access & JOIN, the already existing salary field will be used as additional key field, and this not what you want.

You would need to create  a temporary table in the subsequent table joins and concatenate the temporary table to your final table:

[Customers]:

LOAD Year&ID as Key,

Year,

    ID,

    Name,

    RandomA

FROM

[Customers.xlsx]

(ooxml, embedded labels, table is Employee1);

Join (Customers)

LOAD Year&ID as Key,

    Salary

FROM

[Customers.xlsx]

(ooxml, embedded labels, table is Income1);

//OLEDB CONNECT TO.... Second Database

TMP:

NoConcatenate

LOAD Year&ID as Key,

Year,

    ID,

    Name,

    RandomA

FROM

[Customers.xlsx]

(ooxml, embedded labels, table is Employee2);

Join (TMP)

LOAD Year&ID as Key,

    Salary

FROM

[Customers.xlsx]

(ooxml, embedded labels, table is Income2);

Concatenate (Customers)

LOAD * Resident TMP;

Drop Table TMP;



You probably noticed that this is basically your script #3 using temporary resident tables instead of QVDs.

b) if you want to get rid of the ID 6 in your resulting table, you would need to use a LEFT JOIN instead of a JOIN

Key ID Year Name RandomA Salary
2016112016Juansds1000
2016222016Pedrofdf2000
2016332016Charleswer3000
2016442016Michaelwtt4000
2016552016Diegower5000
20161001002016Mariadfg1110
20161201202016Juanafdgfd1210
20161301302016Carlaert1310
20161401402016Franciscatrt1410
20161501502016Lisetteweqq1510
swuehl
MVP
MVP

If you can create a SELECT statement that performs the JOIN on the remote DBMS, you can create a simple loop that iterates over all data bases, connects and read the join'ed tables, autoconcatenating to your final table.

cristian_av
Creator III
Creator III
Author

Thanks! How can I create a loop that iterates over databases? I know how to create a loop.. but how do I change the string connection in each loop?

cristian_av
Creator III
Creator III
Author

Thanks!. I was trying someway to do the second join with qualify with no luck.

But your solution is more elegant, and works .

cristian_av
Creator III
Creator III
Author

Thanks. I thought that multiple connections to the same server would be resource-intensive.

Your idea of multiple qvd is a better one, but as you said, it sound like much work, but seems to be easy to maintain.