repetto74
Contributor II

Join one CSV file to two other CSV files and keep the number of records of the first file

Hi Guys,

How can I join three CSV files in the script while keeping the total number of records of the first CSV? The main file (OrderFact) has 112,650 records, and the field order_id is shared with the second and third tables I want to join (Order and Review). An INNER JOIN between the first two files gives a merged table with 112,650 records for all fields, but any subsequent JOIN of OrderFact with the third table (Review) gives 113,425 records for all fields, which is not correct.

Where am I going wrong in the script?

Here is the script. 

OrderFact:
LOAD
order_id,
order_item_id,
product_id,
seller_id,
shipping_limit_date,
price,
freight_value
FROM [lib://DataFiles/olist_order_items_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq);

INNER JOIN(OrderFact)

Order:
LOAD
order_id,
customer_id,
order_status,
order_purchase_timestamp,
order_approved_at,
order_delivered_carrier_date,
order_delivered_customer_date,
order_estimated_delivery_date
FROM [lib://DataFiles/olist_orders_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq);

JOIN(OrderFact)

Review:
LOAD
review_id,
order_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);


5 Replies
Vegar
MVP

You probably have multiple rows with the same order_id in the Order table.

To identify which orders: load only the Order table and then create a table chart with order_id as a dimension and count(order_id) as an expression. If any order_id counts more than 1, you have duplicates in your Order table. With duplicates in the Order table you will duplicate the OrderFact transactions as well when performing a join.
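The same check can also be done directly in the load script. Here is a minimal sketch (the table names OrderCounts and DuplicateOrders are just illustrative) that keeps only the order_ids occurring more than once:

// Count rows per order_id in the orders file
OrderCounts:
LOAD
  order_id,
  count(order_id) as rows_per_order
FROM [lib://DataFiles/olist_orders_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq)
GROUP BY order_id;

// Keep only the order_ids that occur more than once
DuplicateOrders:
NoConcatenate
LOAD order_id, rows_per_order
RESIDENT OrderCounts
WHERE rows_per_order > 1;

DROP TABLE OrderCounts;

If DuplicateOrders ends up empty, the duplication is coming from one of the other tables instead.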

repetto74
Contributor II
Author

Hi Vegar,

Thanks for your reply on this. Please review my first message again, as I made some amendments.

With the INNER JOIN I can merge OrderFact and Order and get the correct number of records, but when I then try to add the review_id from the third CSV, any subsequent JOIN always gives a wrong number of records in the merged table.

order_id in the Order table is unique, but order_id in OrderFact is repeated as many times as there are order lines for the same order. review_id is also unique for each review made by the customer.

Should I try to INNER JOIN the Review table with the Order table, where order_id is unique, instead?
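Roughly, that idea would look like the sketch below (same files as in the first post, field lists shortened). Note that if any order has more than one review, the Order table itself ends up with several rows per order_id, so the later join into OrderFact would still inflate the record count.

// Sketch: fold review_id into the Order table first, then join Order into OrderFact as before
Order:
LOAD
  order_id,
  customer_id,
  order_status
FROM [lib://DataFiles/olist_orders_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq);

LEFT JOIN (Order)
LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);

// Only if every order has at most one review does Order keep one row per order_id,
// and only then does the INNER JOIN into OrderFact preserve its 112,650 records.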

 

Vegar
MVP

Again, my guess would be that you have multiple rows with the same order_id in your join table. Could it be that you have more than one review per order?

If it is acceptable to get only one review_id per order, then you could use ApplyMap; with ApplyMap you are guaranteed not to create duplicate rows. Take a look at the script below.


Map_Review:
MAPPING LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
  (txt, utf8, embedded labels, delimiter is ',', msq);

OrderFact:
LOAD
  order_id,
  applymap('Map_Review', order_id, '<No Review>') as [Order review],
  order_item_id,
  product_id,
  seller_id,
  shipping_limit_date,
  price,
  freight_value
FROM [lib://DataFiles/olist_order_items_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

INNER JOIN(OrderFact)
LOAD
  order_id,
  customer_id,
  order_status,
  order_purchase_timestamp,
  order_approved_at,
  order_delivered_carrier_date,
  order_delivered_customer_date,
  order_estimated_delivery_date
FROM [lib://DataFiles/olist_orders_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

If it is not acceptable to have just one review per order, then I would consider not performing any join with the review table at all; just leave it as a dimension table in the data model. Like this:

OrderFact:
LOAD
  order_id,
  applymap('Map_Review', order_id, '<No Review>') as [Order review], // optional here, since Review is also kept as its own table below
  order_item_id,
  product_id,
  seller_id,
  shipping_limit_date,
  price,
  freight_value
FROM [lib://DataFiles/olist_order_items_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

INNER JOIN(OrderFact)
LOAD
  order_id,
  customer_id,
  order_status,
  order_purchase_timestamp,
  order_approved_at,
  order_delivered_carrier_date,
  order_delivered_customer_date,
  order_estimated_delivery_date
FROM [lib://DataFiles/olist_orders_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

Review:
LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
  (txt, utf8, embedded labels, delimiter is ',', msq);

repetto74
Contributor II
Author

Hi Vegar,

Many thanks for your solution, this is exactly what I needed! :-) I was digging into the documentation on how to perform lookups in Qlik, and this ApplyMap function suits the need perfectly. The review CSV file actually has 104,000 records, but many of them contain only junk text comments with no review_id or order_id, so ApplyMap does the job of mapping the review_id into the OrderFact table (keeping the exact number of records), and the Review table can then be cleaned of the unusable records (100,000 records available in total).
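For anyone reading along, a minimal sketch of that cleanup of the Review table could look like this, assuming a record counts as usable when both order_id and review_id are populated (that condition is just an assumption about the data):

// Load only the usable review records into the Review dimension table
Review:
LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
(txt, utf8, embedded labels, delimiter is ',', msq)
WHERE len(trim(order_id)) > 0 and len(trim(review_id)) > 0;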

I have to study this function a little bit. Thanks for bringing it to my attention! 🙂

Happy with the result, thank you 🙂

Rick

Vegar
MVP

Glad to hear that you found a way forward with your application.

Good luck with the implementation.

BR
Vegar