repetto74
Contributor II

Join one CSV file to two other CSV files and keep the number of records of the first file

Hi Guys,

How can I join three CSV files in the script while keeping the total number of records of the first CSV? The main file (OrderFact) has 112,650 records, and the field order_id is shared with the second and third tables I want to join (Order and Review). An INNER JOIN between the first two files gives a merged table with 112,650 records for all fields, but any subsequent JOIN of OrderFact with the third table (Review) gives 113,425 records for all fields, which is not correct.

Where am I going wrong in the script?

Here is the script. 

OrderFact:
LOAD
order_id,
order_item_id,
product_id,
seller_id,
shipping_limit_date,
price,
freight_value
FROM [lib://DataFiles/olist_order_items_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq);

INNER JOIN(OrderFact)

Order:
LOAD
order_id,
customer_id,
order_status,
order_purchase_timestamp,
order_approved_at,
order_delivered_carrier_date,
order_delivered_customer_date,
order_estimated_delivery_date
FROM [lib://DataFiles/olist_orders_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq);

JOIN(OrderFact)

Review:
LOAD
review_id,
order_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);


5 Replies
Vegar
MVP

You probably have multiple rows with the same order_id in the Order table.

To identify which orders: load only the Order table and then create a table chart with order_id as a dimension and count(order_id) as an expression. If any order_id counts more than 1, you have duplicates in your Order table. With duplicates in the Order table you will duplicate the OrderFact transactions as well when performing a join.
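The same check can also be done directly in the load script. Here is a minimal sketch (the table names OrderCounts and DuplicateOrders are just illustrative) that keeps only the order_ids occurring more than once:

// Count rows per order_id in the orders file
OrderCounts:
LOAD
  order_id,
  count(order_id) as rows_per_order
FROM [lib://DataFiles/olist_orders_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq)
GROUP BY order_id;

// Keep only the order_ids that occur more than once
DuplicateOrders:
NoConcatenate
LOAD order_id, rows_per_order
RESIDENT OrderCounts
WHERE rows_per_order > 1;

DROP TABLE OrderCounts;

If DuplicateOrders ends up empty, the duplication is coming from one of the other tables instead.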

repetto74
Contributor II
Author

Hi Vegar,

Thanks for your reply on this. Please review my first message again, as I made some amendments.

With the INNER JOIN I can merge OrderFact and Order and get the correct number of records, but when I then try to add the review_id from the third CSV, any subsequent JOIN always gives a wrong number of records in the merged table.

order_id in the Order table is unique, but order_id in OrderFact is repeated as many times as there are order lines for the same order. review_id is also unique for each review made by the customer.

Should I try to INNER JOIN the Review table with the Order table, where order_id is unique, instead?
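Roughly, that idea would look like the sketch below (same files as in the first post, field lists shortened). Note that if any order has more than one review, the Order table itself ends up with several rows per order_id, so the later join into OrderFact would still inflate the record count.

// Sketch: fold review_id into the Order table first, then join Order into OrderFact as before
Order:
LOAD
  order_id,
  customer_id,
  order_status
FROM [lib://DataFiles/olist_orders_dataset.csv]
(txt, codepage is 28591, embedded labels, delimiter is ',', msq);

LEFT JOIN (Order)
LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);

// Only if every order has at most one review does Order keep one row per order_id,
// and only then does the INNER JOIN into OrderFact preserve its 112,650 records.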

 

Vegar
MVP

Again, my guess would be that you have multiple rows with the same order_id in your join table. Could it be that you have more than one review per order?

If it is acceptable to get only one review_id per order, then you could use ApplyMap; with ApplyMap you are guaranteed not to create duplicate rows. Take a look at the script below.


Map_Review:
MAPPING LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
  (txt, utf8, embedded labels, delimiter is ',', msq);

OrderFact:
LOAD
  order_id,
  applymap('Map_Review', order_id, '<No Review>') as [Order review],
  order_item_id,
  product_id,
  seller_id,
  shipping_limit_date,
  price,
  freight_value
FROM [lib://DataFiles/olist_order_items_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

INNER JOIN(OrderFact)
LOAD
  order_id,
  customer_id,
  order_status,
  order_purchase_timestamp,
  order_approved_at,
  order_delivered_carrier_date,
  order_delivered_customer_date,
  order_estimated_delivery_date
FROM [lib://DataFiles/olist_orders_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

If it is not acceptable to have just one review per order, then I would consider not performing any join with the review table at all; just leave it as a dimension table in the data model. Like this:

OrderFact:
LOAD
  order_id,
  applymap('Map_Review', order_id, '<No Review>') as [Order review], // optional here, since Review is also kept as its own table below
  order_item_id,
  product_id,
  seller_id,
  shipping_limit_date,
  price,
  freight_value
FROM [lib://DataFiles/olist_order_items_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

INNER JOIN(OrderFact)
LOAD
  order_id,
  customer_id,
  order_status,
  order_purchase_timestamp,
  order_approved_at,
  order_delivered_carrier_date,
  order_delivered_customer_date,
  order_estimated_delivery_date
FROM [lib://DataFiles/olist_orders_dataset.csv]
  (txt, codepage is 28591, embedded labels, delimiter is ',', msq);

Review:
LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
  (txt, utf8, embedded labels, delimiter is ',', msq);

repetto74
Contributor II
Author

Hi Vegar,

Many thanks for your solution, this is exactly what I needed! :-) I was digging into the documentation on how to perform lookups in Qlik, and this ApplyMap function suits the need perfectly. The review CSV file actually has 104,000 records, but many of them contain only junk text comments with no review_id or order_id, so ApplyMap does the job of mapping the review_id into the OrderFact table (keeping the exact number of records), and the Review table can then be cleaned of the unusable records (100,000 records available in total).
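For anyone reading along, a minimal sketch of that cleanup of the Review table could look like this, assuming a record counts as usable when both order_id and review_id are populated (that condition is just an assumption about the data):

// Load only the usable review records into the Review dimension table
Review:
LOAD
  order_id,
  review_id
FROM [lib://DataFiles/olist_order_reviews_dataset.csv]
(txt, utf8, embedded labels, delimiter is ',', msq)
WHERE len(trim(order_id)) > 0 and len(trim(review_id)) > 0;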

I have to study this function a little bit. Thanks for bringing it to my attention! 🙂

Happy with the result, thank you 🙂

Rick

Vegar
MVP

Glad to hear that you found a way forward with your application.

Good luck with the implementation.

BR
Vegar