Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I have fields Show name, Cast, Director, Country released, Genre.
Where most of the rows are similar to.
Show | cast | director |
country |
Sankofa | Haile Gerima | Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri | United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia |
The Great British Baking Show | Andy Devonshire | Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood | United Kingdom |
When I use, [SubField(cast,',') as Actor, SubField(director,',') as Director, SubField(country,',') as Country]
Some memory error comes up,
Please someone help me out with the situation!
If it says "field cast not found" is because the resident/from is not loading to the correct data source.
OriginalData:
LOAD
show_id, "type", title, listed_in, date_added,release_year,rating,duration,description,
'Netflix' as Network
FROM [lib://DataFiles/netflix_titles.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);
Concatenate(OriginalData)
... Other platforms
Shows:
NoConcatenate LOAD
show_id, "type", title, listed_in, date_added,release_year,rating,duration,description,Network
Resident OriginalData;
Cast:
NoConcatenate LOAD
show_id, Subfield(cast,',') as cast
Rrsident OriginalData;
Director:
NoConcatenate LOAD
show_id, Subfield(director,',') as director
Rrsident OriginalData;
Country:
NoConcatenate LOAD
show_id, Subfield(country,',') as country
Rrsident OriginalData;
DROP Table OriginalData;
Both seem problems caused by the source of the data, it would be better to correct them there so the file has a correct structure.
For point 1 you can check for any character that only has one of the fields, and use this character to reading, something like:
If(Index(duration,':')>0, duration, rating) as duration,
If(Index(duration,':')>0, rating, duration) as rating
I don't know about the 2nd.
Hi, doing subfield that way you are creating a cartesian table with every value of a row related to every other, and actors, directors and countries would make a lot of records.
Maybe you can try with a star schema:
- Central table loads just show, and the fields related 1-1 to the show, if there is any, if not you can omit this table.
- Another table for cast with "LOAD Show, SubField(cast,',') as Actor From/Resident..."
- Another table for directors with "LOAD Show, SubField(director,',') as Director From/Resident..."
- Another table for the country with "LOAD Show, SubField(country,',') as Country From/Resident..."
So Show is the link between all tables and you're reducing a lot the number of final rows.
In a table with 10 actors, 2 directors, and 8 countries, a cartesian table will have 10*2*8=160 rows of data. Split in tables it would be 10+2+8= 20 rows.
Thanks for your guidance! But its kinda not working, when I dont take cast in central table instead only in resident, it says field cast not found!
Actually I have 4 different platforms/tables with all similar field names and similar problems as I mentioned before.
Netflix:
LOAD
show_id, "type", title, cast, director, listed_in, country,date_added,release_year,rating,duration,description,
'Netflix' as Network
FROM [lib://DataFiles/netflix_titles.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);
Similarly for Amazon,Disney and HULU.
I hope I will get a useful solution.
If it says "field cast not found" is because the resident/from is not loading to the correct data source.
OriginalData:
LOAD
show_id, "type", title, listed_in, date_added,release_year,rating,duration,description,
'Netflix' as Network
FROM [lib://DataFiles/netflix_titles.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);
Concatenate(OriginalData)
... Other platforms
Shows:
NoConcatenate LOAD
show_id, "type", title, listed_in, date_added,release_year,rating,duration,description,Network
Resident OriginalData;
Cast:
NoConcatenate LOAD
show_id, Subfield(cast,',') as cast
Rrsident OriginalData;
Director:
NoConcatenate LOAD
show_id, Subfield(director,',') as director
Rrsident OriginalData;
Country:
NoConcatenate LOAD
show_id, Subfield(country,',') as country
Rrsident OriginalData;
DROP Table OriginalData;
Really great one Brother!
I also have 1. some rows of duration and rating interchanged
2.some rows of title contains title as well as rest of the field values separated by ','
, any good way to resolve that?
Both seem problems caused by the source of the data, it would be better to correct them there so the file has a correct structure.
For point 1 you can check for any character that only has one of the fields, and use this character to reading, something like:
If(Index(duration,':')>0, duration, rating) as duration,
If(Index(duration,':')>0, rating, duration) as rating
I don't know about the 2nd.
Okay with the 2nd one, I shall correct it in the source itself as I have only 2 fields with such a problem.
But with the 1st one, I got no specific character in those fields.
Duration consists of 'specific' min and 'specific' Season/s
whereas
Rating is filled with variety like R, TV-MA, TV-G, NR,13+,etc.
Thank you!
Can you post some samples of the duration values?
Or if you have Ratings defined you can use an inline table and use applymap to check if it's any of the rating values:
mapRatings:
Mapping LOAD *, 1 as israting Inline[
R
TV-MA
TV-G
..
];
LOAD ...
If(applymap('mapRatings',rating,0)=1, rating, duration) as rating,
If(applymap('mapRatings',rating,0)=1, duration, rating) as duration
Thank you for the help, everything worked well !
How I solved Durations and Ratings column misplaced values issue;
Durations:
NoConcatenate Load show_id,
If(Index(duration,'min')>0, duration, WildMatch(rating,'*min*')) as RunTime,
If(Index(duration,'Seasons')>0, duration, '1 Season') as Seasons
Resident Data;
Ratings:
NoConcatenate Load show_id,
If(WildMatch(rating,'*TV*','*ALL*','R','N*','AGE*','13+','16','16+'
,'17','18+','7+','G'),
rating,
WildMatch(duration,'*TV*','*ALL*','R','N*','AGE','13+','16','16+'
,'17','18+','7+','G')) as rating
Resident Data;