Skip to main content
Announcements
NEW: Seamless Public Data Sharing with Qlik's New Anonymous Access Capability: TELL ME MORE!
cancel
Showing results for 
Search instead for 
Did you mean: 
klikgevoel
Contributor III
Contributor III

Scrape webpage in one blow

Hi all,

I've got a problem with an inefficient and non-future proof solution so far. What I want is to scrape data from an API that allows 10.000 records to be scraped. My sad solution so far is creating multiple tables, each skipping a certain set of records. Compare the FROM-statement between raw_CAO_CBS1 & raw_CAO_CBS2 for example.

raw_CAO_CBS1:
LOAD
    [content/properties/ID] as ID_1,
    [content/properties/CaoSectoren] as CaoSectoren_1,
    [content/properties/Perioden] as Perioden_1,
    [content/properties/CaoLonenPerUurInclBijzBeloningen_4] as CaoLonenPerUurInclBijzBeloningen_4_1,
    [content/properties/CaoLonenPerUurInclBijzBeloningen_11] as CaoLonenPerUurInclBijzBeloningen_11_1,
    [content/properties/PercentageAfgeslotenCaoS_15] as PercentageAfgeslotenCaoS_15_1
FROM [https://opendata.cbs.nl/ODataFeed/odata/82838NED/TypedDataSet] (XmlSimple, Table is [feed/entry])
Where [content/properties/Perioden] = '2019MM01';

raw_CAO_CBS2:
LOAD
    [content/properties/ID] as ID_2,
    [content/properties/CaoSectoren] as CaoSectoren_2,
    [content/properties/Perioden] as Perioden_2,
    [content/properties/CaoLonenPerUurInclBijzBeloningen_4] as CaoLonenPerUurInclBijzBeloningen_4_2,
    [content/properties/CaoLonenPerUurInclBijzBeloningen_11] as CaoLonenPerUurInclBijzBeloningen_11_2,
    [content/properties/PercentageAfgeslotenCaoS_15] as PercentageAfgeslotenCaoS_15_2
FROM [https://opendata.cbs.nl/ODataFeed/odata/82838NED/TypedDataSet?$skip=10000] (XmlSimple, Table is [feed/entry])
Where [content/properties/Perioden] = '2019MM01';

I repeat this process by adding tables, editing the record names so no synthetic table is formed, and adjust the end of the FROM statement each time by ?&skip=10000 to ?&skip=20000 .... 30000 and so on.

 

Question: is there an elegant way in which I can circumvent the multiple table creations? So that in the future, when new data is added, I don't have to create new tables (enhance maintainability).

 

Thanks in advance 🙂

1 Solution

Accepted Solutions
marcus_sommer

You don't need and shouldn't create for each load an own table else just loading all records into a single table maybe with a loop like this one:

for i = 0 to 200
   let vSkip = $(i) * 10000;
   table:
      load YourFields, $(i) as Counter
      from [https://opendata.cbs.nl/ODataFeed/odata/82838NED/TypedDataSet?$skip=$(vSkip)] ...;

   if mod(noofrows('table') / 10000) > 0 then
      exit for;
   end if

next

- Marcus

View solution in original post

3 Replies
marcus_sommer

You don't need and shouldn't create for each load an own table else just loading all records into a single table maybe with a loop like this one:

for i = 0 to 200
   let vSkip = $(i) * 10000;
   table:
      load YourFields, $(i) as Counter
      from [https://opendata.cbs.nl/ODataFeed/odata/82838NED/TypedDataSet?$skip=$(vSkip)] ...;

   if mod(noofrows('table') / 10000) > 0 then
      exit for;
   end if

next

- Marcus

klikgevoel
Contributor III
Contributor III
Author

Nice and elegant solution. For the first line, i = 0 to 200, I reckon that 200 is an arbitrary number you took?

Second, for relatively low-skilled Qlikview like me, the Modulo operator only accepts integers, as such:

if Mod(NoOfRows('Your_Table'),10000) > 0

For the rest, my greatest thanks!

marcus_sommer

Yes, it was just arbitrary. If you know the approximately number of records you could use an exact number and if not you need to ensure that those number * 10000 is definitely larger as the number of records (and to run not in an error the extra exit for statement - whereby using ERRORMODE might be an alternatively).

- Marcus