How to Efficiently Clean HTML Tags from Your Data Source
When working with web APIs, one of the problems that we consistently run into is on fields where the user can input into a free-from text box. These fields often flow into Qlik filled with HTML tags, making them difficult to read and larger in size to store. This example demonstrates how we cleaned a "Comment" field, where the user can enter anything into an input box at the source. Here's a sample comment that needs to be cleaned as it would appear from the API.
<div style="" ><div><p>:Tom went to the store to buy apples.<br>There were only 2 apples remaining.<br>So he bought the last 2 apples.</p> </div> </div>
Hopefully the following code will save you some time and hassle trying to clean your data. In short, we load in the Comment table, run a "For" loop through each comment, and within each comment for loop, run a "Do" loop through the HTML tags and clean them for readability. We then populate the ID field (to join back to the original table) and the cleaned comment into a second table which can be joined back once the table is complete.
Comments: LOAD Comment_ID, Comment FROM [lib://QVD Folder Connection/Comments.qvd] (qvd);
For vRow = 1 to NoOfRows('Comments') Let vID = Peek('Comment_ID' , vRow-1 , 'Comments');//get the ID of the next record to parse Let vtext = Peek('Comment' , vRow-1 , 'Comments');//get the original comment of the next record
do while len(TextBetween(vtext, '<', '>'))>0//loop through each comment in place vtext=Replace(vtext,'<br>',chr(10));//replace line breaks with carriage returns - improves legibility vtext=Replace(vtext, '<'&TextBetween(vtext, '<', '>')&'>', '');//find groups with <> and replace them with ''
temp: load Comment_ID, '$(vtext)' as cleancomment Resident Comments Where Comment_ID='$(vID)';
left join (Comments) load * resident temp; drop table temp;