Extract data points from unstructured text with the RegEx Text Analytics Web Connector

    One of the less known features of the Sentiment & Text Analytics Connector is the RegEx option to extract identifiable data based on patterns in unstructured data fields. RegEx - Regular Expressions - is common across many different data analysis tools. There are some really useful websites that I'd suggest you look at to learn more.


    This is often required for log or event records where the data is saved as a text string rather than a delimited or table format. I thought I'd explore some of the basic options available and learnt a little about RegEx to work out how to recognise a few different data patterns.

     

     

    Learn about RegEx

     

    RegExr: Learn, Build, & Test RegEx - This site has lots of example and also allows you to test your expressions. You can change the text so that you test all of your different scenarios and can then copy them into the list of your tests for your Qlik application.

    Regular Expression Examples - Another great site with some really good explanations on how the expressions were constructed.

     

     

    Some working examples

    Here are a number of examples which I have included in the attached Qlik Sense application, along with the example source data.

     

    Regex Test DescriptionRegex Test Syntax
    10 digit number(\b([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])\b)
    12 digit number(\b([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])\b)
    16 digit number(\b([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])\b)
    IP address(\B\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\B)
    Email address(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b)
    MM/DD/YYYY MM/DD/YY date((0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])[-/.]([1-2][0-9][0-9][0-9]|[0-9][0-9]))
    DD/MM/YYYY DD/MM/YY date((0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]([1-2][0-9][0-9][0-9]|[0-9][0-9]))
    YYYY/MM/DD date(([1-2][0-9][0-9][0-9])[-/.](0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01]))
    HH24:MM:SS time((0[1-9]|[1][0-9]|2[0-4])[:]([0-5][0-9])[:]([0-5][0-9]))

     

     

    Notes to consider when adapting it for your requirement

     

    1. Load your unstructured data in Qlik Sense. Ensure you have a key for each row (I have used a ROWNO field) as you will want to associate your results back to the original record.
    2. Select the RegEx tests that you want to run on the data. Use the website links above to investigate and test your syntax. Load these into your Qlik Sense app from an external data source (i.e. excel). NOTE: you will not be able to use a comma separated file as commas are used in RegEx syntax.
    3. Ensure you have the URL encode 'sub' in your script.  see Sentiment and Text Analytics Connector ‒ Qlik Connectors for an example. You will need to ensure that all special characters in your unstructured text and in your RegEx syntax are included in the encoding. In my example i have included a range of arabic script encoding as I work in Dubai and need to deal with this at times. Here's a list of common characters (and letters with accents) HTML URL Encoding Reference.
    4. Setup your loop script to run each record at a time through the RegEx text analytics connector. The URL encoding is required as the text, and the RegEx test, are included in the URL.

     

     

    Check out the example attached.

    pic6.PNG