26 Replies Latest reply: Dec 5, 2013 10:58 AM by John Lifter

    How to get filename and date of last change from read file?

    Florian Spichal

      Hi,

       

      is it possible to get the current filename and/or some file_parameters such as "last change" in my transformation operator?

       

      Thanks!

        • Re: How to get filename and date of last change from read file?
          Hugo Sheng

          You'll need to work out the correct DOS dir command parameters (or use the Linux ls equivalent). The sample function below returns the list of files; you'll also need to parse the file names apart from the dates.

           

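          -- List every .txt file under directory, e.g. scandir("C:\\MyDir\\Data").
          -- dir flags: /b = bare format (full paths only), /s = include subdirectories.
          function scandir(directory)
              local i, t = 0, {}
              for fileinfo in io.popen('dir /b /s "' .. directory .. '\\*.txt"'):lines() do
                  i = i + 1
                  t[i] = fileinfo
              end
              return t
          end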

            • Re: How to get filename and date of last change from read file?
              Hugo Sheng

              I found that the following will list file names and timestamps:

               

              for %a in (c:\temp\*.*) do @echo %~nxa %~ta

               

              I couldn't figure out a dir command sequence that would produce the same output.
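If it helps, here is a minimal sketch in plain Lua of how each line printed by that command could be split back into a name and a timestamp. The function name is made up for illustration, and it assumes the timestamp starts with a numeric date such as 12/02/2013 or 02.12.2013; the actual format of %~ta depends on the regional settings of the machine, so treat the pattern as a starting point only.

```lua
-- Split one "<name> <timestamp>" line into its two parts.
-- Assumption: the timestamp begins with a numeric date whose parts are
-- separated by "/", "." or "-" and is followed by a time.
-- parse_name_and_timestamp is a hypothetical helper, not an Expressor function.
local function parse_name_and_timestamp(line)
  -- (.-) is lazy, so the name stops at the first spot where a date follows.
  local name, stamp = string.match(line, "^(.-)%s+(%d+[/.-]%d+[/.-]%d+%s+.+)$")
  if not name or name == "" then
    return nil  -- the line does not look like "<name> <timestamp>"
  end
  return name, stamp
end

-- Example use with lines as they might come back from io.popen(...):lines()
for _, line in ipairs({ "report.csv 12/02/2013 03:28 PM", "my file.csv 02.12.2013 15:28" }) do
  local name, stamp = parse_name_and_timestamp(line)
  print(name, stamp)
end
```

Because the name is matched lazily, file names that contain spaces stay intact as long as they do not themselves contain something that looks like a date followed by a space.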

              • Re: How to get filename and date of last change from read file?
                Florian Spichal

                Hi Hugo, thanks for your reply. That is not exactly what I am looking for. I have a Read File operator in Expressor and a Transform operator. In the Transform operator I want to use the name and date of the file currently being read. Is that possible?

                  • Re: How to get filename and date of last change from read file?
                    Hugo Sheng

                    There's no real way (or none I can think of) to retrieve that information from within the Transform operator when using a Read File operator upstream.

                     

                    Instead, if you use a Read Custom operator, you could then use the store_string() and store_datetime() functions to write out a persistent value containing the filename and timestamp.  Next you can retrieve it using retrieve_string() and retrieve_timestamp() from within the Transform operator.  This means that you would need to specify the file you are reading in within the Read Custom and then use a similar approach as shown above to grab the timestamp.

                      • Re: How to get filename and date of last change from read file?
                        Florian Spichal

                        Hi,

                         

                        Combining it with Hugo's snippet, I found a way to get the information for all my *.csv files in one directory.

                        That helps a lot.

                         

                        Could someone now explain how I can iterate through these files to read and use each of them, step by step, in a dataflow? For example, I now have the list of all relevant files, and I want to read each one and perform some transformations on it (the same for every file).

                         

                        Thank you!

                          • Re: How to get filename and date of last change from read file?
                            Hugo Sheng

                            Are you using the Read Custom operator to load the list of csv files into a table (a Lua table, not a database table)?  You can do that in the initialize function of the Read Custom operator.  Then, in the read function below that, you can iterate through each file and read it in.

                             

                            You'll read in a record at a time and then split it into individual columns.   Downstream from that you can use the appropriate operators to transform your data accordingly.

                             

                             

                              • Re: How to get filename and date of last change from read file?
                                Florian Spichal

                                Yes, I am using the Read Custom operator. Below you can see my example. I am wondering why I cannot use my code in the initialize() function instead of the read() function. In my initialize() function it returns only one line :-/

                                  • Re: How to get filename and date of last change from read file?

                                    You can't emit a record from the initialize function.  The initialize function is invoked as the Read Custom operator is created, and may be invoked and complete processing before the downstream operator is ready to receive a record (as it may still be undergoing instantiation).

                                     

                                    The approach you have taken, which uses the iterative return in the read function, is the correct one.  It allows the Read Custom operator to emit a row for each file to be processed.
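To make the "iterative return" idea concrete, here is a small sketch in plain Lua rather than Datascript. It only illustrates the shape: a function that hands back one record per call and a completion signal once the list is exhausted. FLOW_COMPLETE is a stand-in invented for this sketch so it can run outside Expressor (the code later in this thread returns expressor.ScriptSupport.FlowComplete at that point), and make_read is likewise a hypothetical name.

```lua
-- Stand-in for the operator's end-of-data signal (hypothetical name).
local FLOW_COMPLETE = {}

-- Return a function that emits one record per call, one per file name.
local function make_read(file_names)
  local index = 0  -- how many names have been emitted so far
  return function()
    index = index + 1
    local name = file_names[index]
    if name == nil then
      return FLOW_COMPLETE  -- nothing left to emit
    end
    return { filename = name }
  end
end

-- Example use: keep calling until the completion signal comes back.
local read = make_read({ "a.csv", "b.csv" })
local record = read()
while record ~= FLOW_COMPLETE do
  print(record.filename)
  record = read()
end
```

The point of the closure is that the position in the list survives between calls, which is what lets each call emit exactly one row.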

                                      • Re: How to get filename and date of last change from read file?
                                        Florian Spichal

                                        OK, but how can I read my files now out of my custom read operator?

                                        I am thinking of using "io.open(filename)" or something like this in my read() function instead of the initialize function. Might that work?

                                          • Re: How to get filename and date of last change from read file?

                                            This thread now involves two questions:

                                            1. How to retrieve the name and last modified date for files in a directory and pass them on to downstream operators.
                                            2. How to process a collection of files that are located in the same directory.

                                            Although you have a workable approach to the first question, you might find this code, which uses the Windows dir command to return a listing of the contents of a directory, more straightforward.  The variables file and date will hold the name and last modified date for each file in the directory.  Note how the file name and modified date are saved in a datascript table named files.  Outside of the for loop, add code to iterate over this table, processing the information however you want.

                                             

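                                            files = { }
                                            directory = "C:\\junk\\"
                                            -- run the Windows dir command and read its listing line by line
                                            file_handle = io.popen(string.concatenate("dir ",directory))
                                            for line in file_handle:lines() do
                                              if not is.null(line) then
                                                -- skip subdirectory entries, which dir marks with <DIR>
                                                dir = string.match(line,".+%s+(<DIR>)%s+.+")
                                                if is.null(dir) then
                                                  -- entries start with a digit (the date); header and summary lines do not
                                                  if string.match(line,"^(%d)") then
                                                    -- last whitespace-delimited token is the file name
                                                    -- (a name containing spaces is cut down to its last word)
                                                    file = string.trim(string.match(line,"(%s%S-)$"))
                                                    -- leading run of digits, punctuation and whitespace, which begins with the date and time
                                                    date = string.trim(string.match(line,"^[%d%p%s]+"))
                                                    files[#files+1] = {file,date}
                                                  end
                                                end
                                              end
                                            end
                                            file_handle:close()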
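As a rough illustration of the "iterate over this table" step, the sketch below walks a table shaped like files (each entry a {file, date} pair) and keeps only the csv entries. It is plain Lua; csv_entries and the sample data are invented for the example, and the date is passed through as the string that dir printed, since its format depends on the machine's regional settings.

```lua
-- Keep only the .csv entries from a list of {file, date} pairs.
-- csv_entries is a hypothetical helper name used for this sketch.
local function csv_entries(files)
  local selected = {}
  for _, entry in ipairs(files) do
    local file, date = entry[1], entry[2]  -- positional, as stored by the loop
    if string.match(string.lower(file), "%.csv$") then
      selected[#selected + 1] = { name = file, modified = date }
    end
  end
  return selected
end

-- Made-up sample data in the same shape as the files table.
local sample = {
  { "a.csv",     "12/02/2013  03:28" },
  { "notes.txt", "12/02/2013  04:10" },
  { "B.CSV",     "12/03/2013  10:15" },
}
for _, item in ipairs(csv_entries(sample)) do
  print(item.name, item.modified)
end
```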

                                             

                                            The second question deals with a different scenario.  Now you want to process each record read from a collection of files in one continuous execution of the dataflow.  Whether you can do this from within Studio depends on the type of file. If the files to be processed are simple text files (csv, txt, etc.), this is doable from within Studio.  If the files are Excel files (.xls, .xlsx), you would not be able to do this from within Studio, as you would have difficulty reading and parsing the file content. Assuming text files, there are two operations.

                                            1. Read each line from each file
                                            2. Parse the line into its constituent fields

                                             

                                            Although both of these steps can be performed within the read function in the Read Custom operator, it would be easier to separate the operations, using the Read Custom operator to read each line as a single large string and parsing it into fields in a downstream Transform operator.

                                             

                                            The following figure illustrates the code you could put into the Read Custom operator; note that this code has no interest in retrieving the last modified date, so it uses the /b flag to dir to limit the return to just the file names.  The code uses the Windows type command to concatenate the contents of all the files into one continuous stream of records.

                                             

                                            [Attached image: code.png]

                                             

                                            One thing you should be aware of is whether the files have header rows or not.  If so, you will want to use either a Filter operator or the filter helper function in the Transform operator to drop those rows.

                                             

                                            How you parse each line into its constituent fields depends on whether the lines contain delimited or fixed width data and what the delimiter character or character string is.
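For what it is worth, here is a minimal plain-Lua sketch of both cases. The function names are hypothetical and none of this is Expressor API. The delimited version does a literal search for the delimiter and does not handle quoted fields that contain the delimiter; the fixed-width version assumes you know the column widths in advance.

```lua
-- Split a line on a literal delimiter string; empty fields are preserved.
local function split_delimited(line, delimiter)
  assert(delimiter ~= "", "delimiter must not be empty")
  local fields, start = {}, 1
  while true do
    -- the fourth argument (true) requests a plain search, not a pattern search
    local s, e = string.find(line, delimiter, start, true)
    if not s then
      fields[#fields + 1] = string.sub(line, start)  -- last field
      return fields
    end
    fields[#fields + 1] = string.sub(line, start, s - 1)
    start = e + 1
  end
end

-- Cut a line into consecutive fields of the given character widths.
local function split_fixed(line, widths)
  local fields, pos = {}, 1
  for _, width in ipairs(widths) do
    fields[#fields + 1] = string.sub(line, pos, pos + width - 1)
    pos = pos + width
  end
  return fields
end

-- Example use
print(table.concat(split_delimited("Ohio,Smith,Ann,IND", ","), " | "))
print(table.concat(split_fixed("AB  CDE12", { 4, 3, 2 }), " | "))
```

Using a plain search keeps delimiters such as "|" or "." from being interpreted as Lua pattern characters.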

                                              • Re: How to get filename and date of last change from read file?

                                                I would really like to pass the name of the file currently being read downstream, in the attribute I have named filename.  I have tried unsuccessfully to incorporate some of the code above into my own code. Can someone kindly tell me how I would write that here?

                                                 

                                                Thanks Traci

                                                 

                                                 

                                                  • Re: How to get filename and date of last change from read file?

                                                    This is a bit involved to do when using the Desktop version of Expressor as it is up to your code to keep track of which file is being read.

                                                     

                                                    With the code you now have, you are concatenating the names of all the files together and then using the Windows type command to read each line from each file into your application.  The problem is that the type command does not indicate which file is currently being read.

                                                     

                                                    To do what you want, you would need to change the code such that you have a loop that reads a different file for each iteration of the loop and when one file is completed, loops around and processes the next file.  However, the next issue is that the remainder of the dataflow may be processing records from multiple files simultaneously, for example, the last record from file 1 is being processed just ahead of the first record in file 2.  Therefore, your code must add the name of the file to each record as it is emitted from the Read Custom operator so that, if desired, you can separate the final records as they are emitted from the dataflow.

                                                     

                                                    In the initialize function, simply build up a table containing the names of the files to process.  Then set a file handle to read the first file.

                                                     

                                                    In the read function, read lines from the first file.  When that file has been fully processed, drop the file handle, create another file handle to the next file, and process it line by line.  Work through the table in this way.  You will be able to extract the name of the file from the table and add the filename to each record.  Note that if each file includes a header row, your code will need to skip over that row.  When all the files have been processed, your code can shut down the Read Custom operator.
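The steps described above could be sketched roughly as follows. This is plain Lua, not tested inside a Read Custom operator: make_reader and FLOW_COMPLETE are names invented for the sketch, the record layout (filename plus the raw line) is an assumption, and in a real operator the table of names would be built in initialize and the completion signal would be whatever the operator expects.

```lua
-- Stand-in for the operator's end-of-data signal (hypothetical name).
local FLOW_COMPLETE = {}

-- Build a reader over a list of file paths.
-- has_header: when true, the first line of every file is skipped.
local function make_reader(file_names, has_header)
  local index = 0     -- position in file_names of the file being read
  local handle = nil  -- handle for the current file; nil between files

  -- Open the next file that can be opened; false when the list is exhausted.
  local function open_next()
    while index < #file_names do
      index = index + 1
      handle = io.open(file_names[index], "r")
      if handle then
        if has_header then handle:read("*l") end  -- discard the header row
        return true
      end
      -- the file could not be opened: move on to the next name
    end
    return false
  end

  -- Each call returns one record, or FLOW_COMPLETE when everything is read.
  return function()
    while true do
      if not handle and not open_next() then
        return FLOW_COMPLETE
      end
      local line = handle:read("*l")
      if line then
        return { filename = file_names[index], line = line }
      end
      handle:close()  -- current file finished: drop the handle, loop to the next
      handle = nil
    end
  end
end
```

Two design choices worth noting: the handle is checked before every read, so the reader never indexes a nil file handle, and the file name is attached to each record at the moment it is emitted, so records from different files can still be told apart further down the dataflow.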

                                                      • Re: How to get filename and date of last change from read file?

                                                        John,

                                                        Right, that is what I want to do. Unfortunately, I am just starting to learn the extension SDK and this is a bit overwhelming to me.  I don't understand the flow of initialize and read, or the syntax, well enough to take your suggestion.

                                                        Is there an example anywhere of something similar?

                                                         

                                                        I understand the code to load a table with the file names, but not how to do the following:

                                                         

                                                        "Then set a file handle to read the first file.   In the read function, read lines from the first file.  When that file has been fully processed, drop the file handle and create another file handle to the next file and process it line by line.  Work through this table in this way.  You will be able to extract the name of the file from the table and add the filename to each record."


                                                        Would it be a for loop in my initialize to the read?  Can you help with how that would look?

                                                         

                                                          Thanks Traci

                                                        OK, I have something, but it returns the file names and not the data. I need both.

                                                         

                                                        require "expressor.ScriptSupport"

                                                        files = {}
                                                        directory = string.concatenate(_expCurrentOperatorParameters.Path ,"\\")
                                                        file_list = ""
                                                        file_handle = nil
                                                        fielddelimiter = ','
                                                        header = {"place","lastname","firstname","party"}

                                                        function initialize()
                                                           -- obtain the file name format property value
                                                           file_handle=io.popen("for %a in (c:\\data\\*.*) do @echo %~nxa; %~ta")
                                                           log.notice(file_list)
                                                           i=1
                                                           --file_handle = io.popen(string.concatenate("type ",file_list))
                                                        end

                                                        function read()
                                                           -- retrieve a line
                                                           line = file_handle:read("*l")
                                                           return function()
                                                              if line then
                                                                 log.notice(line)
                                                                 output = {}
                                                                 values = {}
                                                                 -- add field delimiter to line
                                                                 line = string.concatenate(line,fielddelimiter)
                                                                 -- parse line using comma as field delimiter
                                                                 pattern = string.concatenate("(.-)",fielddelimiter)
                                                                 for value in string.iterate(line,pattern) do
                                                                    values[#values+1] = value
                                                                 end
                                                                 -- initialize attributes in output
                                                                 for index,value in ipairs(header) do
                                                                    output[value] = values[index]
                                                                    --log.notice(output[value])
                                                                 end
                                                                 file = string.trim(string.match(line,"(%s%S-)$"));
                                                                 --output["filename"]=file;
                                                                 line=file_handle:read("*l")
                                                                 i=i+1
                                                                 -- return output record
                                                                 return output
                                                                 --    return expressor.ScriptSupport.OK, output
                                                              else
                                                                 -- terminate processing
                                                                 file_handle:close()
                                                                 return expressor.ScriptSupport.FlowComplete
                                                              end
                                                           end
                                                        end

                                                          • Re: How to get filename and date of last change from read file?

                                                            You want something like the following screenshot.

                                                            [Attached image: code.png]

                                                              • Re: How to get filename and date of last change from read file?

                                                                John,

                                                                  Thanks so much. I was excited to try this code, but it throws the following errors.

                                                                Any ideas why?  I wish I could understand the language better.

                                                                 

                                                                Thanks Traci

                                                                 

                                                                 

                                                                <step id="0" step_name="Step_1" process="5436" run="0" status="ok" start="2013-12-02T15:28:07">

                                                                Use Property & Connection 1 - READ_CUSTOM-0015-A: While processing record 188 the read function failed: 'unnamed chunk:34: attempt to index global 'file_handle' (a nil value)'. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - OPERATOR-0016-F: toolId 1.0, name 'Use Property & Connection 1' - Exception 'DatascriptException' occurred in the 'process' function for thread 0. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - LUA_HELPERS-0033-A: Error in Datascript while processing read: 'unnamed chunk:34: attempt to index global 'file_handle' (a nil value)

                                                                stack traceback:

                                                                    unnamed chunk:34: in function <unnamed chunk:31>'. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - OPERATOR-0005-F: toolId 1.0, name 'Use Property & Connection 1' - the 'process' function failed for thread 0. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - OPERATOR-0081-F: toolId 1.0, name 'Use Property & Connection 1' - Exception 'DatascriptException' occurred in the 'shutdown' function. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - LUA_HELPERS-0033-A: Error in Datascript while processing finalize: 'unnamed chunk:88: attempt to index global 'file_handle' (a nil value)

                                                                stack traceback:

                                                                    unnamed chunk:88: in function <unnamed chunk:87>'. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - OPERATOR-0007-F: toolId 1.0, name 'Use Property & Connection 1' - the 'shutdown' function failed. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - ETOOL-0006-F: operator 1.0, type in-datascript, name 'Use Property & Connection 1': etool failed, phase shutdown. (Dataflow1_Copy.Step_1)

                                                                Use Property & Connection 1 - ETOOL-0017-A: operator 1.0, type in-datascript - shutdown failed. (Dataflow1_Copy.Step_1)

                                      • Re: How to get filename and date of last change from read file?

                                        Nope, we only have the @mention in the corner and no Attach link. We tried in IE, Firefox and Chrome.