Anonymous
Not applicable

Need to parse the log file with tFileInputRegex by ignoring the new line character as row separator

My input log file looks like this:

 

2017-05-09 10:18:52.743 INFO  (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.u.p.LogUpdateProcessorFactory [UIMATestCollection1]  webapp=/solr path=/update params={}{} 0 66
2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)

I am using the tFileInputRegex component.

 

The regex used to parse the file is:

 

"^"+
"([0-9]{4}\\-[0-9]{2}\\-[0-9]{2})"+" "+
"([0-9]{2}\\:[0-9]{2}\\:[0-9]{2}\\.[0-9]{3})"+" "+
"(.*?)"+" "+
"\\((.*)\\)"+" "+
"\\[(.*)\\]"+" "+
"(.*)"

I am getting only partial output, as shown below:

 

.----------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+-------------------------------------------.
|                                                                                               tLogRow_1                                                                                               |
|=---------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+------------------------------------------=|
|Date      |Time        |Log_Level|App_Thread      |Collection                                                                                              |Message                                    |
|=---------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+------------------------------------------=|
|2017-05-09|10:18:52.743|INFO     |qtp1543727556-22|   x:UIMATestCollection1] o.a.s.u.p.LogUpdateProcessorFactory [UIMATestCollection1                      | webapp=/solr path=/update params={}{} 0 66|
|2017-05-09|10:18:52.745|ERROR    |qtp1543727556-22|   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1|unknown field 'sentence'                   |
'----------+------------+---------+----------------+--------------------------------------------------------------------------------------------------------+-------------------------------------------'

[Screenshot: tFileInputRegex configuration]

But I want tFileInputRegex to ignore the row separator ("\n") when parsing the above input file, so that the stack-trace lines following the error line are included in the last column of that row. Please suggest a solution, if there is one.
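
For anyone who wants to reproduce this outside Talend, here is a minimal sketch that applies the same pattern to single lines with java.util.regex. It assumes the component tests each line on its own; the class and method names (LogLineRegexDemo, parse) are made up for illustration and are not part of Talend.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineRegexDemo {
    // Same pattern as in the question, concatenated into one Java string.
    static final Pattern LOG_LINE = Pattern.compile(
        "^"
        + "([0-9]{4}\\-[0-9]{2}\\-[0-9]{2})" + " "
        + "([0-9]{2}\\:[0-9]{2}\\:[0-9]{2}\\.[0-9]{3})" + " "
        + "(.*?)" + " "
        + "\\((.*)\\)" + " "
        + "\\[(.*)\\]" + " "
        + "(.*)");

    // Returns the six captured groups, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        String header = "2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] "
            + "o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'";
        String trace = "\tat org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)";

        String[] groups = parse(header);
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + (i + 1) + " = [" + groups[i] + "]");
        }
        // The stack-trace line starts with a tab, not a date, so the pattern rejects it.
        System.out.println("stack-trace line matches: " + (parse(trace) != null));
    }
}
```

The greedy `\\[(.*)\\]` group runs up to the last "] " on the line, which is why the Collection column ends at "[doc=1"; a lazy `(.*?)` in that position would stop at the first "]" instead.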

Accepted Solutions
Anonymous
Not applicable
Author

Hello 

tFileInputRegex reads the file line by line and applies the regex to each line. As a workaround, read the whole file content as a string, replace every occurrence of newline + "at" with a special character, and write the string to a temporary file before parsing it with the regex. After parsing, replace the special character with newline + "at" again if needed, for example:

tFileInputRaw--main--tJavaRow1--main--tFileOutputDelimited
   |
OnSubjobOk
   |
tFileInputRegex--main--tJavaRow2--main--tLogRow

 

tFileInputRegex: reads the new file generated by tFileOutputDelimited.

 

on tJavaRow1:

output_row.content = (input_row.content.toString()).replaceAll("\r\n at","@");

 

on tJavaRow2:

output_row.Date=input_row.Date;

//...other columns....

output_row.Message=input_row.Message.replaceAll("@","\r\n at");

 

Regards

Shong
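
The workaround described above can also be tried in plain Java before wiring it into the job. This is only a sketch under a few assumptions: the names FoldStackTrace, fold and unfold are hypothetical, the pattern accepts either "\r\n" or "\n" followed by spaces or tabs before "at" (a looser match than the tJavaRow code), and "@" is assumed not to occur anywhere else in the log.

```java
public class FoldStackTrace {
    // Placeholder for "newline + at"; assumed to be absent from the rest of the log.
    static final String MARK = "@";

    // What tJavaRow1 does: pull each stack-trace line up onto the preceding line.
    static String fold(String content) {
        return content.replaceAll("\\r?\\n[ \\t]+at ", MARK + " ");
    }

    // What tJavaRow2 does after parsing: turn the placeholder back into line breaks.
    static String unfold(String field) {
        return field.replace(MARK + " ", "\n\tat ");
    }

    public static void main(String[] args) {
        String content =
            "2017-05-09 10:18:52.743 INFO  (qtp1543727556-22) [   x:UIMATestCollection1] "
          + "o.a.s.u.p.LogUpdateProcessorFactory [UIMATestCollection1]  webapp=/solr path=/update params={}{} 0 66\n"
          + "2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] "
          + "o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'\n"
          + "\tat org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)\n"
          + "\tat org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)\n";

        // After folding, every remaining line starts with a date and can be parsed by the regex.
        String[] rows = fold(content).split("\\r?\\n");
        System.out.println("rows after folding: " + rows.length);
        System.out.println(rows[1]);

        // Restoring the line breaks is optional and only matters for display.
        System.out.println(unfold(rows[1]));
    }
}
```

If "@" can appear in real messages, a character that never occurs in the log would be a safer placeholder.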

Anonymous
Not applicable
Author

Thanks for your support, it really helped a lot.

I am working on it, but I got stuck on a small error.

 

tfileinputRaw--main--tJavaRow1--main--tFileOutputDelimited

 

This is my tJavaRow1:
output_row.content = (input_row.content.toString()).replaceAll("\n\tat","@");

 

Below is my input file

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

 

Output in tFileOutputDelimited is:

 

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

If I add a tJavaRow2 with replaceAll("\n@","@"), it does not work; I get the same output as above.

 

 

 

tLogRow output is 

 

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
[statistics] disconnected

Now I want to remove \n before @   in my output file.

 

My expected output is 

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

If I put the same multi-line string in Eclipse and use val b:String = a.replaceAll("\n@", "@"); in Scala, the output comes out on a single line.

Can you please suggest something for this?

Thanks in advance.

Anonymous
Not applicable
Author

Hi
This is my tJavaRow1:
output_row.content = (input_row.content.toString()).replaceAll("\r\n at","@");
It generates only one line in the output file, so it seems you are not using the same code in tJavaRow1.

Regards
Shong
Anonymous
Not applicable
Author

Thanks for the reply and support. I tried your tJavaRow code:

output_row.content = (input_row.content.toString()).replaceAll("\r\n at","@");

It does not change anything.

My input file has \n followed by \t and at, which may be the reason.

 

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

With your tJavaRow code I get the same output as the input after executing, as shown below:

 

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
	at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
	at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

So after trying yours, I changed it to:

 

output_row.content = (input_row.content.toString()).replaceAll("\n\tat","@"); 

 

which gives:

 

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

Now I want to get the above output on a single line.

For that I used the following in tJavaRow2:

output_row.content = (input_row.content.toString()).replaceAll("\n@","@");

 

But I still get the same output, unchanged; the \n is not removed:

 

2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'
@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
@ org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
@ org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
@ org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)

 

Here are the exported Talend job (archive file to import) and the input file.

Can you please check, if possible?

https://drive.google.com/open?id=0B-hwVI6s7kodd0dWWFUtVWZHRTg

https://drive.google.com/open?id=0B-hwVI6s7kodSlVSMXNKbmNYeDg

 

 

TRF
Champion II

Try changing "\n" to "\\n", as "\" is a special character in regex.

output_row.content = (input_row.content.toString()).replaceAll("\\n@","@");
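
A small stand-alone check of this replacement, outside Talend, might look like the sketch below. The class and method names (CollapseMarkerLines, collapse) are hypothetical, and the optional `\\r?` is an addition that is not in the suggested code; it is there only on the assumption that some files use Windows line endings.

```java
public class CollapseMarkerLines {
    // In Java source, "\\n" is the two characters \ and n; the regex engine
    // reads that pair as "match a newline". The optional \\r is an assumption
    // added here so that "\r\n" line endings are removed as well.
    static String collapse(String text) {
        return text.replaceAll("\\r?\\n@", "@");
    }

    public static void main(String[] args) {
        String text =
            "2017-05-09 10:18:52.745 ERROR (qtp1543727556-22) [   x:UIMATestCollection1] "
          + "o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=1] unknown field 'sentence'\n"
          + "@ org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)\n"
          + "@ org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)";

        // Every "newline + @" pair collapses to "@", leaving one long line.
        System.out.println(collapse(text));
    }
}
```

Note that this only has an effect when the string still contains the line breaks, so it needs to run on the whole file content rather than on rows that were already split per line.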
Anonymous
Not applicable
Author

Thanks a lot, it worked for me.

Thanks for the support.