Xinhui
Contributor

Is there a Talend function to extract a string pattern from a txt file?

Dear all,

Thanks for your help. I am a beginner and I would like to extract all words starting with "GGD" from a txt document. Of course, I could do this in plain Java, but is there a Talend function that would do it easily?

Best,

Xinhui

5 Replies
Anonymous
Not applicable

Hi

You can read the content line by line, or read the whole content as a single string, and then extract the data using a regex. Can you show us an example of the txt document so that we can help you further?
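To make the "whole content as a string, then regex" idea concrete, here is a minimal sketch in plain Java, outside of any Talend component. The class and method names (PrefixWordFinder, findWords) and the sample text are made up for illustration; in a real job the text would come from the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend: collects words that start with a prefix.
class PrefixWordFinder {

    // Returns every word in the text that starts with the given prefix,
    // in order of appearance (duplicates are kept).
    static List<String> findWords(String text, String prefix) {
        // \b requires a word boundary before the prefix;
        // Pattern.quote makes the prefix match literally.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(prefix) + "\\w*");
        Matcher m = p.matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Invented sample content standing in for the file.
        String text = "alpha GGD123 beta\nGGDxyz gamma XGGD9";
        System.out.println(findWords(text, "GGD")); // prints [GGD123, GGDxyz]
    }
}
```

"XGGD9" is skipped because the prefix does not sit at the start of a word there.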

 

Regards

Shong

Xinhui
Contributor
Author

Thanks! I just read your message. I have a file like the following. I would like to extract every GSE number, such as GSE160804 in "Accession: GSE160804". I used tFileInputDelimited with ":" as the field separator, then tFilterRow in advanced mode with input_row.columnName1.startsWith("Series"). Unfortunately, it does not work. Could you help me?

----------------------------

 

1. Integrated analysis of DNA methylation and gene expression profiles identified S100A9 as a potential biomarker in ulcerative colitis

(Submitter supplied) In this research, 90 differential expression mRNAs (DEMs), 72 differential expression lncRNAs (DELs) and biological functions and pathway were identified in ulcerative colitisby (UC) integrated analysis. Potential therapeutic target for treatment was preliminary verified by qRT-PCR experiment and bioinformatics analysis.

Organism: Homo sapiens

Type: Expression profiling by array; Non-coding RNA profiling by array

Platform: GPL20115 6 Samples

FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nnn/GSE160804/

Series Accession: GSE160804 ID: 200160804

 

2. Induced organoids derived from patients with ulcerative colitis recapitulate the colitic reactivity

(Submitter supplied) We report the application of single nucleus RNA-seq technology for transcriptomic-wide profiling of the induced organoids derived from patients with ulcerative colitis (iHUCO) and normal induced organoids derived from the healthy colon (iHNO), along with their parental fibroblasts. By comapring the nucleus profiles of both iHUCOs and their parental fibroblasts (UC FBs) to iHNOs and normal fibroblasts (NL FBs), we found unique signatures exclusive to the UC samples but not the controls. more...

Organism: Homo sapiens

Type: Expression profiling by high throughput sequencing

Platform: GPL24676 11 Samples

FTP download: GEO (MTX, TSV) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE152nnn/GSE152999/

SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA641142

Series Accession: GSE152999 ID: 200152999

 

gjeremy1617088143

Hi, if your GSE numbers always have 6 digits, you can use this regex: "(GSE\\d{6})" in a tFileInputRegex, followed by a tUniqRow to avoid duplicate values.
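For anyone who wants to try the pattern before wiring up components, here is a rough plain-Java sketch of the same idea. It is not how tFileInputRegex or tUniqRow work internally; a LinkedHashSet simply stands in for the de-duplication step, and the class name GseNumbers is made up.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend.
class GseNumbers {

    // Same pattern as above: "GSE" followed by exactly six digits.
    private static final Pattern GSE = Pattern.compile("(GSE\\d{6})");

    // Returns each distinct GSE number once, in order of first appearance.
    static Set<String> extract(String text) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = GSE.matcher(text);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Lines copied from the sample file posted in this thread.
        String text =
            "FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nnn/GSE160804/\n"
            + "Series Accession: GSE160804 ID: 200160804\n"
            + "Series Accession: GSE152999 ID: 200152999";
        System.out.println(extract(text)); // prints [GSE160804, GSE152999]
    }
}
```

"GSE160nnn" in the FTP path is not matched because "nnn" are not digits. If the number of digits can vary, replacing \\d{6} with \\d+ is the usual adjustment.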


Xinhui
Contributor
Author

Thanks, that is really helpful! Unfortunately, the length is not fixed, and I would also like to extract information such as "homo sapiens" and "119 samples". More generally, I would like to extract the information that comes after each ":". Do you have any suggestions?

Thanks and best, Xinhui!

Anonymous
Not applicable

Use tFileInputFullRow to read the text file line by line, and then keep only the lines starting with a fixed string such as "Series Accession" using tFilterRow (in advanced mode):

input_row.line.startsWith("Series Accession")

After filtering, you are left with a line such as:

Series Accession: GSE152999 ID: 200152999

You only need to write a little Java code to extract the data you need.
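As a rough idea of what that Java could look like, here is a minimal standalone sketch. The class and method names (SeriesLineParser, parseSeries, valueAfterColon) are made up, and the line layout "Series Accession: <accession> ID: <digits>" is assumed from the sample file. The same logic could probably be adapted to a tJavaRow or a routine, but that wiring is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend.
class SeriesLineParser {

    // Assumed layout: "Series Accession: <accession> ID: <digits>".
    // \S+ accepts an accession of any length, so the digit count need not be fixed.
    private static final Pattern SERIES =
        Pattern.compile("^Series Accession:\\s*(\\S+)\\s+ID:\\s*(\\d+)");

    // Returns {accession, id}, or null if the line does not have the assumed layout.
    static String[] parseSeries(String line) {
        Matcher m = SERIES.matcher(line.trim());
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    // Returns the trimmed text after the first ':' in the line, or null if there is none.
    static String valueAfterColon(String line) {
        int i = line.indexOf(':');
        return i < 0 ? null : line.substring(i + 1).trim();
    }

    public static void main(String[] args) {
        String[] parts = parseSeries("Series Accession: GSE152999 ID: 200152999");
        System.out.println(parts[0] + " / " + parts[1]);               // prints GSE152999 / 200152999
        System.out.println(valueAfterColon("Organism: Homo sapiens")); // prints Homo sapiens
    }
}
```

valueAfterColon splits on the first ":" only, so a line such as "FTP download: GEO (TXT) ftp://..." keeps its URL intact in the value.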

 

Regards

Shong