I have a bucket with multiple Parquet files. I need to get the unique ids from all of the files within a folder, and I need that result in Talend so I can loop over it in the next steps.
In a terminal, using Spark, I can run the statements below and get the ids that I need. Is there a way to run them in a tSystem or some other component that will return the list of ids from the DataFrame?
df = spark.read.parquet("s3a://talend/bronze/books/").select("bookId").distinct()
df.show(truncate=False)
Hello
You can run PySpark commands in a tSystem component just as you did in the terminal. Refer to these topics to learn how to execute a Python script file using the tSystem component.
Before running the PySpark commands, you must have a Spark environment set up.
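As a rough illustration, the script called from tSystem could print one id per line on standard output, so the job can read the values back instead of parsing the table that show() prints. This is only a sketch under assumptions: the file name unique_ids.py, the helper format_ids, and passing the folder path as a command-line argument are all hypothetical choices, and it assumes the id list is small enough to collect on the driver.

```python
import sys


def format_ids(values):
    """Return the distinct non-null ids as strings, sorted as text."""
    return sorted({str(v) for v in values if v is not None})


def main(path):
    # Imported lazily so format_ids can be used without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unique-book-ids").getOrCreate()
    try:
        # Same query as in the question; collect() brings the ids to the driver.
        rows = spark.read.parquet(path).select("bookId").distinct().collect()
        for book_id in format_ids(row["bookId"] for row in rows):
            print(book_id)  # one id per line on stdout
    finally:
        spark.stop()


# Hypothetical invocation: python unique_ids.py s3a://talend/bronze/books/
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

With the default Spark logging setup, log messages usually go to standard error, so standard output should contain only the ids. How that output is then split into rows and iterated on the Talend side depends on how the tSystem component and the following components are configured in your job.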
Regards
Shicong