DataSource Problem

David_Lan · ‎2024-09-30

Hello,

We have a problem with our datasource.

Every day the server which contain the datasource (postgres database) is rebooted.

And every day all jobs linked to the datasource doesn't work. Wa have the following error (FATAL: terminating connection due to administrator command).

I think our job (route component) is keeping using the datasource (which is broken).

Can you help ?

See our xml file for the Datasource Configuration.

Anonymous · ‎2024-09-30

Hello @David_Lan , I've been able to reproduce your isse with the pgql command line client, and the behavior is somewhat curious to me.

To reproduce the issue:

1) Login into a remote PGSQL server with a client throught TCP.

2) Get the client session PID with "SELECT * FROM pg_stat_activity WHERE usename = 'user' and client_addr = '192.168.1.74';".

 datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query_id | query | backend_type
-------+---------+------+------------+----------+---------+------------------+--------------+-----------------+-------------+-------------------------------+------------+-------------+-------------------------------+-----------------+------------+-------+-------------+--------------+----------+-------+----------------
16388 | user | 4113 | | 16389 | user | psql | 192.168.1.74 | | 42420 | 2024-09-30 21:28:09.056584+00 | | | 2024-09-30 21:28:09.071925+00 | Client | ClientRead | idle | | | | | client backend
(1 row)

3) kill the the client process (kill -15 16389)

Now, connections are in the following state:

Client (192.168.1.74):

tcp CLOSE-WAIT 164 0 192.168.1.74:42420 192.168.1.73:5432 users:(("psql",pid=3849,fd=3)) timer:(keepalive,113min,0)

Server (192.168.1.73):

tcp FIN-WAIT-2 0 0 192.168.1.73:5432 192.168.1.74:42420 timer:(timewait,19sec,0)

When the timer exceed, the server finally close the connection from his end (OS settings).

But the Client (192.168.1.74) is still in "CLOSE-WAIT" state.

Now, the first command sent from the client in the closed socket will thrown the below error:

user=> \l
FATAL:  terminating connection due to administrator command

What's curios is what's happen at TCP level.

The Client (192.168.1.74) tries to send data to the Server (192.168.1.73) throguth the socket he thinks is still open and the Server responds with RSTs:

21:44:49.380009 IP (tos 0x0, ttl 64, id 38012, offset 0, flags [DF], proto TCP (6), length 542)
192.168.1.74.42420 > 192.168.1.73.5432: Flags [P.], cksum 0x85f4 (incorrect -> 0x89c0), seq 3828743076:3828743566, ack 2994375439, win 487, options [nop,nop,TS val 1756669886 ecr 1199618454], length 490

21:44:49.380010 IP (tos 0x0, ttl 64, id 38013, offset 0, flags [DF], proto TCP (6), length 76)
192.168.1.74.42420 > 192.168.1.73.5432: Flags [P.], cksum 0x8422 (incorrect -> 0xc54b), seq 490:514, ack 1, win 487, options [nop,nop,TS val 1756669886 ecr 1199618454], length 24

21:44:49.380049 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
192.168.1.73.5432 > 192.168.1.74.42420: Flags [R], cksum 0x3386 (correct), seq 2994375439, win 0, length 0

21:44:49.380061 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
192.168.1.73.5432 > 192.168.1.74.42420: Flags [R], cksum 0x3386 (correct), seq 2994375439, win 0, length 0

21:44:49.380208 IP (tos 0x0, ttl 64, id 38014, offset 0, flags [DF], proto TCP (6), length 52)
192.168.1.74.42420 > 192.168.1.73.5432: Flags [F.], cksum 0x840a (incorrect -> 0x0513), seq 514, ack 1, win 487, options [nop,nop,TS val 1756669887 ecr 1199618454], length 0

21:44:49.380222 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
192.168.1.73.5432 > 192.168.1.74.42420: Flags [R], cksum 0x3386 (correct), seq 2994375439, win 0, length 0

When the client understands that the connection is closed (no ACK for data) it create a new connection with the three way handshake:

21:44:49.380415 IP (tos 0x0, ttl 64, id 51247, offset 0, flags [DF], proto TCP (6), length 60)
192.168.1.74.53684 > 192.168.1.73.5432: Flags [S], cksum 0x8412 (incorrect -> 0x3044), seq 1365069865, win 64240, options [mss 1460,sackOK,TS val 1756669887 ecr 0,nop,wscale 7], length 0

21:44:49.380439 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
192.168.1.73.5432 > 192.168.1.74.53684: Flags [S.], cksum 0x8412 (incorrect -> 0xaa4a), seq 1504104132, ack 1365069866, win 65160, options [mss 1460,sackOK,TS val 1200231003 ecr 1756669887,nop,wscale 7], length 0

21:44:49.380686 IP (tos 0x0, ttl 64, id 51248, offset 0, flags [DF], proto TCP (6), length 52)
192.168.1.74.53684 > 192.168.1.73.5432: Flags [.], cksum 0x840a (incorrect -> 0xd5a9), ack 1, win 502, options [nop,nop,TS val 1756669887 ecr 1200231003], length 0

At this point, socket is idle again.

Client (192.168.1.74):

tcp ESTAB 0 0 192.168.1.74:53684 192.168.1.73:5432 users:(("psql",pid=3849,fd=3)) timer:(keepalive,107min,0)

Server (192.168.1.73):

tcp ESTAB 0 0 192.168.1.73:5432 192.168.1.74:53684 users:(("postgres",pid=6177,fd=10)) timer:(keepalive,105min,0)

I believe in this case should be the client to implement some kind of keepalived.

Take the example of libpq: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-KEEPALIVES and also psqlODBC: https://odbc.postgresql.org/docs/config-opt.html

What component are you using to connect to PGSQL?

My suggestion is to reach out the Talend's support as they may require implementations or they already have a solution - I checked the https://help.qlik.com/talend/en-US/components/8.0/ but still a newbie and not able.

Hope it helps!

Anonymous · ‎2024-09-30

Hello @David_Lan I've been writing for almost an hour and the forum didn't post my comment. I'm a bit frustrated; anyway, I'm pushing here a resume and whishig is enough to understand.

I was hable to reproduce your issue with a standard command line tool (psql for Linux) and by kiling the child created by PGSQL in the server.

After some tests and researchs, I belive the problem is related to the client TCP keepalive - which I believe is not sent in your case.

When the connection is closed from the server, the client stays in CLOSE-WAIT and when a further command are sent it tries to send acknowledge data packets over a closed socket.

Then, when the server responds that the connection is closed (RST), the client restart a new connection (three way handshake) and works as normal.

I don't know what connector are you using (?) but my suggestion is to reach out the Talend support as they may already have a solution or they may require further implementations - I reviewed the documention https://help.qlik.com/talend/en-US/components/8.0/ but still a newbie.

Some examples of clients that implement keepalive are:

- ODBC: https://odbc.postgresql.org/docs/config-opt.html

- libpq: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-KEEPALIVES

Hope it helps

Regars,

Anonymous · ‎2024-09-30

Tomorrow I will try to re-organize and re-upload the dumps I collected.

v7.x