When deferring GKEStartPodOperator (a minimal sketch of how the task is defined is included at the end of this report), the pod sometimes finishes successfully but Airflow still marks the task as a failure. When this happens, two things typically occur:
1. Right after the deferral starts, a ConnectTimeoutError is logged:
[2025-02-01, 00:21:49 UTC] {credentials_provider.py:402} INFO - Getting connection using `google.auth.default()` since no explicit credentials are provided.
[2025-02-01, 00:21:49 UTC] {taskinstance.py:288} INFO - Pausing task as DEFERRED. dag_id=price_maps_gke_FR_PROD, task_id=run_price_maps_step1_fr_appt, run_id=scheduled__2025-01-01T00:15:00+00:00, execution_date=20250101T001500, start_date=20250201T002148
[2025-02-01, 00:21:49 UTC] {taskinstance.py:340} ▶ Post task execution logs
[2025-02-02, 00:26:40 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-02-02, 00:26:41 UTC] {connection.py:277} WARNING - Connection schemes (type: google_cloud_platform) shall not contain '_' according to RFC3986.
[2025-02-02, 00:26:41 UTC] {base.py:84} INFO - Retrieving connection 'google_cloud_default'
[2025-02-02, 00:26:41 UTC] {credentials_provider.py:402} INFO - Getting connection using `google.auth.default()` since no explicit credentials are provided.
[2025-02-02, 00:28:51 UTC] {connectionpool.py:868} WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7a0ae6da1e50>, 'Connection to <redacted IP address> timed out. (connect timeout=None)')': /api/v1/namespaces/default/pods/price-maps-step1-fr-appt-952mvt6q
[2025-02-02, 00:29:55 UTC] {pod.py:834} INFO - [base] logs: <Pod logs start streaming correctly>
2. Once the pod has finished and all logs have streamed correctly, a traceback for an aiohttp.client_exceptions.ClientConnectorError targeting the same IP is printed:
[2025-02-02 01:32:14.118085+00:00] {pod.py:834} INFO - [base] logs: <Container logs reporting success of the task>
[2025-02-02 01:47:04.545146+00:00] {pod_manager.py:603} INFO - Pod price-maps-step1-fr-appt-952mvt6q has phase Running
[2025-02-02 01:47:06.578366+00:00] {pod.py:966} INFO - Deleting pod: price-maps-step1-fr-appt-952mvt6q
[2025-02-02 01:47:06.736706+00:00] {taskinstance.py:3312} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.11/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
return ExecutionCallableRunner(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 1824, in resume_execution
return execute_callable(context)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/providers/google/cloud/operators/kubernetes_engine.py", line 809, in execute_complete
return super().trigger_reentry(context, event)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 759, in trigger_reentry
raise AirflowException(message)
airflow.exceptions.AirflowException: Traceback (most recent call last):
File "/opt/python3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1109, in _wrap_create_connection
sock = await aiohappyeyeballs.start_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/aiohappyeyeballs/impl.py", line 104, in start_connection
raise first_exception
File "/opt/python3.11/lib/python3.11/site-packages/aiohappyeyeballs/impl.py", line 82, in start_connection
sock = await _connect_sock(
^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/aiohappyeyeballs/impl.py", line 174, in _connect_sock
await loop.sock_connect(sock, address)
File "/opt/python3.11/lib/python3.11/asyncio/selector_events.py", line 638, in sock_connect
return await fut
^^^^^^^^^
File "/opt/python3.11/lib/python3.11/asyncio/selector_events.py", line 678, in _sock_connect_cb
raise OSError(err, f'Connect call failed {address}')
TimeoutError: [Errno 110] Connect call failed ('<Same IP as previous error>', 443)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/python3.11/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 162, in run
event = await self._wait_for_container_completion()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 226, in _wait_for_container_completion
pod = await self.hook.get_pod(self.pod_name, self.pod_namespace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 757, in get_pod
pod: V1Pod = await v1_api.read_namespaced_pod(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
response_data = await self.request(
^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 198, in GET
return (await self.request("GET", url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 182, in request
r = await self.pool_manager.request(**args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/aiohttp/client.py", line 663, in _request
conn = await self._connector.connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 538, in connect
proto = await self._create_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1050, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1384, in _create_direct_connection
raise last_exc
File "/opt/python3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1353, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1124, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host <Same IP as previous error>:443 ssl:default [Connect call failed ('<Same IP as previous error>', 443)]
[2025-02-02 01:47:06.746442+00:00] {taskinstance.py:1226} INFO - Marking task as FAILED. dag_id=price_maps_gke_FR_PROD, task_id=run_price_maps_step1_fr_appt, run_id=scheduled__2025-01-01T00:15:00+00:00, execution_date=20250101T001500, start_date=20250201T002148, end_date=20250202T014706
[2025-02-02 01:47:06.746828+00:00] {taskinstance.py:1564} INFO - Executing callback at index 0: slack_failure_alert
[2025-02-02 01:47:07.359025+00:00] {taskinstance.py:340} ▶ Post task execution logs
This is problematic because downstream tasks also fail as a result, and I can't simply ignore the error either, since those tasks depend on objects created by this pod.
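For context, here is a minimal sketch of how the task is defined. Only the dag_id, task_id, pod name prefix, and namespace come from the logs above; the project, location, cluster, image, and schedule values are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKEStartPodOperator,
)

with DAG(
    dag_id="price_maps_gke_FR_PROD",
    start_date=datetime(2025, 1, 1),
    schedule="15 0 * * *",  # placeholder schedule
    catchup=False,
) as dag:
    run_price_maps_step1_fr_appt = GKEStartPodOperator(
        task_id="run_price_maps_step1_fr_appt",
        project_id="my-gcp-project",        # placeholder
        location="europe-west1",            # placeholder
        cluster_name="price-maps-cluster",  # placeholder
        name="price-maps-step1-fr-appt",    # pod name prefix seen in the logs
        namespace="default",                # namespace seen in the failing API call
        image="my-registry/price-maps-step1:latest",  # placeholder
        get_logs=True,
        deferrable=True,  # the task is deferred to the triggerer, where the errors above occur
    )
```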