
Watch stops working after 4 days and receives duplicate events on delete pod #1233

Description

@Kruti-Joshi

Describe the bug
I'm implementing a Pod watcher using the C# Kubernetes client. https://github.com/kubernetes-client/csharp
This is our current implementation:

public async Task WatchPodsAsync()
{
    Log.Information($"Starting pod watcher");
    string lastResourceVersion = null;

    // The watcher connection resets if there are no changes for some time (approx. 20 min), so we need to fetch the list again every time this happens.
    do
    {
        // A regular list call is needed to fetch the current resource version of the pod list. When the connection is reset, this resource version is passed to the next watch call so that the watcher only gets events after it, instead of receiving all events from scratch.
        Log.Information("Refetching the list");
        var resetPodList = await client.CoreV1.ListNamespacedPodAsync(NamespaceToWatch, resourceVersion: lastResourceVersion);
        lastResourceVersion = resetPodList.ResourceVersion();
        var podList = client.CoreV1.ListNamespacedPodWithHttpMessagesAsync(NamespaceToWatch, resourceVersion: lastResourceVersion, watch: true);
        try
        {
            await foreach (var (type, item) in podList.WatchAsync<V1Pod, V1PodList>())
            {
                Log.Information($"Watcher detected event of type {type} for pod {item.Metadata.Name}. Status of the pod is {item.Status.Phase}.");
                // some action
            }
        }
        catch (Exception ex)
        {
            Log.Warning(ex, $"Exception occurred while watching pods. Establishing the connection again.");
        }
    }
    while (true);
}

We previously ran into the connection being closed due to inactivity, so we put the watcher in an infinite loop to re-establish the connection every time it is lost.
Now I'm seeing 2 major issues in the watcher:

  1. After 4 days, the watcher stopped receiving events.
    Earlier, the connection was being lost within an hour. After putting the implementation in an infinite loop, I can see that the connection is being re-established and I can see the log "Refetching the list", but suddenly after 4 days, no events were being received. On restarting the service, the watcher started receiving updates again and worked as expected.
    Is there something wrong in this implementation? Why would the connection be lost after 4 days, even inside an infinite loop?

  2. Every time the watcher receives a delete event, it first receives the pod's previous event again.
    For example, the pod below had already completed and the watcher had received the update. When I deleted the pod using kubectl delete pod, the watcher received the Modified event once again, just before the Deleted event.

    [2023-03-14 05:06:43.722 +00:00] [INF] Watcher detected event of type Modified for pod e0c7205c-e096-4ee0-b9c4-7b044e442970-krf8v. Status of the pod is Succeeded.
    [2023-03-14 05:08:09.478 +00:00] [INF] Watcher detected event of type Modified for pod e0c7205c-e096-4ee0-b9c4-7b044e442970-krf8v. Status of the pod is Succeeded.
    [2023-03-14 05:08:09.483 +00:00] [INF] Watcher detected event of type Deleted for pod e0c7205c-e096-4ee0-b9c4-7b044e442970-krf8v. Status of the pod is Succeeded.

Is this expected behaviour? We keep a count of how many runs were triggered and how many completed, and receiving the Modified 'Succeeded' event a second time throws off our count. Is there something wrong in the implementation that can be corrected so that this event is not received twice?
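
Whether or not the repeated event is expected, one possible way to keep the count stable would be to make the counting idempotent, so that a pod is only counted the first time it is seen in the Succeeded phase. The following is a minimal sketch of that idea, not a confirmed fix: `CompletionCounter` and `TryRecordCompletion` are hypothetical names that are not part of the Kubernetes client, and the sketch assumes each pod can be identified by a stable unique ID such as its UID.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Kubernetes C# client.
public sealed class CompletionCounter
{
    // UIDs of pods that have already been counted as completed.
    private readonly HashSet<string> _completedUids = new HashSet<string>();

    public int CompletedCount => _completedUids.Count;

    // Returns true only the first time a given pod UID is reported as Succeeded.
    public bool TryRecordCompletion(string podUid, string phase)
    {
        if (!string.Equals(phase, "Succeeded", StringComparison.Ordinal))
        {
            return false;
        }

        // HashSet.Add returns false when the UID is already present.
        return _completedUids.Add(podUid);
    }
}

public static class Program
{
    public static void Main()
    {
        var counter = new CompletionCounter();

        // Simulates the log above: the same pod reports Succeeded twice.
        counter.TryRecordCompletion("uid-1", "Succeeded"); // original Modified event
        counter.TryRecordCompletion("uid-1", "Succeeded"); // repeated Modified event before Deleted

        Console.WriteLine(counter.CompletedCount); // prints 1
    }
}
```

In the watcher loop, this would presumably be called with `item.Metadata.Uid` and `item.Status.Phase` in place of the hard-coded values. The set grows with every completed pod, so a long-running service would need some way to prune old entries.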

Kubernetes C# SDK Client Version
8.0.68

Server Kubernetes Version
1.24.6

Dotnet Runtime Version
net7

To Reproduce
Keep the watcher running for at least 5-6 days.
While the watcher is running, confirm that the event for the latest Completed pod has already been received. Then delete the pod using kubectl delete pod.. and check the logs again. Just before the delete event, the watcher receives the completed event once more.
After 4-5 days, check whether the watcher still receives events. The events suddenly stop coming, even though we have an infinite loop to reset the connection. Restart the service. Notice that the watcher starts working again and performs actions on the received events.

Expected behavior
The watcher should receive the completed event only once.
If the connection is lost, because of the infinite loop, the connection should be reset again and the watcher should continue working.

Where do you run your app with Kubernetes SDK (please complete the following information):

  • OS: Linux
  • Environment: Kubernetes container
  • Cloud: Azure
