On-Call (uh-oh) During Daylight Savings Time
What is Daylight Savings Time?
As the saying goes, we “spring forward, and fall back”, at least if you’re in a part of the world that observes daylight savings time (DST). In the spring, on the second Sunday of March, at 2:00 AM, we turn our clocks forward one hour and it becomes 3:00 AM. In the fall, on the first Sunday of November, at 2:00 AM, we turn our clocks backward one hour and it becomes 1:00 AM. We essentially have a 23 hour day in March and a 25 hour day in November. While writing this I was surprised to find out that we actually observed a different schedule prior to 2007, thanks to the Energy Policy Act of 2005. I can only imagine all of the technologies that had to adapt to account for the change, though maybe not too bad in 2007, what a pain it would be today.
Processing Failures on March 8th, 2020
I work with large amounts of data on a daily basis, ensuring that we process that data in a time-sensitive manner. We have a weekly on-call rotation and I was nearing the end of my on-call week. If a process fails I will investigate, troubleshoot, and get processing back up and running. Usually, failures are not all that exciting or interesting, but on March 8th, 2020 around 1:00 AM PST, I was notified there had been two processing failures, which just so happened to be the day that we changed our clocks for DST.
The two processes that failed were both SQL Server Agent jobs. Job A was scheduled to run at 2:00 AM CST, and Job B was scheduled to run at 2:45 AM CST. Both had failed while being executed from server 1, logging a start time of about 3:00 AM CDT. Job A was expected to pull six files from server 2 into directory A on server 1. Job A pulled one file from server 2, the file that was pulled had around 10% of the expected contents. Job B was expected to pull fifteen files from server 2 into directory B on server 1. Job B pulled six files from server 2, not only was it missing files it also pulled the wrong files. Job C and Job D had succeeded on Server 3.
How does SQL Server Agent handle DST?
On the second Sunday in March, when the clock ticks forward one second from 1:59:59 AM, it springs forward to 3:00:00 AM. We skip an hour, losing the time spanning from 2:00 AM to 3:00 AM. SQL Server Agent seems to handle this by then kicking off any job that was scheduled during the “lost” hour to avoid skipping the scheduled job. I’ve seen mixed reports of how SQL Server Agent handles DST online, however, this is what I observed.
On the first Sunday of November, when the clock ticks forward one second from 1:59:59 AM, it falls back to 1:00:00 AM. We gain an hour. SQL Server Agent will pause for an hour (i.e. no jobs will be started) and only resume once it gets to 2:00 AM. I haven’t actually encountered this scenario myself, so I’m basing this off of what I’ve read online.
Keep in mind that these time changes are based on the time of the local machine. In fact, server 3 did not spring forward until sometime between 2:02 AM and 2:05 AM, not nearly as precise as server 1. Unfortunately I don’t manage server 1, so I can’t look further into why there was a delay, possibly a DBA did this is by design I’m not sure.
So what actually happened?
As the clock ticked forward on server 1 at 1:59:59 AM it rolled over to 3:00 AM, and SQL Server Agent recognized two scheduled jobs had been skipped and kicked them both off. Normally Job A and Job B would be separated by 45 minutes but due to the time change they both started at the same time. This was the root of the problem, more specifically the fact that Job B was kicked off at 3:00 AM.
Job A kicked off at 3:00 AM and began pulling files from server 2, it was part way through copying 1 out of 6 files currently on server 2 before having a connection issue. The cause of this was that Job B was also trying to connect to server 2 via SSH at the same time. It turned out server 2 was setup to only allow a user a single sign on instance resulting in Job A’s connection being dropped. Job B connected and began pulling files from server 2, it pulled 6 files from server 2. Normally Job B would grab 15 files but only 6 were present, and they were the wrong files. Job B failed trying to process a file that it was intended to process. Job B had actually grabbed the 6 files intended for Job A, and Job B’s files had not been pushed to server 2 yet by Job D yet. See, Job C ran at 1 AM so files were ready for Job A to pick up, but Job A was unfortunately interrupted. Job D which generates the files for Job B started at 2 AM (or 3 AM) like expected. But because Job B was scheduled to pull those files at 2:45 AM, in other words Job B needs to wait 45 minutes from when Job D starts, and the daylight savings time change skipped over the scheduled time, SQL Server Agent then defaulted to kick it off at 3AM which meant Job B was trying to pull files while Job D was trying to deposit them, effectively jumping the gun. Had Job B started at 3:45 AM things would have been just fine.
What could have been done differently?
There are a few different options to better handle this going forward, each with their own trade-offs that should be considered depending on your use case.
- Avoiding scheduling reoccurring jobs between 1 and 3 am if possible
- Disable jobs on daylight savings and manually kick them off
- Create a schedule for jobs that is specific to the day DST occurs
Separate from DST considerations, two other improvements to consider.
- Do we really need to limit ssh to 1 connection per user for that server
- File drop off location on server 2 should be more specific (a subfolder for Job C and Job D respectively) rather than sharing the same location to ensure no intermingling of files