Background
Our customer has recently experienced an issue during the installation of the latest cumulative updates in their SQL Server environment. Instances for one specific application ended in a crashed state. The update process was upgrading SQL Server 2017 from CU6 to CU9. The error was occurring during the phase when databases are upgraded via update scripts (which follows after the installation of binary files and SQL Server start).
Inside an errorlog we found that the patch process failed due to an error inside a sp_vupgrade_replication procedure:
It was strange, because a replication was not enabled on any affected instances/databases. We wanted to confirm our suspicion that only these user databases are causing troubles, so we put them into an offline state and rollbacked & reinstalled CU again. As long as they remained offline when the upgrade script was running, everything went smoothly! This means that we have successfully isolated the problem which we could debug.
Root cause analysis
To investigate this issue further we rolled back patches one more time and enabled trace flag 902 in startup options of SQL Server. This TF bypasses execution of database upgrade scripts and allows us to run them manually later. Before we did that, we had created an extended event session to investigate what could possibly go wrong:
CREATE EVENT SESSION [upgrade]
ON SERVER
ADD EVENT sqlserver.error_reported,
ADD EVENT sqlserver.errorlog_written,
ADD EVENT sqlserver.sp_statement_completed,
ADD EVENT sqlserver.sp_statement_starting
(WHERE (
[sqlserver].[like_i_sql_unicode_string]([sqlserver].[sql_text],
N’%MSreplication_subscriptions%’))),
ADD EVENT sqlserver.sql_batch_completed,
ADD EVENT sqlserver.sql_statement_completed
(SET collect_statement=(1))
ADD TARGET package0.event_file(SET filename=N’upgrade’,max_file_size=(5))
GO
The upgrade scripts failed even when run manually, which was not a surprise. Fortunately, this time we were able to catch some interesting data in XE session where we found the exact same error as in an errorlog:
The last command which was run a few microseconds before was a breakthrough. The statement was looking for the existence of objects MSreplication_subscriptons and MSsubscriptions_agents.
Why a logical disjunction was used inside an IF EXISTS condition is a mystery.
IF EXISTS (SELECT * FROM sys.objects WHERE name = ‘MSreplication_subscriptions’ )
or EXISTS
(SELECT * FROM sys.objects WHERE name = ‘MSsubscription_agents’)
It surprises me how a query like this can get through the code review in Microsoft but it is what it is ?. We have checked that just one tested object existed. Unfortunately, the script was working with both in later code and therefore it errored and the whole patch process failed.
Resolution
The table, which caused trouble in our case, was MSsubscription_agents. It stores some information about replication subscribers.
As it showed up, the database was part of SQL Server Replication topology in the past and even though this feature was removed correctly, somebody forgot to delete information about subscribers.
When we have manually run a procedure sys.sp_subscription_cleanup for all forgotten subscribers,
EXEC sys.sp_subscription_cleanup @publisher = ‘xxxxxxxxxxx’, @publisher_db = ‘xxxxxxxxxxx’
the object MSsubscription_agents disappeared and we could repeat the installation of CU. This time it went smoothly.
The key takeaway for the customer was to review a process how replications were removed instances. From the point of view of a supporting company, we appreciate the ability to use a TF902 which can help us to detect the errors in upgrade script.
It’s also important to mention that the error was most likely not connected only to SQL Server 2017 CU9. This particular database was migrated from SQL Server 2008 R2 recently and this was the first round of patching on top of the 2017 version. So it could appear with earlier (and possibly later) versions of CU as well.