A blog for coding and discussion of coding: JavaScript, PHP, CGI, general web building, etc.

Thursday, August 18, 2016

Performance issues with UNION of large tables



I have seven large tables, each storing between 100 and 1 million rows at any time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4...LargeTable7. These tables are mostly static: there are no updates or new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each.

All these tables have three fields in common: Headquarter, Country and File. Headquarter and Country are numbers stored in the format '000', though in two of these tables they are parsed as int due to other system requirements.

I have another, much smaller table called Headquarters with the information of each headquarter. This table has very few entries. At most 1000, actually.

Now, I need to create a stored procedure that returns all headquarters that appear in the large tables but are either absent from the Headquarters table or have been deleted (deletion is logical: the table has a DeletionDate field to check this).

This is the query I've tried:

CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
    DECLARE @headquartersFiles TABLE
    (
        headquarter int,
        countryFile varchar(MAX)
    );

    SET NOCOUNT ON;

    INSERT INTO @headquartersFiles
    SELECT headquarter, CONCAT(country, ' (', file, ')')
    FROM
    (
        SELECT DISTINCT CONVERT(int, headquarter) AS headquarter,
                        CONVERT(int, country) AS country,
                        file
        FROM LargeTable1
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable2
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable3
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable4
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable5
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable6
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable7
    ) TC;

    SELECT  RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) AS headquarter,
            MAX(s.deletionDate) AS deletionDate,
            STUFF(
                (SELECT DISTINCT ', ' + st2.countryFile
                 FROM @headquartersFiles st2
                 WHERE st2.headquarter = st.headquarter
                 FOR XML PATH('')),
                1, 1, ''
            ) AS countryFile
    FROM    @headquartersFiles AS st
    LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
    WHERE   s.headquarter IS NULL
       OR   s.deletionDate IS NOT NULL
    GROUP BY st.headquarter
END

This stored procedure's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following row counts for each table (just to give you an idea of the sizes):

  • LargeTable1: 1516666 rows
  • LargeTable2: 645740 rows
  • LargeTable3: 1950121 rows
  • LargeTable4: 779336 rows
  • LargeTable5: 1100999 rows
  • LargeTable6: 16499 rows
  • LargeTable7: 24454 rows

What can I do to improve performance? I've tried the following, with little difference:

  • Inserting into the local table in batches, excluding headquarters I've already inserted, and then updating the countryFile field for those that are repeated
  • Creating a view for that UNION query
  • Creating indexes for the LargeTables for the headquarter field
  • I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I'd rather not have to modify its module to keep these things tidy and updated. But if it's the best possible alternative, I'd go for it.

    Thanks

    Answer by Hogan for Performance issues with UNION of large tables


    Take this filter

    LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL

    and add it to each individual query in the UNION before inserting into @headquartersFiles.

    It might seem like this adds a lot more filters, but it will actually speed things up, because you filter each table before the UNION starts processing the rows.

    Also, take out all your DISTINCTs. It probably won't speed things up, but they are redundant: you are doing a UNION, not a UNION ALL, so duplicates are removed anyway.
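    A sketch of what one pushed-down branch might look like, assuming the question's column names (the other six branches would follow the same shape, with the int conversion only where a table stores Headquarter as varchar):

```sql
-- Sketch: one UNION branch with the filter pushed down, so rows are
-- discarded per table before they are combined and inserted into
-- @headquartersFiles.
SELECT lt.headquarter, lt.country, lt.file
FROM LargeTable2 lt
LEFT JOIN headquarters s
       ON CONVERT(int, s.headquarter) = lt.headquarter
WHERE s.headquarter IS NULL
   OR s.deletionDate IS NOT NULL
-- UNION
-- ... repeat the same shape for the other six large tables ...
```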

    Answer by Tom H for Performance issues with UNION of large tables


    I'd try doing the filtering with each individual table first. You just need to account for the fact that a headquarter might appear in one table, but not another. You can do this like so:

    SELECT headquarter
    FROM
    (
        SELECT DISTINCT
            headquarter,
            'table1' AS large_table
        FROM LargeTable1 LT
        LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
        WHERE HQ.headquarter IS NULL
           OR HQ.deletion_date IS NOT NULL
        UNION ALL
        SELECT DISTINCT
            headquarter,
            'table2' AS large_table
        FROM LargeTable2 LT
        LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
        WHERE HQ.headquarter IS NULL
           OR HQ.deletion_date IS NOT NULL
        UNION ALL
        ...
    ) SQ
    GROUP BY headquarter
    HAVING COUNT(*) = 7

    That would make sure that it's missing from all seven tables.

    Answer by Eric for Performance issues with UNION of large tables


    Table variables have horrible performance because SQL Server does not generate statistics for them. Instead of a table variable, try using a temp table, and if headquarter + country + file is unique in the temp table, add a unique constraint (which will create a clustered index) in the temp table definition. You can add indexes to a temp table after creating it, but for various reasons SQL Server may ignore them.

    Edit: as it turns out, you can in fact create indexes on table variables, even non-unique ones, in SQL Server 2014+.

    Secondly, try not to use functions in your JOIN or WHERE clauses; doing so often prevents index use and causes performance problems.
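    A minimal sketch of the temp-table suggestion, reusing the question's column names (the temp table and index names here are hypothetical):

```sql
-- Hypothetical sketch: a temp table in place of the table variable,
-- so SQL Server generates statistics and can pick better plans.
CREATE TABLE #headquartersFiles
(
    headquarter int NOT NULL,
    countryFile varchar(8000)  -- varchar(MAX) cannot be an index key column
);

-- Clustered index on the join/grouping column. If (headquarter, countryFile)
-- is unique, a UNIQUE constraint in the table definition would create a
-- suitable index instead.
CREATE CLUSTERED INDEX cix_headquartersFiles ON #headquartersFiles (headquarter);
```

    The rest of the procedure would then reference #headquartersFiles instead of @headquartersFiles and drop it at the end.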

    Answer by Gordon Linoff for Performance issues with UNION of large tables


    Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need, along with an index:

    alter table headquarters add headquarter_int as (cast(headquarter as int));
    create index idx_headquarters_int on headquarters(headquarter_int);

    SELECT DISTINCT headquarter, country, file
    FROM LargeTable5 lt5
    WHERE NOT EXISTS (SELECT 1
                      FROM headquarters s
                      WHERE s.headquarter_int = lt5.headquarter
                        AND s.deletiondate IS NOT NULL);

    Then, you want an index on LargeTable5(headquarter, country, file).

    This should take less than 5 seconds to run. If so, then construct the full query, making sure that the types in the correlated subquery match and that you have the right index on each large table. Use UNION to remove duplicates between the tables.

    Answer by Grayson for Performance issues with UNION of large tables


    The real answer is to create separate INSERT statements for each table, with the caveat that the data to be inserted must not already exist in the destination table.
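    A sketch of that approach, assuming the question's column names (the destination table name here is hypothetical):

```sql
-- Hypothetical sketch: one INSERT per large table; NOT EXISTS skips
-- rows already present in the destination.
INSERT INTO #missingHeadquarters (headquarter, country, file)
SELECT DISTINCT lt.headquarter, lt.country, lt.file
FROM LargeTable1 lt
WHERE NOT EXISTS (SELECT 1
                  FROM #missingHeadquarters m
                  WHERE m.headquarter = lt.headquarter
                    AND m.country     = lt.country
                    AND m.file        = lt.file);
-- ... repeat for LargeTable2 through LargeTable7 ...
```

    Because each INSERT touches only one large table, each one can use that table's indexes, and the de-duplication work the UNION was doing is spread across seven cheap checks instead of one giant sort.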


