Performance issues with UNION of large tables
I have seven large tables, each storing between 100 and 1 million rows at any time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4 ... LargeTable7. These tables are mostly static: there are no updates or new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each.
All these tables have three fields in common: Headquarter, Country and File. Headquarter and Country are numbers in the format '000', though in two of these tables they are stored as int due to other system requirements.
I have another, much smaller table called Headquarters with the information for each headquarter. This table has very few entries: at most 1000, actually.
Now, I need to create a stored procedure that returns all headquarters that appear in the large tables but are either absent from the Headquarters table or have been deleted (deletion in this table is logical: it has a DeletionDate field to check this).
This is the query I've tried:
```sql
CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
    DECLARE @headquartersFiles TABLE
    (
        hq int,
        countryFile varchar(MAX)
    );

    SET NOCOUNT ON

    INSERT INTO @headquartersFiles
    SELECT headquarter, CONCAT(country, ' (', file, ')')
    FROM
    (
        SELECT DISTINCT CONVERT(int, headquarter) as headquarter,
                        CONVERT(int, country) as country,
                        file
        FROM LargeTable1
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable2
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable3
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable4
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable5
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable6
        UNION
        SELECT DISTINCT headquarter, country, file
        FROM LargeTable7
    ) TC

    SELECT RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) as headquarter,
           MAX(s.deletionDate) as deletionDate,
           STUFF
           (
               (SELECT DISTINCT ', ' + st2.countryFile
                FROM @headquartersFiles st2
                WHERE st2.headquarter = st.headquarter
                FOR XML PATH('')),
               1, 1, ''
           ) countryFile
    FROM @headquartersFiles as st
    LEFT JOIN headquarters s
        ON CONVERT(int, s.headquarter) = st.headquarter
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
    GROUP BY st.headquarter
END
```
This sp's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following total rows for each table (just to give you an idea about the sizes):
- LargeTable1: 1516666 rows
- LargeTable2: 645740 rows
- LargeTable3: 1950121 rows
- LargeTable4: 779336 rows
- LargeTable5: 1100999 rows
- LargeTable6: 16499 rows
- LargeTable7: 24454 rows
What can I do to improve performance? I've tried the following, with not much difference:
- Inserting into the local table by batches, excluding those
I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I would prefer not to have to modify its module to keep these things tidy and updated. But if that's the best possible alternative, I'd go for it.
Thanks
Answer by Hogan for Performance issues with UNION of large tables
Take this filter
```sql
LEFT JOIN headquarters s
    ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
   OR s.deletionDate IS NOT NULL
```
and add it to each individual query in the union before inserting into @headquartersFiles.
It might seem like this adds a lot more filters, but it will actually speed things up because you are filtering before you start processing the union.
Also, take out all your DISTINCTs. Doing so probably won't speed things up, but they are redundant because you are doing a UNION (which already removes duplicates) and not a UNION ALL.
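To illustrate, the first two branches of the union rewritten this way might look like the following sketch, reusing the table and column names from the question:

```sql
-- Sketch of the suggestion above: apply the headquarters filter inside
-- each branch of the UNION, so far fewer rows reach the deduplication
-- and CONCAT steps. Names are taken from the question.
INSERT INTO @headquartersFiles
SELECT headquarter, CONCAT(country, ' (', file, ')')
FROM
(
    SELECT CONVERT(int, lt.headquarter) AS headquarter,
           CONVERT(int, lt.country) AS country,
           lt.file
    FROM LargeTable1 lt
    LEFT JOIN headquarters s
        ON CONVERT(int, s.headquarter) = CONVERT(int, lt.headquarter)
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
    UNION
    SELECT lt.headquarter, lt.country, lt.file
    FROM LargeTable2 lt
    LEFT JOIN headquarters s
        ON CONVERT(int, s.headquarter) = lt.headquarter
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
    -- ... repeat for LargeTable3 through LargeTable7 ...
) TC
```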
Answer by Tom H for Performance issues with UNION of large tables
I'd try doing the filtering with each individual table first. You just need to account for the fact that a headquarter might appear in one table, but not another. You can do this like so:
```sql
SELECT headquarter
FROM
(
    SELECT DISTINCT headquarter, 'table1' AS large_table
    FROM LargeTable1 LT
    LEFT OUTER JOIN Headquarters HQ
        ON HQ.headquarter = LT.headquarter
    WHERE HQ.headquarter IS NULL
       OR HQ.deletion_date IS NOT NULL
    UNION ALL
    SELECT DISTINCT headquarter, 'table2' AS large_table
    FROM LargeTable2 LT
    LEFT OUTER JOIN Headquarters HQ
        ON HQ.headquarter = LT.headquarter
    WHERE HQ.headquarter IS NULL
       OR HQ.deletion_date IS NOT NULL
    UNION ALL
    ...
) SQ
GROUP BY headquarter
HAVING COUNT(*) = 7
```
That would make sure that it's missing or deleted across all seven tables.
Answer by Eric for Performance issues with UNION of large tables
Table variables can have horrible performance because SQL Server does not generate statistics for them. Instead of a table variable, try using a temp table, and if headquarter + country + file is unique in the temp table, add a unique constraint (which will create a clustered index) in the temp table definition. You can create indexes on a temp table after creating it, but for various reasons SQL Server may ignore them.
Edit: as it turns out, you can in fact create indexes on table variables, even non-unique ones, in 2014+.
Secondly, try not to use functions in your joins or WHERE clauses; doing so often prevents index use and causes performance problems.
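A minimal sketch of the temp-table suggestion, where the column types and sizes are assumptions based on the question:

```sql
-- Sketch: a temp table with a unique clustered constraint instead of
-- a table variable, so SQL Server builds statistics and a useful index.
-- Column types and the constraint name are illustrative assumptions.
CREATE TABLE #headquartersFiles
(
    headquarter int          NOT NULL,
    country     int          NOT NULL,
    [file]      varchar(100) NOT NULL,
    CONSTRAINT UQ_headquartersFiles
        UNIQUE CLUSTERED (headquarter, country, [file])
);

-- Populate it with the UNION query, then join against it as before;
-- the temp table is dropped automatically when the session ends.
```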
Answer by Gordon Linoff for Performance issues with UNION of large tables
Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need . . . along with an index:
```sql
alter table headquarters add headquarter_int as (cast(headquarter as int));

create index idx_headquarters_int on headquarters(headquarter_int);

SELECT DISTINCT headquarter, country, file
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter AND
                        s.deletiondate IS NOT NULL
                 );
```
Then, you want an index on LargeTable5(headquarter, country, file).
This should take less than 5 seconds to run. If so, then construct the full query, being sure that the types in the correlated subquery match and that you have the right index on each of the large tables. Use union to remove duplicates between the tables.
Answer by Grayson for Performance issues with UNION of large tables
The real answer is to create separate INSERT
statements for each table with the caveat that data to be inserted does not exist in the destination table.
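A sketch of that approach for one table, assuming a destination temp table named #missingHeadquarters (the name and columns are illustrative):

```sql
-- Sketch: one INSERT per large table, skipping rows already present in
-- the destination so the table stays duplicate-free. Names are assumptions.
INSERT INTO #missingHeadquarters (headquarter, country, [file])
SELECT DISTINCT lt.headquarter, lt.country, lt.[file]
FROM LargeTable1 lt
WHERE NOT EXISTS (SELECT 1
                  FROM #missingHeadquarters m
                  WHERE m.headquarter = lt.headquarter
                    AND m.country = lt.country
                    AND m.[file] = lt.[file]);
-- Repeat for LargeTable2 through LargeTable7.
```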