Wednesday, March 21, 2012

detecting & deleting duplicates in batch vs proc

I know how to detect & delete dups (or rows with count > 1) in a test table with a SELECT clause. This works fine in a small table, but if the table has a million rows, say, it sounds like a proc would be faster. My question is: how do I display those rows in a proc so I can see what the problem is? The PRINT stmt. doesn't seem to work, and I wondered if I had to go through the process of building an output stream. The proc creates okay but I'm stuck after that part.

thx

Kat -- very rough code below

SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
create proc dupcount
@count int
as
set nocount on
select CategoryID,
CategoryName,
Count(*) As Dups
from Categories
group by CategoryID, CategoryName
having count(*) > 1

set @count = @@Rowcount
print convert(varchar(30),CategoryID) + ' ' + convert(varchar(30),@count) + CategoryName

/*set @count = @@RowCount
IF @count > 1
print convert CategoryID, CategoryName, convert(varchar(30),@count)*/

hello kat,

I recommend you save the records to be deleted into a table. That's for reporting purposes, and just in case you might have deleted something important, for backup purposes.

A print statement would not be ideal for very large operations; besides, it's too slow.
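A minimal sketch of this save-before-delete idea, assuming Kat's Categories table from the proc above (the backup table name Categories_dup_backup is hypothetical):

```sql
-- Copy every row that belongs to a duplicate group into a backup
-- table before deleting anything, so the delete can be audited.
SELECT c.*
INTO Categories_dup_backup
FROM Categories AS c
JOIN (SELECT CategoryID, CategoryName
      FROM Categories
      GROUP BY CategoryID, CategoryName
      HAVING COUNT(*) > 1) AS d
  ON d.CategoryID = c.CategoryID
 AND d.CategoryName = c.CategoryName
```

SELECT ... INTO creates the backup table on the fly, so it works even when no backup table exists yet.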

regards,

joey

|||

Hi Joey,

Thx for all your help... should I do this in a batch or in a proc?

Kind Regards,

Kat

|||

hi kat,

a batch operation will do since this is just a one-time operation.

You don't want to have so many SPs in your db.

Anyway, are you done with the delete statement? I mean the delete query which will delete the duplicates?

joey

|||

Yes, thank you very much! I'm trying to find test situations and how to resolve them. This was one that I've seen happen before.

Kat

|||

joeydj wrote:

hello kat,

I recommend you save the records to be deleted into a table. That's for reporting purposes, and just in case you might have deleted something important, for backup purposes.

A print statement would not be ideal for very large operations; besides, it's too slow.

regards,

joey

Never save deleted records. Transactions and error handling ensure wanted records are not deleted.

If you want to delete them, delete them and be done with it. It's going to save a lot of purging time later on.
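The transactions-and-error-handling point could be sketched like this (pre-TRY/CATCH style, since the thread spans SQL Server 2000 and 2005; the table and criteria are placeholders for whatever duplicate condition you use):

```sql
-- Wrap the delete in a transaction and sanity-check the outcome,
-- so an unexpected result can be rolled back instead of restored
-- from a saved-rows table later.
BEGIN TRAN

DELETE FROM Categories
WHERE CategoryName = 'Some duplicated name'  -- your duplicate criteria here

-- Roll back on error, or if far more rows vanished than expected
IF @@ERROR <> 0 OR @@ROWCOUNT > 100
    ROLLBACK TRAN
ELSE
    COMMIT TRAN
```

The row-count ceiling (100 here) is an arbitrary example; pick a bound that matches how many duplicates you expect.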

...and PRINT is for Query Analyzer and Management Studio not for application display.

Adamus

|||

If you want to display them in an application form, use SELECT not PRINT.

Research "Views" in SQL Server.
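A minimal sketch of that idea, assuming Kat's Categories table: a view over the duplicate-detection query that an application can SELECT from instead of relying on PRINT output.

```sql
-- Hypothetical view exposing the duplicate groups; any client
-- can query it like a table.
CREATE VIEW dbo.DuplicateCategories
AS
SELECT CategoryID, CategoryName, COUNT(*) AS Dups
FROM Categories
GROUP BY CategoryID, CategoryName
HAVING COUNT(*) > 1
```

Then `SELECT * FROM dbo.DuplicateCategories` returns the current duplicate groups from any connection.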

Adamus

|||

hello adam,

let's help the girl.

How do we delete duplicates and, of course, leave one of them in the table?

Let's say I have a table named tablex:

tablex

id, lastname, first

1, cruz, juan
1, cruz, juan carlos
2, penduko, pedro
2, penduko, peter
2, penduko, pedro
3, falaypay, pacifico
3, falaypay, pacifico
3, falapay, pacifica
4, khan, cynthia
4, luster, cynthia - got married
4, khan, agnes cynthia - the encoder missed the second name

In this example I have a very dirty employees table in the data warehouse, and I have 8 million rows of them. And it happens in real life. hahaha. If this is a warehouse there should be a timestamp, but sad to say the architect forgot to put one in.

How can I remove the duplicates and maintain the integrity? OK, never mind integrity: how do we delete the duplicates first, just using the id as the qualifier?

regards,

joey

|||

I was just wondering if anyone had an answer to Joey's post. It appears that I would have to manually look at the data to resolve the issue, any ideas?

|||

You can detect duplicates using query below:

select id
from tbl
group by id
having count(*) > 1

For the delete logic to work efficiently, you need some other key column(s) in the table to determine which row to keep. This can also be some column(s) that uniquely identify a row within each group.

delete tbl
where timestamp < (select max(t1.timestamp) from tbl as t1 where t1.id = tbl.id)
go

For these types of tables in a data warehouse that don't have any constraints, you can add an identity column and then do the following:

alter table tbl add uqid int not null identity
go

delete tbl
where uqid > (select min(t1.uqid) from tbl as t1 where t1.id = tbl.id)
go

If you have a large number of rows to delete, you can perform the delete in batches like:

-- Works only on SQL Server 2005. Uses TOP clause support in DML

-- Works only on SQL Server 2005. Uses TOP clause support in DML
declare @n int
set @n = 1000

while(1=1)
begin
-- delete N rows at a time
delete top(@n) tbl
where uqid > (select min(t1.uqid) from tbl as t1 where t1.id = tbl.id)
if @@rowcount = 0 break
end
go

-- Older versions of SQL Server

-- Older versions of SQL Server
declare @n int
set @n = 1000
set rowcount @n

while(1=1)
begin
-- delete N rows at a time
delete tbl
where uqid > (select min(t1.uqid) from tbl as t1 where t1.id = tbl.id)
if @@rowcount = 0 break
end
set rowcount 0
go

If you cannot add an identity column and have no other columns that identify a duplicate within each group, you can use the approach below. This will, however, be slower due to the use of a cursor.

declare @id int, @cnt int
declare @dupes cursor

set @dupes = cursor fast_forward for
    select id, count(*) - 1
    from tbl
    group by id
    having count(*) > 1

open @dupes
while(1=1)
begin
    fetch @dupes into @id, @cnt
    if @@fetch_status < 0 break

    /* Older versions of SQL Server
    set rowcount @cnt -- don't care which N-1 rows are deleted
    delete tbl
    where id = @id
    */

    /* SQL2005
    delete top(@cnt) tbl
    where id = @id
    */
end
close @dupes
deallocate @dupes

/* Older versions of SQL Server
set rowcount 0
*/
go

However, this is a bad way to maintain a data warehouse. You need to build rules into the data loading and transformation process to track changes easily. You need to model your dimensions so that they can accommodate changes to attribute values. There are many standard techniques available to do this.

|||

DECLARE @FirstName varchar(25)

DECLARE c1 CURSOR FAST_FORWARD FOR
    Select firstname
    from Employees
    Group by firstname
    having count(firstname) > 1

open c1
fetch next from c1 into @FirstName

While @@fetch_status = 0
BEGIN
    insert into DeletedItems(firstname) Values(@FirstName)

    Delete from Employees
    Where firstname = @FirstName

    fetch next from c1 into @FirstName
END

CLOSE c1
DEALLOCATE c1

|||

Hi Adamus,

I thought 'cursors' were the bane of SQL, if I had a table with millions of rows, wouldn't this take forever? Or is this something new in 2005?

thx

|||

katgreen777 wrote:

Hi Adamus,

I thought 'cursors' were the bane of SQL, if I had a table with millions of rows, wouldn't this take forever? Or is this something new in 2005?

thx

CTE is the replacement in 2005 and I'm learning to code it, but yes, cursors do carry a lot of overhead. Notice I use "fast_forward"; it makes a big difference.

My argument with cursors and millions of rows is: if you're dealing with millions of rows, there is no lightning fix other than hardware enhancements. You could argue until you're blue in the face with seconds, execution plans, alternative methods, and so on, only to discover that a well-written cursor and ample thought will give you the result you need.

Adamus

|||

katgreen777 wrote:

Hi Adamus,I thought 'cursors' were the bane of SQL, if I had a table with millions of rows, wouldn't this take forever? Or is this something new in 2005?

Please take a look at the code that I posted before. You don't need cursors unless your table(s) don't have any key information, unique identifiers, or other required columns. To delete a large number of rows efficiently, you can use a batching mechanism using SET ROWCOUNT or TOP. See the examples I posted. And CTEs, as mentioned in the other post, don't really help in this case. A CTE is just a query expression similar to derived tables or views, and it provides some additional capabilities like writing recursive queries.
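One footnote on the CTE point: while a CTE is indeed just a query expression, on SQL Server 2005 a CTE combined with ROW_NUMBER() does give a compact way to delete duplicates without an identity column or cursor. A sketch, assuming the same hypothetical tbl with an id column:

```sql
-- Number the rows within each id group; any row with rn > 1 is
-- a duplicate, and deleting through the CTE removes it from tbl.
;WITH numbered AS
(
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
    FROM tbl
)
DELETE numbered
WHERE rn > 1
```

With only id in the ORDER BY, which duplicate survives within each group is arbitrary; order by another column (a timestamp, say) if you care which row to keep.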

|||

Hi,

Looking at the 'dirty' warehouse table above, wouldn't we just use the id number to delete the rows, because that is unique and the firstname isn't?

thx
