I know how to detect and delete duplicates in a test table with a SELECT clause, and this works fine on a small table. But if the table has, say, a million rows, it sounds like a stored procedure would be faster. My question is: how do I display those rows from a proc so I can see what the problem is? The PRINT statement doesn't seem to work, and I wondered if I had to go through the process of building an output stream. The proc creates okay, but I'm stuck after that part.
thx
Kat -- very rough code below
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
create proc dupcount
    @count int output   -- needs OUTPUT so the caller can read the count
as
set nocount on

select CategoryID,
       CategoryName,
       count(*) as Dups
from Categories
group by CategoryID, CategoryName
having count(*) > 1

set @count = @@ROWCOUNT

-- This PRINT fails: CategoryID and CategoryName are result-set columns,
-- not variables, so they are not in scope in a standalone statement.
-- print convert(varchar(30), CategoryID) + ' ' + convert(varchar(30), @count) + CategoryName
hello kat,
I recommend you save the records to be deleted into a table. That's for reporting purposes, and just in case you might have deleted something important, for backup purposes.
A PRINT statement would not be ideal for very large operations; besides, it's too slow.
regards,
joey
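Joey's suggestion of saving the rows before deleting them might look roughly like this, using Kat's Categories table from above (CategoriesDupBackup is a hypothetical name; SELECT ... INTO creates it):

```
-- Snapshot the duplicated rows into a backup table before deleting anything
select CategoryID, CategoryName
into CategoriesDupBackup
from Categories
where CategoryID in (select CategoryID
                     from Categories
                     group by CategoryID
                     having count(*) > 1)
```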
|||Hi Joey,
Thx for all your help... should I do this in a batch or in a proc?
Kind Regards,
Kat
hi kat,
a batch operation will do since this is just a one-time operation.
You don't want to have too many SPs in your db.
Anyway, are you done with the delete statement?
I mean the delete query which will delete the duplicates?
joey
|||Yes, thank you very much! I'm trying to find test situations and how to resolve them. This was one that I've seen happen before.
Kat
|||Never save deleted records. Transactions and error handling ensure wanted records are not deleted.
joeydj wrote: hello kat,
I recommend you save the records to be deleted into a table. That's for reporting purposes, and just in case you might have deleted something important, for backup purposes.
A PRINT statement would not be ideal for very large operations; besides, it's too slow.
regards,
joey
If you want to delete them, delete them and be done with it. It's going to save a lot of purging time later on.
...and PRINT is for Query Analyzer and Management Studio, not for application display.
Adamus
|||If you want to display them in an application form, use SELECT not PRINT.
Research "Views" in SQL Server.
Adamus
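As a sketch of that view idea (the view name here is made up), wrapping the duplicate-detection SELECT in a view lets an application query the duplicate groups directly instead of relying on PRINT:

```
-- Hypothetical view an application can SELECT from to display duplicates
create view vwCategoryDups
as
select CategoryID, CategoryName, count(*) as Dups
from Categories
group by CategoryID, CategoryName
having count(*) > 1
```

The application then just runs `select * from vwCategoryDups` and binds the result set like any other query.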
hello adam,
let's help the girl.
How do we delete duplicates
and, of course, leave one of them in the table?
Let's say I have a table named tablex:
tablex
id, lastname, firstname
1, cruz, juan
1, cruz, juan carlos
2, penduko, pedro
2, penduko, peter
2, penduko, pedro
3, falaypay,pacifico
3, falaypay, pacifico
3, falapay, pacifica
4, khan, cynthia
4, luster, cynthia - got married
4, khan, agnes cynthia - the encoder missed the second name
In this example I have a very dirty employees table in the data warehouse, and I have 8 million rows. And it happens in real life, hahaha. If this is a warehouse there should be a timestamp, but sadly the architect forgot to include one.
How can I remove the duplicates and maintain integrity? OK, never mind integrity: how do we delete the duplicates first, just using the id as the qualifier?
regards,
joey
|||I was just wondering if anyone had an answer to Joey's post. It appears that I would have to manually look at the data to resolve the issue. Any ideas?
|||You can detect duplicates using the query below:
select id
from tbl
group by id
having count(*) > 1
For the delete logic to work efficiently, you need some other key column(s) in the table to determine which row to keep. This can also be a set of columns that uniquely identifies a row within each group.
delete tbl
where timestamp < (select max(t1.timestamp) from tbl as t1 where t1.id = tbl.id)
go
For this type of table in a data warehouse that doesn't have any constraints, you can add an identity column and then do the following:
alter table tbl add uqid int not null identity
go
delete tbl
where uqid > (select min(t1.uqid) from tbl as t1 where t1.id = tbl.id)
go
If you have a large number of rows to delete, you can perform the delete in batches like:
-- Works only on SQL Server 2005. Uses TOP clause support in DML
declare @n int
set @n = 1000
while (1 = 1)
begin
    -- delete N rows at a time
    delete top(@n) tbl
    where uqid > (select min(t1.uqid) from tbl as t1 where t1.id = tbl.id)
    if @@rowcount = 0 break
end
go
-- Older versions of SQL Server
declare @n int
set @n = 1000
set rowcount @n
while (1 = 1)
begin
    -- delete N rows at a time
    delete tbl
    where uqid > (select min(t1.uqid) from tbl as t1 where t1.id = tbl.id)
    if @@rowcount = 0 break
end
set rowcount 0
go
If you cannot add an identity column or don't have other columns that identify a duplicate within each group, you can use the approach below. It will however be slower due to the use of a cursor.
declare @id int, @cnt int
declare @dupes cursor
set @dupes =
    cursor fast_forward for
    select id, count(*) - 1
    from tbl
    group by id
    having count(*) > 1
open @dupes
while (1 = 1)
begin
    fetch @dupes into @id, @cnt
    if @@fetch_status < 0 break
    /* Older versions of SQL Server
    set rowcount @cnt -- don't care which N-1 rows are deleted
    delete tbl
    where id = @id
    */
    /* SQL2005
    delete top(@cnt) tbl
    where id = @id
    */
end
/* Older versions of SQL Server
set rowcount 0
*/
go
However, this is a bad way to maintain a data warehouse. You need to build rules into the data loading and transformation process so you can track changes easily. You need to model your dimensions so that they can accommodate changes to attribute values. There are many standard techniques available to do this.
|||DECLARE @FirstName varchar(25)

DECLARE c1 CURSOR FAST_FORWARD FOR
    SELECT firstname
    FROM Employees
    GROUP BY firstname
    HAVING COUNT(firstname) > 1

OPEN c1
FETCH NEXT FROM c1 INTO @FirstName
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO DeletedItems(firstname) VALUES(@FirstName)
    DELETE FROM Employees
    WHERE firstname = @FirstName
    FETCH NEXT FROM c1 INTO @FirstName
END
CLOSE c1
DEALLOCATE c1
|||Hi Adamus,
I thought 'cursors' were the bane of SQL, if I had a table with millions of rows, wouldn't this take forever? Or is this something new in 2005?
thx
|||
katgreen777 wrote: Hi Adamus,
I thought 'cursors' were the bane of SQL, if I had a table with millions of rows, wouldn't this take forever? Or is this something new in 2005?
thx
CTE is the replacement in 2005 and I'm learning to code it, but yes, cursors do carry a lot of overhead. Notice I use FAST_FORWARD; it makes a big difference.
My argument with cursors and millions of rows is: if you're dealing with millions of rows, there is no lightning fix other than hardware enhancements. You could argue until you're blue in the face about seconds, execution plans, and alternative methods, and the list goes on, only to discover that a well-written cursor and ample thought will give you the result you need.
Adamus
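For what it's worth, a CTE-based dedup on SQL Server 2005 usually pairs a CTE with ROW_NUMBER(). A rough sketch against the earlier tbl example (column names assumed from that example):

```
-- Number the rows within each id group, then delete every row after the first.
-- Which row survives depends on the ORDER BY inside ROW_NUMBER().
with numbered as
(
    select id,
           row_number() over (partition by id order by id) as rn
    from tbl
)
delete from numbered
where rn > 1
go
```

Deleting through the CTE removes the underlying rows in tbl, so this keeps exactly one row per id without needing an identity column or a cursor.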
|||katgreen777 wrote: Hi Adamus,I thought 'cursors' were the bane of SQL, if I had a table with millions of rows, wouldn't this take forever? Or is this something new in 2005?
Please take a look at the code that I posted before. You don't need cursors unless your table(s) don't have any key information, unique identifiers, or other required columns. To delete a large number of rows efficiently, you can use a batching mechanism with SET ROWCOUNT or TOP; see the examples I posted. And CTEs, as mentioned in the other post, don't really help in this case. A CTE is just a query expression, similar to derived tables or views, and provides some additional capabilities like writing recursive queries.
|||Hi,
Looking at the 'dirty' warehouse table above, wouldn't we just use the id number to delete the rows, because that is unique and the firstname isn't?
thx