[ajug-members] Deleting Duplicate Records using JAVA and JDBC!
Joe Sam Shirah
joe_sam at bellsouth.net
Wed Dec 13 07:56:02 EST 2006
Hi Hammond,
> Doing it on the database server side would be the most efficient.
I completely agree with Sid. That's true of any database access
language, not just JDBC. Here are some links that paid attention to
performance ( again assuming Oracle ):
http://www.jlcomp.demon.co.uk/faq/duplicates.html
http://www.pythian.com/blogs/209/oracle-deleting-duplicate-rows-efficiently
An alternative, if you can afford the downtime and there are no side
issues, is to use one of the above solutions ( or just DISTINCT if that's
acceptable ) to gen a new table, delete the old and rename, then rebuild
indexes. HTH,
Joe Sam
Joe Sam Shirah - http://www.conceptgo.com
conceptGO - Consulting/Development/Outsourcing
Java Filter Forum: http://www.ibm.com/developerworks/java/
Just the JDBC FAQs: http://www.jguru.com/faq/JDBC
Going International? http://www.jguru.com/faq/I18N
Que Java400? http://www.jguru.com/faq/Java400
----- Original Message -----
From: "Sid Womac" <sid.womac at ngc.com>
To: "General AJUG membership forum (100-200 messages/month)"
<ajug-members at ajug.org>
Sent: Tuesday, December 12, 2006 8:38 PM
Subject: Re: [ajug-members] Deleting Duplicate Records using JAVA and JDBC!
> hsaid at checkfree.com wrote:
>
> > 3- Option is to use the something like this query to delete multiple
> > records:
> >
> > delete from RESPONSE_DATA m1
> > where m1.rowid > (select min(m2.rowid) from RESPONSE_DATA m2 where m1.
> > RECORD_ID=m2.RECORD_ID
> > );
>
> Doing it on the database server side would be the most efficient.
> (Given the mention of 'rowid' above, I assume that Oracle is the db
> implementation). Databases are optimized to perform operations such as
> these (subqueries, mass deletions, etc).
>
> The downside with either Option 1 or 2, the programmatic approach, is
> that they must pull all of that data, all 100+ million records, across
> the network. Also, as you mention, there are all of the db
> transactions, or client side memory usage. So the db server must read
> the data into memory from disk, send it across the network, to be loaded
> into memory on the client side, who then operates on it and sends a
> subset back across the network to the db. More work on the db side,
> then finally committed back to disk (or the transaction logs, at least).
>
> All of the above approaches would work, but I think the Db approach puts
> the least strain on resources.
>
> You might consider a Unique constraint to RECORD_ID in the future, but I
> realize that all business logic is unique, and there may be a reason
> this is not feasible.
>
> cheers
>
> Sid
>
>
More information about the ajug-members
mailing list