Group: pgsql.hackers


Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/14/2007 6:22:41 PM
Neil Conway <neilc@samurai.com> writes: > By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY > to drop (and log) rows that contain malformed data. That is, rows with > too many or too few columns, rows that result in constraint violations, > and rows containing columns where the data type's input function raises > an error. The last case is the only thing that would be a bit tricky to > implement, I think: you could use PG_TRY() around the InputFunctionCall, > but I guess you'd need a subtransaction to ensure that you reset your > state correctly after catching an error. Yeah. It's the subtransaction per row that's daunting --- not only the cycles spent for that, but the ensuing limitation to 4G rows imported per COPY. If we could somehow only do a subtransaction per failure, things would be much better, but I don't see how. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match

Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/14/2007 8:03:56 PM
Neil Conway <neilc@samurai.com> writes: > One approach would be to essentially implement the pg_bulkloader > approach inside the backend. That is, begin by doing a subtransaction > for every k rows (with k = 1000, say). If you get any errors, then > either repeat the process with k/2 until you locate the individual > row(s) causing the trouble, or perhaps just immediately switch to k = 1. > Fairly ugly though, and would be quite slow for data sets with a high > proportion of erroneous data. You could make it self-tuning, perhaps: initially, or after an error, set k = 1, and increase k after a successful set of rows. > Another approach would be to distinguish between errors that require a > subtransaction to recover to a consistent state, and less serious errors > that don't have this requirement (e.g. invalid input to a data type > input function). If all the errors that we want to tolerate during a > bulk load fall into the latter category, we can do without > subtransactions. I think such an approach is doomed to hopeless unreliability. There is no concept of an error that doesn't require a transaction abort in the system now, and that doesn't seem to me like something that can be successfully bolted on after the fact. Also, there's a lot of bookkeeping (eg buffer pins) that has to be cleaned up regardless of the exact nature of the error, and all those mechanisms are hung off transactions. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings

Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/15/2007 1:12:58 AM
Josh Berkus <josh@agliodbs.com> writes: > There's no way we can do a transactionless load, then? I'm thinking of the > load-into-new-partition which is a single pass/fail operation. Would > ignoring individual row errors in for this case still cause these kinds of > problems? Given that COPY fires triggers and runs CHECK constraints, there is no part of the system that cannot be exercised during COPY. So I think supposing that we can just deal with some simplified subset of reality is mere folly. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate

Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/15/2007 2:44:56 AM
NikhilS <nikkhils@gmail.com> writes: > Any errors which occur before doing the heap_insert should not require > any recovery according to me. A sufficient (though far from all-encompassing) rejoinder to that is "triggers and CHECK constraints can do anything". > The overhead of having a subtransaction per row is a very valid concern. But > instead of using a per insert or a batch insert substraction, I am > thinking that we can start off a subtraction and continue it till we > encounter a failure. What of failures that occur only at (sub)transaction commit, such as foreign key checks? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend

Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/16/2007 12:27:36 PM
Hannu Krosing <hannu@skype.net> writes: > But can't we _define_ such a subset, where we can do a transactionless > load ? Sure ... but you'll find that it's not large enough to be useful. Once you remove all the interesting consistency checks such as unique indexes and foreign keys, the COPY will tend to go through just fine, and then you're still stuck trying to weed out bad data without very good tools for it. The only errors we could really separate out without subtransaction fencing are extremely trivial ones like too many or too few fields on a line ... which can be caught with a sed script. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match