|
|
Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/14/2007 6:22:41 PM
Neil Conway <neilc@samurai.com> writes:
> By modifying COPY: COPY IGNORE ERRORS or some such would instruct COPY
> to drop (and log) rows that contain malformed data. That is, rows with
> too many or too few columns, rows that result in constraint violations,
> and rows containing columns where the data type's input function raises
> an error. The last case is the only thing that would be a bit tricky to
> implement, I think: you could use PG_TRY() around the InputFunctionCall,
> but I guess you'd need a subtransaction to ensure that you reset your
> state correctly after catching an error.
Yeah. It's the subtransaction per row that's daunting --- not only the
cycles spent for that, but the ensuing limitation to 4G rows imported
per COPY.
If we could somehow only do a subtransaction per failure, things would
be much better, but I don't see how.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/14/2007 8:03:56 PM
Neil Conway <neilc@samurai.com> writes:
> One approach would be to essentially implement the pg_bulkloader
> approach inside the backend. That is, begin by doing a subtransaction
> for every k rows (with k = 1000, say). If you get any errors, then
> either repeat the process with k/2 until you locate the individual
> row(s) causing the trouble, or perhaps just immediately switch to k = 1.
> Fairly ugly though, and would be quite slow for data sets with a high
> proportion of erroneous data.
You could make it self-tuning, perhaps: initially, or after an error,
set k = 1, and increase k after a successful set of rows.
> Another approach would be to distinguish between errors that require a
> subtransaction to recover to a consistent state, and less serious errors
> that don't have this requirement (e.g. invalid input to a data type
> input function). If all the errors that we want to tolerate during a
> bulk load fall into the latter category, we can do without
> subtransactions.
I think such an approach is doomed to hopeless unreliability. There is
no concept of an error that doesn't require a transaction abort in the
system now, and that doesn't seem to me like something that can be
successfully bolted on after the fact. Also, there's a lot of
bookkeeping (eg buffer pins) that has to be cleaned up regardless of the
exact nature of the error, and all those mechanisms are hung off
transactions.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/15/2007 1:12:58 AM
Josh Berkus <josh@agliodbs.com> writes:
> There's no way we can do a transactionless load, then? I'm thinking of the
> load-into-new-partition which is a single pass/fail operation. Would
> ignoring individual row errors in for this case still cause these kinds of
> problems?
Given that COPY fires triggers and runs CHECK constraints, there is no
part of the system that cannot be exercised during COPY. So I think
supposing that we can just deal with some simplified subset of reality
is mere folly.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at
http://www.postgresql.org/about/donate
Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/15/2007 2:44:56 AM
NikhilS <nikkhils@gmail.com> writes:
> Any errors which occur before doing the heap_insert should not require
> any recovery according to me.
A sufficient (though far from all-encompassing) rejoinder to that is
"triggers and CHECK constraints can do anything".
> The overhead of having a subtransaction per row is a very valid concern. But
> instead of using a per insert or a batch insert substraction, I am
> thinking that we can start off a subtraction and continue it till we
> encounter a failure.
What of failures that occur only at (sub)transaction commit, such as
foreign key checks?
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
Subject: VLDB Features
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 12/16/2007 12:27:36 PM
Hannu Krosing <hannu@skype.net> writes:
> But can't we _define_ such a subset, where we can do a transactionless
> load ?
Sure ... but you'll find that it's not large enough to be useful.
Once you remove all the interesting consistency checks such as
unique indexes and foreign keys, the COPY will tend to go through
just fine, and then you're still stuck trying to weed out bad data
without very good tools for it. The only errors we could really
separate out without subtransaction fencing are extremely trivial
ones like too many or too few fields on a line ... which can be
caught with a sed script.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
|