Subject: Replacement Selection
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/26/2007 12:00:46 PM
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> mac_man2005@hotmail.it wrote:
>> Unfortunately I'm lost into the code... any good soul helping me to
>> understand what should be the precise part to be modified?
> I think you should print the file and read it several times until you
> understand what's going on. Then you can start thinking where and how
> to modify it.
Also, go find a copy of Knuth volume 3, because a whole lot of the
comments assume you've read Knuth's discussion of external sorting.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
Subject: Replacement Selection
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/26/2007 1:31:01 PM
<mac_man2005@hotmail.it> writes:
> 3) Start run generation. As for this phase, I see PostgreSQL code (as Knuth
> algorithm) marks elements belonging to runs in otder to know which run they
> belong to and to know when the current heap has finished building the
> current run. I don't memorize this kind of info. I just output from heap to
> run all of the elements going into the current run. The elements supposed to
> go into the next run (I call them "dead records") are still stored into main
> memory, but as leaves of the heap. This implies reducing the heap size and
> so heapifying a smaller number of elements each time I get a dead record
> (it's not necessary to sort dead records). When the heap size is zero a new
> run is created heapifying all the dead records currently present into main
> memory.
Why would this be an improvement over Knuth? AFAICS you can't generate
longer runs this way, and it's not saving any time --- in fact it's
costing time, because re-heapifying adds a lot of new comparisons.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
Subject: Replacement Selection
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/26/2007 5:17:07 PM
<mac_man2005@hotmail.it> writes:
> Anyway, even in my RS implementation a longer run is created. The first M
> initialization elements will surely form part of the current run. M is the
> memory size so at least a run sized M will be created. After initialization,
> the elements are not suddenly output, but an element from heap is output
> into run as soon as I get an element from stream. In other words, for each
> element from stream, the root element of the heap is output, and the input
> element takes the root place into the heap. If that element is a "good
> record" I just heapify (since the element will be placed at the now free
> root place). If that input element is a dead record I swap it with the last
> leaf and reduce the heap size.
AFAICS that produces runs that are *exactly* the same length as Knuth's
method --- you're just using a different technique for detecting when
the run is over, to wit "record is not in heap" vs "record is in heap
but with a higher run number". I guess you would save some comparisons
while the heap is shrinking, but it's not at all clear that you'd save
more than what it will cost you to re-heapify all the dead records once
the run is over.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
Subject: Replacement Selection
From: tgl@sss.pgh.pa.us (Tom Lane)
Date: 11/26/2007 5:55:48 PM
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> I guess you would save some comparisons
>> while the heap is shrinking, but it's not at all clear that you'd save
>> more than what it will cost you to re-heapify all the dead records once
>> the run is over.
> This sounded familiar... It sounds a lot like what this CVS log message is
> describing as a mistaken idea:
Wow, I had forgotten all about that; but yeah this sounds exactly like
my first-cut rewrite of PG's sorting back in 1999. I have some vague
memory of having dismissed Knuth's approach as being silly because of
the extra space and (small number of) cycles needed to compare run
numbers in the heap. I hadn't realized that there was an impact on
total number of comparisons required :-(
The discussion from that time period in pgsql-hackers makes it sound
like you need a large test case to notice the problem, though.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
|