I went through this recently posted article on Median of Medians - http://functionspace.org/articles/19 . It's well explained, but I think there may be a better way to attack the problem defined in the article. My friend suggested a k-node heap and other external sorting algorithms. What would be the best way to approach the problem? What if you have much bigger data? Would median of medians be suitable for that too?
Looking forward to this discussion.
Hi Akhilendra,
Let's analyze how to find the $k^{th}$ largest element using a heap! Building a max binary heap takes O(n). Then extracting the maximum element k times takes O(k log(n)) time, so the total time complexity is O(n + k log(n)). It depends on the value of k: if k is of the order of O(n), the total is O(n log(n)); if k is small, say O(log(n)), the total is O(n). The heap might win against median of medians if k is very small, but asymptotically both approaches remain equally complex.
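A minimal Python sketch of this heap approach (function name is my own; Python's `heapq` is a min-heap, so values are negated to simulate a max-heap):

```python
import heapq

def kth_largest(nums, k):
    """Find the k-th largest element via a max-heap.

    Building the heap is O(n); each of the k pops is O(log n),
    giving O(n + k log n) overall.
    """
    heap = [-x for x in nums]    # negate to turn heapq's min-heap into a max-heap
    heapq.heapify(heap)          # O(n) bottom-up heap construction
    for _ in range(k - 1):       # discard the k-1 largest elements
        heapq.heappop(heap)
    return -heap[0]              # root is now the k-th largest
```

For example, `kth_largest([3, 1, 4, 1, 5, 9, 2, 6], 3)` returns 5.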
With a view towards implementation in real systems, if the data gets too huge to fit into main memory of the system, external sorting algorithms might be a better choice.
By the way, the example was written to illustrate median of medians as that was the algorithm we wished to discuss in the article.
Thanks Mohit! Clarified things for me :)
Hi,
@Mohit, the median of medians approach is better only if the data is static, i.e. not coming from a real-time stream. For that basic case any other approach, such as external sorting, would do the same job. But in our projects we deal almost daily with scenarios where the data arrives as a running stream and you need to manipulate that data on the fly; for that it's not a valid approach at all.
Yes Mike, you are right! In the case of dynamic data like an incoming stream, median of medians might not be a good algorithm, since it would have to recalculate the approximate median every time new data arrives!
@Mike I think the problem definition changes when you talk about streaming data, since there's no notion of a $k^{th}$ largest element when you don't know what element might come next. It can only be deterministic if you know the set of numbers in advance, i.e. a static set.
If your application has streaming data and you want to find the $k^{th}$ largest element at any given instant, one approach could be to keep the list sorted and insert each new entry at its right place when it arrives. This is an O(n) operation per insertion. Another approach could be heaps, as discussed earlier.
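Building on the heap idea for the streaming case, here's a sketch (class name and interface are my own invention) that keeps only the k largest values seen so far in a min-heap, so each arrival costs O(log k) instead of the O(n) of a sorted-list insertion:

```python
import heapq

class KthLargestStream:
    """Track the k-th largest element of a stream.

    A min-heap of size k holds the k largest values seen so far;
    its root is always the current k-th largest.
    """
    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap containing at most k elements

    def add(self, value):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, value)
        elif value > self.heap[0]:
            # value belongs among the k largest: evict the smallest of them
            heapq.heapreplace(self.heap, value)
        # current k-th largest (None until k items have arrived)
        return self.heap[0] if len(self.heap) == self.k else None
```

For instance, feeding the stream 4, 5, 8, 2, 10 with k = 3 yields the running answers None, None, 4, 4, 5.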
Right now I'm unable to think of anything that would take less time than this for such a scenario, although I'm unsure whether approximation algorithms might do better in this particular case.