How do I start learning or strengthen my knowledge of data structures and algorithms?

Pawan Bhadauria

As Robert Love said, intuitively understanding how different data structures work and when to use them is extremely important. This helps us shape decisions that ultimately impact the performance of a program. Since this aspect has already been explained, I wanted to illustrate it a little further using some basic examples, which I feel should help create a base for further exploration and for interviews.

At a high level we have two abstract data types, namely dictionaries and containers [priority queues are special containers, since they don't have a fixed order of data retrieval and storage]. In the example below, I am using a container of numbers, not a dictionary.

Consider this. You are dealing with a set of a million numbers. Some are already in your kitty and some are coming down the line. You picture an array of numbers in your mind. Fair enough. Since the numbers arrive in no particular order, your array looks something like this:

[2,3,1,11,9,6,32,31,5…………..]

This array contains critical data, and search, insertion and deletion are all frequent, especially search and insertion. So your manager wants a great data structure that supports efficient search and insertion. Great. You look at your first visualization, which is an array, and find the time complexity of these operations as follows:

search - O(n)
insertion - O(1)
deletion - O(n) [we don't want to create holes in the array, so we have to shift data; otherwise it could be O(1)]
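
To make these costs concrete, here is a minimal sketch in Java (method names are mine, purely for illustration), assuming the array has spare capacity:

    // Linear search: we may have to touch every element - O(n).
    static int linearSearch(int[] a, int count, int target) {
        for (int i = 0; i < count; i++) {
            if (a[i] == target) return i;
        }
        return -1; // not found
    }

    // Insert into an unsorted array: just append at the end - O(1).
    static int insertUnsorted(int[] a, int count, int value) {
        a[count] = value;  // assumes capacity is available
        return count + 1;  // new element count
    }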

You think about it and ask a question: what good is an O(1) insertion when people have to do lookups in O(n) time? Linear time might not look that bad for a small set, but it will wreak havoc on a large data set, since in the worst case you have to iterate through every item while searching. This clearly will not work for search, which forms a critical part of the program. So what do you do? You feel that until you impose some order on the data set, it will be very difficult to do anything other than linear search, because nothing can be ruled out until you have gone through all the items. Fair enough. So you get an idea: why not sort the array, or keep it sorted as the data arrives? If we do that, we can use binary search to find an item in logarithmic, O(log n), time. Great!! So you visualize the array like this now:

[1,2,3,5,6,9,11,31,32…………..]
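
An iterative binary search over this sorted array might look like the following sketch (again, names are mine):

    // Binary search on a sorted array: halve the range each step - O(log n).
    static int binarySearch(int[] a, int count, int target) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;  // avoids overflow of (lo + hi)
            if (a[mid] == target) return mid;
            if (a[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1; // not found
    }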

But what happens to the maintenance cost of the array, which was almost nothing earlier [O(1) insertion, and O(1) deletion too if we tolerate holes]? That will definitely change. Let's say for simplicity that you know how many items there are and have initialized the array to full size [leaving aside the dynamic resizing we might need if the data grows]; you are still faced with finding the right place to insert each item and rearranging the structure to keep it sorted. So both insertion and deletion become a bottleneck and take O(n) time. At this point you can ask yourself a question: if your program had a large number of reads but very infrequent writes, this might work for you :). But clearly that is not the case. Remember what we said: "all search, insertions & deletions are frequent, especially search and insertion". So with a sorted array you have reached here:

search - O(log n)
insertion - O(n)
deletion - O(n)
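
The shifting that makes insertion O(n) looks roughly like this sketch (assuming spare capacity in the array):

    // Insert into a sorted array: find the slot, then shift everything
    // after it one place to the right - O(n) moves in the worst case.
    static int insertSorted(int[] a, int count, int value) {
        int pos = 0;
        while (pos < count && a[pos] < value) pos++; // find insertion point
        for (int i = count; i > pos; i--) {
            a[i] = a[i - 1];                         // shift right
        }
        a[pos] = value;
        return count + 1;
    }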

When the time complexity of an operation drops from O(n) to O(log n), it is amazing. Try to feel it. While searching an unsorted array of 1 million items, your program could take 1 million units of time in the worst case doing a linear search. Compare that with searching the sorted array in about 20 units of time in the worst case (log2 of 1 million ≈ 20). This is a loose explanation, but I hope it signals why data structure enthusiasts are crazy about logarithmic time. It is for this very reason that a recursive binary search can find an item among 1 million items in about 20 recursive calls. Wow!!

[Food for thought - during programming interviews, when an engineer gives you a SORTED array and asks you to write a program to find an element or count the number of duplicates of an item, and you linearly scan through the array to accomplish the job, why shouldn't he go MAD!!]

Coming back to our problem at hand: clearly, insertion and deletion are taking your win on search away, and that needs to be fixed. So you analyze the situation. What is causing insertion and deletion to take so much time? Well, it is the fixed nature of the array and the fact that we need to rearrange items to make room for a new one or close the gap left by a deleted one. If you could just nudge two items apart to create space for a new one without impacting the others, your life would be set:

[1,2,3,5,6] [7] [9,11,31,32……]

Since arrays allocate contiguous memory, this is not feasible without rearranging the other items. So what next? Well, you think that you probably need to move away from the fixed nature of the array to something more dynamic, which can accommodate new numbers without disturbing the others. Also, today you know how many numbers you need to deal with, but tomorrow that can change too.

So you start looking at pointer-based data structures. First, the singly linked list:

[1] -> [2] -> [3] -> [5] -> [6] -> [9] -> [11] -> [31] -> [32]…..

Since you can traverse a singly linked list only in one direction, you feel it will create more problems than it solves:

search - O(n)
insertion - O(n) [O(1) if we already know the node after which to insert]
deletion - O(n) [access is one-way, and we need to update the previous node's pointer]
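
A bare-bones sketch of a singly linked node shows why: splicing in after a node you already hold is O(1), but reaching that node is a linear walk (class and method names are mine):

    class Node {
        int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    // O(1) once we already hold the predecessor node.
    static void insertAfter(Node node, int value) {
        Node fresh = new Node(value);
        fresh.next = node.next;
        node.next = fresh;
    }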

This has clearly taken you backward rather than forward. You now look at the doubly linked list, which seems amazing: given a node, you can move in either direction, so inserting a node before or after a known node works in constant time. But things don't change drastically, since search is still linear, and if you don't already hold the node, the insertion and deletion complexities remain pretty much the same. Still, your data has now come to the following structure:


[1] <-> [2] <-> [3] <-> [5] <-> [6] <-> [9] <-> [11] <-> [31] <-> [32]…..
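
With two links per node, splicing in after a known node is still constant time, just with a couple more pointers to rewire; a sketch (names mine):

    class DNode {
        int value;
        DNode prev, next;
        DNode(int value) { this.value = value; }
    }

    // O(1) insertion after a known node: rewire four pointers.
    static void insertAfter(DNode node, int value) {
        DNode fresh = new DNode(value);
        fresh.prev = node;
        fresh.next = node.next;
        if (node.next != null) node.next.prev = fresh;
        node.next = fresh;
    }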


You feel that these pointer-based data structures provide flexibility for inserting or deleting a node at a particular position without impacting anything else, but they can never provide efficient search unless you have access to the median node of the list, which you could then use for binary search. Since the list is doubly linked, you can move into the left or right partition without issues.

Clearly, you need the best of both worlds: arrays and lists. If you could take the above list in your hands and bend it in the middle, it might work. You also realize that once you bend it, with the bend point acting as the median, you need to bend the left and right halves again to further aid binary search. Since the items are already sorted, this will be great!!

After some mental wrestling and searching on the internet, you find that this is exactly what a binary search tree is: a bent doubly linked list, converted into a tree, to provide efficient binary search.

                     [6]
              [5]            [9]
    [1] <-> [2] <-> [3]   [11] <-> [31] <-> [32]

... and bending each half again:

                  [6]
           [5]          [9]
        [2]                [31]
     [1]   [3]          [11]  [32]


You have travelled a long way, from an unsorted array to a binary search tree, in search of optimality. What does this structure give you? Well, if maintained optimally, it provides the following operation time complexities:

search - O(log n) *
insert - O(log n) *
delete - O(log n) *

* - O(h) to be precise, more explanation below.
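
A minimal sketch of an (unbalanced) BST makes the O(h) nature visible: every operation walks a single root-to-leaf path (names are mine):

    class TreeNode {
        int value;
        TreeNode left, right;
        TreeNode(int value) { this.value = value; }
    }

    // Search and insert each descend one level per step: O(h),
    // which is O(log n) when the tree is balanced.
    static boolean contains(TreeNode root, int target) {
        while (root != null) {
            if (target == root.value) return true;
            root = (target < root.value) ? root.left : root.right;
        }
        return false;
    }

    static TreeNode insert(TreeNode root, int value) {
        if (root == null) return new TreeNode(value);
        if (value < root.value) root.left = insert(root.left, value);
        else if (value > root.value) root.right = insert(root.right, value);
        return root; // duplicates ignored
    }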

This is just amazing!! But hold on, you said "if maintained optimally". Consider the following data set:

[32,31,11,9,6,5,3,2,1….]

When you start adding these items to a binary search tree, it will choose 32 as the root and keep adding items to the left subtree, since every item is less than 32. What will this do? In technical terms, it will make the tree skewed. So essentially, the operations you are trying to optimize are a function of the height of the tree, which in the ideal case is O(log n) [optimal height - log n] and in the worst case can be O(n). So you were right to say "if maintained optimally". But what is the impact on our search and insertion? Since we can't predict the order in which data arrives and gets inserted into the binary search tree, we might end up with inefficient binary search, because the median is skewed.

So it seems a binary search tree usually works great for you, but you sometimes need to rearrange it so that it doesn't become skewed. This is where we enter the domain of balanced binary trees. These trees keep the binary tree balanced so that your median stays in the right place and binary search works as expected. AVL trees, for example, keep the heights of every node's left and right subtrees within one of each other, while red-black trees use node colouring to bound the height. In practice, you will rarely use a plain binary search tree; rather, you will use red-black trees or another self-balancing variety, so that your operations stay efficient regardless of the type or ordering of the data. It is for this reason that Java implements its tree-based collections (TreeMap, TreeSet) using red-black trees.
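
In Java you rarely write the balancing yourself; a quick sketch of leaning on TreeSet instead:

    import java.util.TreeSet;

    public class Demo {
        public static void main(String[] args) {
            TreeSet<Integer> numbers = new TreeSet<>(); // red-black tree underneath
            for (int n : new int[] {32, 31, 11, 9, 6, 5, 3, 2, 1}) {
                numbers.add(n);                         // O(log n) each; tree self-balances
            }
            System.out.println(numbers.contains(11));   // O(log n) search -> true
            System.out.println(numbers.first());        // minimum -> 1
            System.out.println(numbers.last());         // maximum -> 32
        }
    }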

So you have reached a place in your voyage where you have a data structure (the balanced binary search tree) that is guaranteed to provide logarithmic time complexity for all these operations. Congrats!!

Now let's say you have built a binary search tree which is efficient, self-balancing, and provides the majority of operations in logarithmic time. This is incredible. Your manager comes to you and asks you to modify things to find the minimum and maximum in this set of elements. You tell him that he can traverse the binary search tree and take the leftmost element, which is the minimum, and the rightmost, which is the maximum.
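
Both extremes are a single O(h) walk; a sketch using the TreeNode from earlier (assumes a non-empty tree):

    // Minimum lives at the leftmost node, maximum at the rightmost - O(h).
    static int findMin(TreeNode root) {
        while (root.left != null) root = root.left;
        return root.value;
    }

    static int findMax(TreeNode root) {
        while (root.right != null) root = root.right;
        return root.value;
    }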

But now he tells you that he doesn't need extensive search capability; he just needs to repeatedly extract the item with the maximum value. This value is critical, as it determines what must be done next. As soon as someone asks for repeated look-ups of the minimum or maximum, an alarm should start ringing. Whether it is scheduling the highest-priority job or other processing systems, this concept points to priority queues, i.e. heaps. Heaps are incredible. They give you the amazing ability to extract the min or max in logarithmic time and can be implemented using an array or an array-based list. Internally they sift elements up or down based on priority, setting the stage for extraction in logarithmic time [you might want to look at these answers, where I provided min- and max-heap implementations: How can a max heap be converted to a min heap efficiently? and Pawan Bhadauria's answer to Why is heap sort used?, where I explained the details of heaps and heap sort]. You replace your container with a heap, which is what your manager is probably looking for. Both max-extraction and insertion can still be done in logarithmic time. [I could go a little further and relate this to other concepts, but I will force myself to stop here :)]
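
In Java, PriorityQueue is a binary heap (a min-heap by default), so a max-heap only needs a reversed comparator; a quick sketch:

    import java.util.Collections;
    import java.util.PriorityQueue;

    public class HeapDemo {
        public static void main(String[] args) {
            // Max-heap: the largest element is always at the head.
            PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
            for (int n : new int[] {2, 3, 1, 11, 9, 6, 32, 31, 5}) {
                heap.offer(n);                   // O(log n) insert (sift up)
            }
            System.out.println(heap.poll());     // extract max -> 32, O(log n) (sift down)
            System.out.println(heap.poll());     // next max -> 31
        }
    }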

I tried to explain this slowly and may have skipped a few intricacies, but the idea is that when someone asks you for an algorithm, you can provide an efficient one only if you have visualized, and pondered the pros and cons of, the various data structures in your arsenal. After practicing for a while, you will not even think about the intermediate steps and will jump directly to the right data structure. But for basic understanding, it makes sense to see how everything fits together.

I didn't take the example of key-value-pair data structures, a.k.a. dictionaries/hash tables, but they form the backbone of computer programs, and they are worth the effort to understand. Because they provide expected constant-time retrieval and insertion of values by key [yes, I said CONSTANT], they are incredibly fast. A few things would have changed had I chosen a different data set and taken this path, but I will leave that for you to ponder.
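
Just to give a taste of that path, a minimal sketch using Java's HashMap (the keys and values here are made up for illustration):

    import java.util.HashMap;

    public class MapDemo {
        public static void main(String[] args) {
            HashMap<String, Integer> ages = new HashMap<>();
            ages.put("alice", 30);                 // expected O(1) insert
            ages.put("bob", 25);
            System.out.println(ages.get("alice")); // expected O(1) lookup -> 30
        }
    }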
