How hard can it be to fill up a DVD?

 

 

    (September 2006)

"Yeah, right - as if it's difficult to download a bunch of useless junk from the web and fill up a coaster. Magazines do it all the time!"

Actually, that's not what I meant. This is a developer's site, after all :-)

My daily routine - as a developer writing code for a living - involves a morning visit to slashdot and an afternoon glimpse at the digg. These joyful trips usually end up with me setting up "stuff" to be downloaded during the day, so that I can take them back home for "research". I have a directory set-up in my PC (at work) for these "R&D" materials, and periodically empty it to a rewritable DVD I carry back and forth between work and home.

Things being what they are, work goes around in circles between normal routine and crazy keyboard-hammering. After one of those keyboard destroying months, I realized that "R&D" things had been piling up, and decided to burn them and take them home.

More than 4.3GB were patiently waiting in the research folder.

"Oh well, I'll just take them in two passes - one today and one tomorrow."

And at that point, the question formed in my mind:

What would be the optimal way of doing that, filling up a DVD with a myriad files, making sure that it gets filled up as much as possible?

Think about it. When you have more than a hundred files, there are literally millions of combinations you can try. Which one is the best?

Is it really millions? Since Python allows rapid creation of prototypes with the least amount of effort, I decided to give it a "brute-force" try:

#!/usr/bin/env python import sys g_maxSize=4480 # we'll count in megabytes, and a DVD is 4480M def main(): # the sets formed from the combinations collectionsToTest = [] # will be stored in this variable # Now read integers from STDIN for newNo in sys.stdin.readlines(): try: # Each integer is the size newNo = int(newNo) # in MB of one of the files except: print "%s is not numeric" % newNo continue newWing = [x+[newNo] for x in collectionsToTest] collectionsToTest.append([newNo]) collectionsToTest.extend(newWing) # Now check all the sets maximumLessThanMaxSize = 0 # and find the best one for set in collectionsToTest: # (the one with the closest total = 0 # sum to g_maxSize) for elem in set: total += elem if total > g_maxSize: continue if total > maximumLessThanMaxSize: maximumLessThanMaxSize = total print "Set:", set, "maximum:", \ maximumLessThanMaxSize if __name__ == "__main__": main()
main is split in two blocks. The first one creates all possible sets you can form out of a series of numbers, read from standard input. The second one checks each of these sets to find the best set (the one that gets as close as possible to 4480MB).

Each new number that is input increases the sets to test: the new collectionsToTest is made of:

  • the previous collection of sets
  • the new element, in a single number set
  • each one of the sets of the previous collection augmented to include the new number
So, if you input numbers 4400,10,50,70, you get the following collectionsToTest:
[4400]          (4400 is input)
[10]            (10 is input)
[4400, 10]
[50]            (50 is input)
[4400, 50]
[10, 50]
[4400, 10, 50]
[70]            (70 is input)
[4400, 70]
[10, 70]
[4400, 10, 70]
[50, 70]
[4400, 50, 70]
[10, 50, 70]
[4400, 10, 50, 70]
..and the following result from the prototype:
Set: [4400] maximum: 4400
Set: [4400, 10] maximum: 4410
Set: [4400, 50] maximum: 4450
Set: [4400, 10, 50] maximum: 4460
Set: [4400, 70] maximum: 4470
Set: [4400, 10, 70] maximum: 4480
Revisiting the set creation: each new number input to a previous collection of N sets, is adding a set of one element (with the new member) and N new sets (the previous sets, augmented to include the new number). This means that we go from N elements to N+1+N elements. The work to perform therefore grows faster than 2^N, and that's really fast... After just 20 numbers, we have millions of combinations to test:
bash$ perl -e \
    'for($i=1; $i<20; $i++) { print $i."\n"; }' | \
    ./simpleDVD.py

(you'll run out of memory, you better Ctrl-C 
 while you're ahead)
And I know that my "R&D" directory has hundreds of .zip files, not just 20...

Dynamic programming

If you 've never heard of dynamic programming and look it up in Wikipedia, you might get scared from its technical description. You shouldn't. Its actually quite easier to figure out what it is when you look at it from our - somewhat limited - viewpoint.

Our problem (optimally choosing a subset of integers from a given set to fill up a fixed size container) is one of a class of problems that share a common trait: they are difficult to solve in brute-force, but they allow you to easily find an optimal solution for problem N+1 if you know the solution for problem N.

Let's look at it in detail for our problem. We'll build a table:

Dynamic programming solver for my "R&D" directory migration ;-)
  1 2 3 ... 198
1 MB          
2 MB          
3 MB          
...
4480 MB          

The row index provides the container size: we start from 1MB and move all the way up to 4480MB. The column index is the number of files we use from my directory to fill up the container. I currently have a total of 198 files in the directory, so the table has 198 columns. Notice that we have arranged the files in a particular order, so

  • in column j, we'll store the optimal result using only the first j files
  • in column j+1, the optimal result using only the first j+1 files
  • etc...
This means that the cell at the bottom right will contain the optimal sum we can get out of our files. To be more precise: the value in cell (N,M) is the optimal size we get - that is, the sum of the file sizes of the optimal set for a container of size N using the first M files - so it will always be less than (or equal) to the row index, N (which is the size of the container).

Now imagine that we have filled up all lines up to line j, and all cells of line j+1 up to cell k. How do we fill cell (j+1, k+1)? That is, given that...

  • line j provides us with the best results for a container of size j,
  • and that we have somehow managed to figure out the optimal results for a container of size j+1 up to using the first k files,
...how do we fare with the extra megabyte provided by line j+1 if we also use file k+1?

In order to fill cell (j+1,k+1) of line j+1, we need to see whether the optimal sum of filesizes is the same as it was using the k first files (i.e. the same as in cell (j+1,k)), or whether we can use the extra file (file k+1) and increase space coverage by adding it to the mix.

To do that, we

  • first check to see if file k+1 can be fitted in a container of size j+1. If not, we copy (j+1, k) to (j+1, k+1) - that is, the optimal result stays the same in this line between columns k and k+1, since file k+1 can't fit in j+1 MB.
  • if it can fit, we check if we can increase the space covered by using file k+1 (compared to column k, which uses only the first k files). This is the most complicated part, so pay attention: The optimal space used if we include file k+1 can be found by looking at the optimal result using the first k files with a container of size (j+1)-filesize(k+1) (which tells us the best attainable result for the remaining space after we use file k+1 in the j+1 container, and are thus left with (j+1)-filesize(k+1) space) and adding the size of file k+1 to the result.
If we can't improve the optimal result of the previous column (by using file k+1), we just copy it (we use only the first k files). It's really quite simple: if you can't do it any better, keep the previous result.
#!/usr/bin/env python import sys g_maxSize=4480 def main(): data = [] for key in sys.stdin.readlines(): try: key = int(key) data.append(key) except: print "%s is not numeric" % key continue print "Total of ", len(data), "items" print data dynamic = [] # Build our table of N x M, with pairs of (0,0) as elements for i in xrange(0, g_maxSize+1): dynamic.append([]) for j in xrange(0, len(data)+2): # optimal result: 0, last step to get there: 0 dynamic[i].append([0,0]) # all files 1..j for j in xrange(1, len(data)+1): # all sizes up to g_maxSize for w in xrange(1, g_maxSize+1): if data[j-1] > w: # file j won't fit in container, # so copy best from j-1 files dynamic[w][j][0]=dynamic[w][j-1][0] dynamic[w][j][1]=dynamic[w][j-1][1] else: # file j fits in this container, # but does it improve things? if dynamic[w][j-1][0] >= \ dynamic[w-data[j-1]][j-1][0] + data[j-1]: # No, it doesn't. dynamic[w][j][0] = dynamic[w][j-1][0] dynamic[w][j][1] = 0 # dummy last step else: # Well it does! Update the optimal result # and the last step dynamic[w][j][0] = \ dynamic[w-data[j-1]][j-1][0] + data[j-1] dynamic[w][j][1] = data[j-1] print "Attainable: ", dynamic[g_maxSize][len(data)][0] total = 0 line = g_maxSize pieces = len(data) while total < dynamic[g_maxSize][len(data)][0]: total += dynamic[line][pieces][1] print "+", dynamic[line][pieces][1], "=", total line = line-dynamic[line][pieces][1] pieces-=1 if pieces == 0: break if __name__ == "__main__": main()
You might wonder: how does one go from finding the optimal attainable result to how to get there?

Easy: you don't just keep the optimal sum, but also the last step you used to get there. To reproduce the steps that get you to the final (optimal) result, you go backwards, subtracting the last step each time. In the dynamic table above, we keep the two values (optimal result and last step to get there) in [0] and [1].

Here's a way to test this (with the same values as in our brute force test):

bash$ ./dynamic.py
4400
10
50
70
(Ctrl-D)

Total of  4 items
[4400, 10, 50, 70]
Attainable:  4480
+ 70 = 70
+ 0 = 70
+ 10 = 80
+ 4400 = 4480
...and here's something that would choke the brute force implementation (more than 2^100 sets!) but takes a couple of seconds to finish with dynamic programming: the numbers from 1 to 99 (adding up to 99*100/2 = 4950, so yes, above our target of 4480 - something will be chopped out)
bash$ perl -e 'for($i=1; $i<100; $i++) { print $i."\n"; }' | \
	./dynamicDVD.py
Total of  99 items
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 
93, 94, 95, 96, 97, 98, 99]
Attainable: 4480
+ 0 = 0
+ 0 = 0
+ 0 = 0
+ 0 = 0
+ 95 = 95
+ 94 = 189
+ 93 = 282
...
...
+ 4 = 4474
+ 3 = 4477
+ 2 = 4479
+ 1 = 4480
It works.

Ain't it great? I can now optimally fill my DVDs with junk.

Download an easy to use version here.
Usage is as plain as it gets:

  bash$ fitToSize.py inputDirectory/ outputDirectory/ 4475

  (you can choose the desired capacity, 4475 is just an example)

P.S. I know that using MBs to count is cheating: the correct solution is to use the allocation block of the filesystem as a step size between lines. Hey, close enough! (not to mention it would also make the table a lot taller.)


Back to homepageLast update on: Wed Oct 22 13:47:23 2008 (Valid HTMLValid CSS)