Welcome to the real world
2007-01-09
Many times, the hardest part of creating a computer program is to write the code that deals with user input in particular or the real, outside world in general. Let’s suppose you are coding a program that calculates the inverse matrix of a given one. You may have a smart and elegant algorithm to do that work, and you can create a very beautiful program to implement it. However, the program’s elegance will be spoiled by the code that deals with reading the matrix from a file or from the keyboard. That code is ugly because it’s not a controlled environment. Once you have read the matrix, you have the control. You start operating and, if you do everything right, you only have to catch a few possible errors, if any.
On the other hand, the user may give you anything. You ask them to give you the matrix dimension and the matrix values, and anything can fail. They may input text that is not a number, or a too big number, or too small. Or something that starts with a number but has more text at the end. Or you can reach EOF while reading the matrix data, or its dimensions. What if you read all the values correctly but there’s more text in the file? Do you consider that an error or maybe not, so the user can write comments in the files that store the matrixes? What about displaying a warning? Let’s face it: that code is not beautiful. There are beautiful and smart programs which are all about dealing with user input, like interpreters or compilers, but those are the exceptions. And, even then, the code is many times ugly anyway.
So, how to write good-looking code to deal with user input? I’ll give you my opinion about it. First off, it’s much better when the programming language’s standard library has high level functions to do the work. Ok, usually there’s not a function to say “Read a matrix from this file”. But let’s suppose it has a function that can tell you if a string represents an unsigned integer (from the beginning to the end) or a floating point number, and to split a string using blank space as separators. And some functional programming tools to transform lists. You can read the full file to a string and split it. See if the matrix dimensions are valid quite easily, see if you have enough elements to read, and if all of them are valid floating point numbers, and you can create a matrix easily from that list.
The more high level the functions are, the better. A good example of this is the getopt library, to deal with the always unpleasing task of processing command line options. In particular, the Python implementation of it in the module optparse is very flexible and powerful.
Exceptions can also help in these situations. There are people in favour and against exceptions. It’s somehow controversial. Many people prefer functions that return status codes, and do not like exceptions because catching them is slower, and designing and using the different program layers is more complicated. Other people like them, but warn you about creating programs that handle errors with exceptions. If the program has several layers, they tell you, you must create a set of exceptions for each layer, but you should never, under any circumstances, let an exception appear if it’s not in the layer just above. This is not as simple as it sounds. For example, in my experience programming Putmail I found out that the smtplib Python module sometimes raises exceptions from the socket module. That shouldn’t happen, as socket exceptions should be dealt internally inside the smtplib module, and be transformed into smtplib exceptions. In any case, exceptions let you write naive code that supposes everything is correct, without dealing with any error situation, followed by a block in which you capture possible problems. That code looks more elegant, in my opinion, than dealing with exit status codes and having error checks all over your code, which is the usual situation with user input.
In general, it’s not about user input only. It’s about problems that can’t be modeled easily. Being given a large text and seeing if the text fits a very particular format is not a problem that can be easily modeled mathematically. Sometimes you are asked to build a program that needs a database, but the real world situation of the entities and relationships between them can’t be translated easily to an ER model, and you end up with a very ugly database design. And let’s not forget operating system kernels, which have to deal with (sometimes faulty) real world hardware all the time, thousands of different devices and subsystems that try to interact nicely, internal APIs designed to be good enough for a given subsystem, until a new type of device is invented and you need to include more functions, or redesign that internal layer. I’m sure that’s a pain in the ass.
You may know I’m the author of youtube-dl, a very simple Python script to download videos from YouTube.com. This program is essentially about dealing with the real world. Accessing web servers and retrieving data. It’s all user interaction. I won’t say its code is good, because it’s not. But I think I managed to keep it sane, at least. And this is thanks to Python having a module called urllib2 that does all the hard work for me, and internally deals with many different real world errors that are filtered and presented to me in the shape of a very simple group of exceptions I can deal with easily.
Some days ago I made a small code change which inspired me to write about this topic. In previous versions, I downloaded the video data in 10 KB chunks. I chose that block size because it was good enough for dial up connections, and if you had a very fast connection capable of downloading at 300 or 600 KB per second you only needed to iterate 30 or 60 times per second, which is not too much. That was the rationale behind choosing that number. However, I wanted to make youtube-dl more robust and make it able to deal with any connection speed more or less gracefully. I kept 10 KB as the initial block size, but I tried to implement dynamic block sizes. My objective was to make one iteration per second. I would measure the time it takes to download one block. Dividing the block size by the time it took to download it, I expected to find the download speed in bytes per second. As I wanted one iteration per second, the next time I needed to try to download a block of that size. If the speed was 100 KB per second, I’d try to download a 100 KB block next time. It didn’t work.
With my code, I had just stepped over to the real world and I was there to find a painful reality. In the time it takes to process one block, part of the next is being downloaded. Also, the operating system sits in between my application and the network, which as also bad for my initial idea. The result is that, sooner or later (usually sooner), I would ask for a block size that was already available for the most part. The measured time would be very small, which led the program to believe I was able to download several MB per second. Nice thing this happened more sooner than later, because this allowed me to quickly spot the problem in the first test I performed. In the end, the changes I introduced were not very ugly. I put a limit in the speed the block size could change. The size could be doubled or halved, at most, in the next iteration. This allows for exponential growth, which is a very fast change and can adapt quickly to any network connection speed in a handful of iterations, while at the same time providing a filter so I wouldn’t hit that faulty behaviour. As I said, not too ugly, but the initial ideal model failed to represent the problem appropriately.
By the way, don’t you hate it when you are writting a loop (any loop) and you realize you have to put the first iteration out of it? That gets on my nerves.
Cifras y Letras (2)
2007-01-07
In the previous post, I mentioned the numeric rounds from the Spanish TV show Cifras y Letras. I will talk about the other side of the coin today.
In the letter rounds, the contestans are given 9 letters and they need to find the longest possible word they can using those letters. The letters are not chosen in a fully random way. Each contestant, in turns, asks to be given a vowel or a consonant, until the 9 letters are completed. As there are only 5 vowels and they usually ask for 4 or 5 vowels, it’s not uncommon to have two or even three repeated vowels. If they are given two E’s, for example, they can make words with two E’s, or maybe only one, but not three. Very simple actually.
I read about a quite interesting algorithm to solve this problem as fast as possible in Pedro Reina’s website. Let’s suppose you have a dictionary in a database, with all the words that are valid for the game (plurals are not accepted unless they are present explicitly in the dictionary, and the only verb forms accepted are infinitive, gerund and past participle). If you were given 9 letters, an initial solution everybody can give is to generate all the permutations for those 9 elements, skipping repeated permuations, followed by the permutations of 8 elements, etc. The number of possible permutations is as much as (9! + 8! + 7! + 6! + 5! + 4! + 3! + 2! + 1!). That is, 409113. Quite a lot of permutations that have to be checked against the dictionary database to find available words.
But Pedro Reina and others had a very good idea, preprocessing the dictionary database first. For each word with 9 or less characters, which are the only interesting ones for this game, you sort its letters to obtain and store a “key” for that word. For example, the word “three” is associated with the key “eehrt”. If you do that, you’ll find that a key is usually shared between several words. These keys allow for a reverse lookup. That is, take the given 9 letters, sort them, and search for words with that key in the database. With only one search, that query will return every possible 9-letter word that can be formed with those 9 letters, if there’s any. Instead of having to generate 9! permutations, you only have to do 1 search.
If you have noticed, this allows a jump from permutations to combinations. The total number of combinations you need to form is as much as 511. Forming 511 words is much faster than forming 409113, and easier in my opinion. You sort the initial 9 letters and that’s the only 9-letter combination. By removing each one of those letters you get the as much as 9 8-letter combinations. For each of those, you try to remove one more, etc. Generating permutations is slightly more complicated.
On top of that, as they usually choose 4 or 5 vowels, the probability of having at least a repeated vowel is quite high. In 4 vowels, this probability is 80.8%, and in 5 vowels, it’s 96.16%. This means it’s very usual to have less than 511 combinations.
Update: I have coded a very small C++ program to generate the combinations. Its code is public domain and I can send it on request. It’s only 100 lines long and obviously lacks any database code, as I don’t have an electronic dictionary.
Cifras y Letras
2007-01-04
There’s a Spanish TV show called Cifras y Letras. Literally, Figures and Letters. In it, two contestants play several games. The one who has more points at the end, wins. The two main types of games are the Figures and Letters rounds. In each show, they play 4 rounds of Figures and 8 rounds of Letters. I will be talking about the Figures game.
In it, the contestants are given 6 random numbers from the set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 50, 75, 100}. Then, they are asked to combine them arithmetically, using only natural numbers and exact operations, in order to find another random number from the range [100, 999]. One example: given the numbers (5, 5, 6, 25, 9, 7), obtain the number 881. Only sums, substractions, multiplications and divisions are allowed. One possible answer is ((7 * 5 * 26) + 6). Or, step by step:
- 7 * 5 = 35
- 35 * 25 = 875
- 875 + 6 = 881
Some days ago I coded two programs that solve this problem. One of them in Python, and I made an almost literal translation of the code to C++. The C++ version takes up more space, obviously, because the syntax is not as flexible, and data structures such as lists are not handled natively, but implemented in the standard library. However, in my own tests it runs more than 10 times faster than the Python version. Both programs are public domain code and I can mail them on request. Don’t hesitate to ask for any of the two, especially if you can’t get the C++ tarball.
Internally, I represented the problem as a graph. Starting from a root node, I generate partial solutions as childs of that node, recursively. It’s not a tree, however. Each graph node holds three important pieces of data. First, the list of operations needed to reach that node from the root node. Second, the number that was introduced when generating that node. Third, a list of remaining numbers to be operated, in which the introduced number is present.
To generate the list of children from a given node, I essentially generate all the possible pairs of numbers operated in every possible way, discarding invalid operations. Each child node has a new list of operations identical to the one from the parent node, plus one more operation. The result of that operation is the new number introduced in the child node, and the new list of numbers stems from the one of the parent, in which the pair of numbers operated is replaced with the result of the operation.
This graph is explored recursively and fully applying backtracking in order to find the shortest solution. Two small optimizations are used in both versions, and one small drawback is present in the C++ version. First, the list of numbers are always kept sorted in descending order, because this eases forming valid operations. Second, when we visit a node with a given list of numbers, that list is added to a set, so that if the same list is generated twice, it’s skipped. The drawback present in the C++ version is the way of storing those lists of numbers. The Python version benefits from having a native hash table implementation. By using number lists (or touples, to be precise) as the hash table key, and boolean values, you can easily and efficiently represent a set. However, the C++ standard library doesn’t have a hash table data structure as is. Maps and sets are specifically asked to preserve order, so they internally use some kind of binary tree and their operations are O(log(n)). The new technical revision 1 (or TR1) for the C++ language features a standard container called unordered_set that would be very useful here to store all those lists and ask if one of them is already present. I will update the program code as soon as the revision becomes an official standard, and I have a compiler that implements it.
In any case, I’ve been using the C++ version while watching the show and most problems are solved in under a second. In the show, the contestants are given around 30 seconds to solve each problem. The program sometimes gives very short but “artificial” solutions to the problems. For example, if you are given numbers (100, 6, 3, 3, 1, 1) and asked to find number 606, you will probably do ((100 * 6) + (3 + 3)). However, the program, by choosing the shortest solution, outputs ((100 + 1) * 6). This is not a very good example, since someone may have done it the second way, but sometimes the solutions given by the program, while valid, are really weird. And I mean really weird.
If I wanted the program to give “human” solutions I would either output all possible solutions instead of only the best, or implement a new and somehow complicated criterion to choose solutions that use low and round numbers above weird solutions. But that’s not trivial to do, and not worth it in my opinion. The reason to choose the shortest solution in the first place was to avoid having solutions with unused results. Only as an example, the first solution the program finds for the last problem I mentioned could have contained (1 * 1) as one of the operations, eventhough it’s useless, and the resulting 1 may not even be used at all.
Hello, World!
2007-01-04
I just finished the basic blog setup. I’ll start adding content tomorrow, probably.