
2007-02-21

Reverse engineering YouTube

Filed under: Programming — rg3 @ 22:38

Update (2007-09-15): Recent changes to YouTube make this post partly obsolete, especially the references to /player2.swf. With time, the information regarding YouTube here will become obsolete, but I’m keeping the post open as the method described here is still perfectly valid, and for the most recent mechanism you can download and read the code from youtube-dl.

In this post I will explain how I obtained the information needed to make software like youtube-dl and its cousins. Reverse engineering YouTube is a simple task that can be considered trivial when compared to reverse engineering a complex piece of software, like a device driver. The overall objective is to create a program that follows the same steps the web browser does, including downloading the video. However, instead of displaying it, we want to store the video file on disk.

It looked more difficult the first time, but when you do it twice you realize it’s always the same and you can reverse engineer other websites easily and create programs for them. You can base those programs on the source code of youtube-dl, metacafe-dl or whatever you prefer. That’s one of the points of releasing the program source code. To reverse engineer these websites I use three tools.

  • A web browser. I use Konqueror, but you could also use Firefox, Safari, Opera, Internet Explorer… Any is valid if it lets you read the webpage source. Using a web browser that lets you run the flash plugin on demand is an advantage. Read ahead for more details.
  • A sniffer. That is, a program that lets you capture and view the network traffic that goes in and out of your computer. I use Wireshark. It’s available for several platforms, including Windows.
  • A basic URL retriever like wget, curl, telnet or nc. Something that lets you download web resources and see what’s happening from an intermediate level, between a sniffer and a web browser. I use wget.

This is the procedure I follow:

  1. I go to a video webpage, with something to block the flash plugin until I’m ready.
  2. I tell the browser to show me the page code, which makes a new window appear.
  3. I start Wireshark and tell it to capture all web traffic (using the traffic filter tcp port 80).
  4. I tell the browser to activate the flash plugin so it downloads the swf object and the video file.
  5. As soon as the video starts to download, I stop capturing and close the browser keeping the source code window open.
  6. I filter captured packets with the display filter http, to see the HTTP requests.

Starting to capture the data when everything in the webpage except for the video player is loaded has the only purpose of eliminating a good amount of web requests that are only meant to display the original video webpage. That may include several images and other resources which are not important.

In the example I’m using, the browser requested the following web resources, apart from the original video webpage which does not appear in the Wireshark capture for the already mentioned reason. The URLs are written in hyperlink form because they are very long.

  1. The flash video player.
  2. A second web resource.
  3. The final video URL.

The first resource is the flash video player that YouTube sends you. This is not very useful. It would be if the player had the video URL hardcoded in it, but that’s not the case. We must then assume that the second request corresponds to the video player attempting to get the video file.

The original webpage tells the browser to download the first resource in the list. All the parameters used to request the second resource are present in the video webpage, so you only need to extract them using a regular expression or another mechanism that guarantees you are extracting the correct parameters. I used a regular expression based on /player2.swf because it only appears once and is followed by the needed parameters.

Your program should request the video webpage and then imitate what the YouTube video player does: retrieve the second resource with those parameters, extracted from the video webpage source code as just explained. After you download that second resource, you have to extract the information needed to request the third and final resource, which looks like the video file.

So what does that second resource contain, and how do we extract the parameters needed to access the third resource? We’ll use wget to find out, telling it to try to retrieve that second resource.

When wget tries to retrieve it, we can see that the server replies with a “303 See Other” and wget immediately follows that indication and requests the third resource.

So it was not that hard. Summary of what you need to do:

  • Get the video webpage.
  • Extract the /player2.swf parameters.
  • Use them to request the second resource (/get_video).
  • Follow the URL to start downloading the video.
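As a rough illustration of the second and third steps, here is a minimal Python sketch. It is not the actual youtube-dl code: the page fragment, the parameter names (video_id and t), the host name and the helper function names are all made up for the example, and only /player2.swf and /get_video come from the description above.

```python
import re
from urllib.parse import parse_qs, urlencode

# Hypothetical page fragment. Real pages differed and have since changed.
SAMPLE_PAGE = (
    '<embed src="/player2.swf?video_id=abc123&t=XYZ789" '
    'type="application/x-shockwave-flash">'
)

# Capture everything after "/player2.swf?" up to a quote, space or ">".
PLAYER_RE = re.compile(r'/player2\.swf\?([^"\'\s>]+)')


def extract_player_params(page_source):
    """Return the query parameters that follow /player2.swf as a dict."""
    match = PLAYER_RE.search(page_source)
    if match is None:
        raise ValueError("player parameters not found in page source")
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(match.group(1)).items()}


def build_get_video_url(params, base="http://www.youtube.com/get_video"):
    """Compose the URL of the second resource (the host is an assumption)."""
    return base + "?" + urlencode(params)


params = extract_player_params(SAMPLE_PAGE)
url = build_get_video_url(params)
print(url)
```

Requesting that URL with urllib.request.urlopen would follow the “303 See Other” redirect automatically, much like wget did, so the response body would be the video data.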

Other sites such as metacafe.com are not that easy. For example, metacafe.com sends you an XML file when you retrieve what would be the equivalent of the second resource. That XML file contains information needed to compose the URL for the third resource. This means you have to use something like regular expressions again to extract more parameters and use them to compose the URL.

Other possible requirements include getting a session cookie first and then confirming you want to be able to view adult material. The process is equivalent and may involve using Wireshark to get information on the form URL and the needed form parameters.

2007-02-13

House, Cable Doctor

Filed under: Hardware — rg3 @ 18:24

I have encountered a hardware problem that I’m unable to solve. It’s a long story, but here it goes.

I have two computers at home. A laptop computer that we are going to call Computer L, which has a winmodem that works under Linux. We are going to call it Modem L. This computer is normally used in a room in the house which has a phone jack that we are going to call Socket L, and it’s used with a telephone cable and plug that we are going to call Cable L.

The second computer is a desktop computer which sits in another room, and that we will be calling Computer D. It has a Creative Blaster External V.92 Modem (serial port) that we’ll call Modem D. It’s in a room which has a phone jack called Socket D, and it had a telephone cable and plug called Cable D.

One day, the modem in computer D was no longer able to connect to the Internet. As far as I know, nobody had touched anything related to the connection. I’m pretty sure I didn’t touch anything, and nobody else in the house touched anything either. As soon as I found out, I thought one obvious problem was that somebody had changed the password without noticing (it uses KPPP to connect). I retyped the username and password very carefully, even though the connection process apparently didn’t reach the authentication step. As I supposed, that was not the problem because it didn’t change anything. I thought the modem could be damaged due to a storm or something similar (a storm fried my previous modem). Taking into account modems are not very cheap (over 30 or 40 euros usually), I decided to check a couple of things before buying a new modem. The modem was able to detect the dialtone, apparently. When activating the modem sound, I can hear it dialing properly, but a continuous high-pitched beep starts after a few seconds, together with the normal background modem noise (bzzz…). The “CONNECT” is never received, according to the KPPP logs, and the connection attempt times out after 60 seconds.

I then tried to rule out a cable problem, so I replaced cable D with cable L. Both cables are normal telephone cables, as far as I know. Cable L is black and has a normal 4-pin RJ11 connector at both ends. Cable D is almost identical, as far as I could see, but in light grey. Cable L is a little bit longer and has the following text written on it: E157914 C <weird symbol that looks like a 9 and a U joined> AWM 20251 26AWG IA 60ºC 150V TLE. As I was saying, I tried cable L with computer and modem D to rule out a problem in cable D. To my surprise, the modem connected and worked correctly with cable L. Just to be sure it hadn’t been a coincidence, I used cable D again and it failed as described above. I tried cable L again and it worked once more. I tried cable D and it didn’t work.

At this point, I was sure cable D was the problem, so the next day I went to the supermarket, bought a similar cable and got rid of cable D. Let’s call the new cable Cable D2. I have cable D2 right here. It’s white and 2.1 meters long, about the same as the old cable D, but still shorter than cable L. It doesn’t have anything written on it. I happily connected cable D2 to socket D and modem D, and tried to connect. It didn’t work. I got the exact same problem I had with old cable D. To rule out coincidences, I tried several times but I was unable to connect. I tried cable L again and it worked. I repeated the same tests a couple more times to confirm the results. At this point I was puzzled. The next thing I tried was to check cable D2 wasn’t damaged. After all, it could be bad luck and the shiny new cable D2 may have been broken too. I tested cable D2 with modem L and socket L, and it connected normally. So… cable D2 was not broken. Modem D is not broken because it’s capable of connecting every time I use cable L. The only two components remaining were socket D and the phone jack in modem D.

If they were somehow damaged, maybe the plugs in cable L had something specific that made that cable successful. At this point in time, I noticed the plugs in cable L had something broken. The small plastic piece that goes “clic!” when you plug it in was missing in both ends. They broke long ago after plugging and unplugging the cable many times. So I broke them on purpose in cable D2. I tried it again, but was unable to connect. I tried both cable ends in both sockets and went as far as to make sure the cable was hanging in the air the same way cable L did. I carefully tested the cable several times but I was unable to connect with cable D2, while cable L kept working.

There is one obvious solution at this point, which is to forget about the problem and interchange cable L and cable D2. Cable L would be used with computer D and cable D2 with computer L. That’s not possible because cable D2, as cable D, is too short. To test it, I took advantage of computer L being a laptop and I put it closer to the socket D wall. However, it won’t reach where computer L sits normally and, furthermore, I wanted to know what the hell was going on!

So, from my point of view, everything boiled down to either socket D or the phone jack in modem D. I did one more test. I tried computer L (modem L) with cable D2 connected to socket D, and it worked without any problems. The only thing remaining is the phone jack in modem D, but I’ve been unable to connect with cable D2, even though I tried several times plugging it in harder, softer…

The next thing I’ll do is to buy a longer telephone cable. I’ll try it again with modem D and socket D and, if it doesn’t work, I’ll use it with computer L and leave cable L there, so at least I’d have everything working. However, the mystery would still be unresolved. I don’t know why cable L works, why old cable D stopped working or why cable D2 doesn’t work either, when using them with modem D.

If you have experienced a similar problem in the past, or if you have new tests to suggest or a possible explanation, please contact me. There’s a link called “Contact me” in the right side bar, as you may know. It takes you to my freshmeat user page. Don’t hesitate to request further details if you want. I’ll provide as much information as needed.

I took 3 pictures showing the more relevant physical details, with notes about things I found interesting. Differences between cable D2 and cable L, and something slightly unusual in the phone jack of modem D.

Creative Blaster V.92 Modem phone jack

Cable D2 phone plugs

Cable L phone plugs

Summing up

Several people requested a summary. Initial facts:

  • Connection stopped working. Replacing current cable with another cable from my laptop worked, apparently ruling out modem problems.
  • Got rid of old cable, bought new cable. New cable didn’t work either, to my surprise.
  • Several tests consistently show that the new cable works in other setups, apparently ruling out cable problems and bad luck.

Several tests consistently confirm the following results, all of them taking place in the desktop computer room, with the wall phone jack from that room:

  • New cable, laptop computer worked.
  • Different cable (the one from the laptop), desktop computer worked.
  • New cable, desktop computer did not work.

Can it be a problem in the old and new cables? I don’t think so because:

  • New cable works as a laptop cable replacement, in the laptop room.
  • New cable works in the desktop computer room, using the laptop.

Can it be a problem with the internal modem components? I don’t think so because:

  • It always works if I use the laptop cable.

Can it be a problem in the wall phone jack? I don’t think so because:

  • Laptop computer can always connect with it (only tried using the new cable, I could test with its own cable for completeness).
  • Desktop computer always works using the laptop cable.

Can it be a problem in the modem phone jack? That’s the current most viable explanation, but it works using the laptop cable for some reason. Maybe it’s only half-broken and the laptop cable has something special the other cables don’t have and that makes it work?

Update (2007-02-14)

I just bought a new long cable (let’s call it D3) that I’m going to use with the laptop, while I keep cable L with modem D, because it’s the only one to work.

Tried tests:

  • Cleaning the phone jack of modem D with vinegar. Didn’t work.
  • Using one of the cables that make the modem fail to connect, calling the house number and checking with minicom whether the modem sees the ring and is able to pick up the phone and hang up. It can. Yet it fails to connect for some reason.
  • Trying cable D3. Didn’t work.

Pending tests:

  • Try cleaning the phone jack in modem D with a pen eraser.

Someone called Jeff was smart enough to observe that cable L differs from the other cables in that its plugs are also reversed 180 degrees. This raises several questions. Is that important in analog telephony? Is that why cable L works? How could cable D work previously? Why did cable D stop working then? Cable D may have been like cable L and when it stopped working, it could have been genuinely broken. I then bought cable D2 but it didn’t work because it was not reversed. If that’s the case, new question: why does cable D2 work when using it with the laptop? Tough questions.

2007-02-08

Gregorian and Julian calendars

Filed under: Programming,Software — rg3 @ 20:09

Some months ago my father asked me if it was possible to find out the day of the week for any day in history. I immediately asked what range of years he was interested in, and he explained to me that he had a good number of dates, some of which were relatively modern but the oldest ones dated back to the beginning of the 18th century. I recalled using the cal Unix command to get the calendar for any month and volunteered to look up as many dates as he wanted. cal is able to show you the calendar for any month from year 1 to year 9999. This was alright until, one day, he asked me if I was sure that program used the correct Gregorian reformation data. I didn’t know what he was talking about because I was completely ignorant of the topic. I had heard we hadn’t always been using the current calendar, but I didn’t know when the last changes were introduced or what those changes consisted of. After all, I had been reading dates in history books and nobody ever mentioned anything. By searching in Google and reading the Wikipedia article on the Gregorian calendar I found out that the changes had happened not long ago. And what is even worse, that the changes were not applied to all countries at the same time.

I then headed to the cal manpage, only to find the following paragraph:

The Gregorian Reformation is assumed to have occurred in 1752 on the 3rd of September. By this time, most countries had recognized the reformation (although a few did not recognize it until the early 1900′s.) Ten days following that date were eliminated by the reformation, so the calendar for that month is a bit unusual.

Indeed, if we ask for the calendar of that month, we see that not 10, but 11 days were skipped. The first and main shock was finding out that not long ago people had skipped 11 days in one month. They jumped from September 2 to September 14! No, really, they did. I’ll paste the calendar for that month:

   September 1752   
Su Mo Tu We Th Fr Sa
       1  2 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30

In Spain and some other Catholic countries the change had happened in October 1582, which compromised all the dates we had looked up prior to September 3, 1752 (Doh!). Why did they have to skip those days? In the Julian calendar, a leap year happened every 4 years to compensate for the inaccuracy of the 365-day calendar. But later, more precise calculations were done and they found out those were too many leap years. They introduced the modern leap year test (it’s a leap year if it’s divisible by 4, but not if it’s divisible by 100, unless it’s also divisible by 400) but also realized that the previous calendar had already introduced too many leap days, and that was why they were about 10 days behind and they needed to skip that number of days.
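The two leap year tests can be written down in a few lines. This is only an illustrative Python sketch with function names of my own choosing. Counting over an arbitrary 400-year span shows the three extra leap days the Julian rule adds compared to the modern rule.

```python
def is_leap_julian(year):
    """Julian rule: every fourth year is a leap year."""
    return year % 4 == 0


def is_leap_gregorian(year):
    """Modern rule: divisible by 4, but not by 100 unless also by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Leap years in a 400-year span (years 1 to 400) under each rule.
julian_leaps = sum(is_leap_julian(y) for y in range(1, 401))
gregorian_leaps = sum(is_leap_gregorian(y) for y in range(1, 401))
print(julian_leaps, gregorian_leaps)  # prints: 100 97
```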

I can imagine a hilarious situation with a group of scientists going to meet pope Gregory XIII: “Your Holiness, we found out we need to skip 10 days.” Gregory XIII raises an eyebrow, “What?”. “That we need to skip 10 days. Like jumping from October 4th to October 15th.” Gregory XIII blinks. They take out some papers full of charts and equations and start to explain everything to the pope. The amazing thing is that they really did it. I can’t imagine the situation if this were to happen nowadays. People would print T-shirts with the sentence “I survived the Gregorian reformation”. Or the Benedictian reformation or whatever. This is the calendar for October 1582 in the countries that switched that year.

    October 1582
Mo Tu We Th Fr Sa Su
 1  2  3  4 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31

As I said before, not every country switched at the same time. Russia, for example, didn’t switch until 1918. That’s not even 100 years ago! I imagine the Gregorian reformation is or was a problematic point for law people, insurance companies and others, and some computer software probably has or had to take it into account to calculate things.

Going back to the problem with my father, I started to search for a program that would use the correct reformation data but at that moment I couldn’t find any. Fortunately, the CVS server for the FreeBSD project has web access to the cal source code and someone was nice enough to describe the algorithm needed to get the calendar for any month[1], and I created a Python program first and a C++ program later that worked like a simplified cal and used the Spanish reformation data. I can mail both on request if you want. However, I don’t think it’s needed because later I found some websites which are able to give you calendars depending on the country you live in[2] and I also found out about a tool called ncal which also has a country selector. It’s present in modern BSDs as well as some Linux distributions. Unfortunately for me, Slackware doesn’t have it. Debian users, for example, have it as part of the bsdmainutils package at the time I’m writing this.

The algorithm these programs use is quite simple once you know a key fact: January 1, 1, was a Saturday. From then, you only need to calculate the number of days that have passed until the day you need. This calculation is usually divided into the following steps:

  1. Take the number of previous years and multiply by 365.
  2. Add one more day for each previous leap year. This number can be easily computed as the year number divided by 4 for every year previous to the reformation, and subtracting the number of years divisible by 100 and adding the number of years divisible by 400 for the years passed since the reformation.
  3. Then, add the number of days in the previous months of the year you want.
  4. Add one more if the year you want is a leap year and you want a month after February.
  5. Finally, subtract the number of days skipped in the Gregorian reformation if it’s a later date.

Knowing the previous number, plus knowing January 1, 1 was Saturday, plus using the modulus operation, you can know the day of the week any month starts at, and use it to print a calendar.
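The steps above can be sketched in Python as follows. This is not the program I mentioned, just a minimal illustration: the Reform tuple, the function names and the two sets of reformation data (Spain and Britain, taken from the calendars shown earlier) are choices made for this example. It assumes a reformation year that is not itself a century year.

```python
from collections import namedtuple

# Reformation data: year and month of the switch, first Gregorian day of
# that month, and number of days skipped. Names are illustrative.
Reform = namedtuple("Reform", "year month first_day skipped")
SPAIN = Reform(1582, 10, 15, 10)
BRITAIN = Reform(1752, 9, 14, 11)

# January 1, 1 was a Saturday, so a count of 0 elapsed days means Saturday.
WEEKDAYS = ["Saturday", "Sunday", "Monday", "Tuesday",
            "Wednesday", "Thursday", "Friday"]
MONTH_DAYS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def is_leap(year, reform):
    """Julian rule up to the reformation year, modern rule afterwards."""
    if year <= reform.year:
        return year % 4 == 0
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_since_start(year, month, day, reform):
    """Days elapsed between January 1, 1 and the given date."""
    prev = year - 1
    days = prev * 365                      # step 1: previous years
    days += prev // 4                      # step 2: Julian leap days
    if year > reform.year:                 # ...corrected after the reformation
        days -= prev // 100 - reform.year // 100
        days += prev // 400 - reform.year // 400
    days += sum(MONTH_DAYS[:month - 1])    # step 3: previous months
    if month > 2 and is_leap(year, reform):
        days += 1                          # step 4: leap day of this year
    days += day - 1
    if (year, month, day) >= (reform.year, reform.month, reform.first_day):
        days -= reform.skipped             # step 5: skipped days
    return days


def weekday(year, month, day, reform=SPAIN):
    return WEEKDAYS[days_since_start(year, month, day, reform) % 7]


print(weekday(1582, 10, 4), weekday(1582, 10, 15))  # Thursday Friday
print(weekday(1752, 9, 14, BRITAIN))                # Thursday
```

Knowing the weekday of the first day of a month is enough to print a calendar for it, which is what a simplified cal needs.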

Links

  1. FreeBSD cal README.
  2. Online calendar generator.

2007-02-02

What are Unicode, UTF-8 and UTF-32?

Filed under: Programming,Software — rg3 @ 20:06

This is a fairly simple topic, easily addressed by checking out Wikipedia or some other online resource. However, let’s clarify it. To put it simply, Unicode is a standardized character table that associates every character with a number, its character code. For example, the Latin letter a has the unicode character code 97, or 61 in hexadecimal, and it’s also often referred to as U+000061.

UTF-8 and UTF-32 are, on the other hand, encoding schemes. They specify the byte or octet sequence associated with a given sequence of unicode character codes. They also specify, given a sequence of bytes or octets, which unicode character code sequence, if any, they represent. For example, let’s suppose you want to store the Unicode character number 97, that is, the unicode character U+000061, which is the letter a. It will be stored in different ways depending on the chosen encoding scheme. For example, UTF-32 is quite simple and direct. It uses 4 bytes (octets) to store any Unicode character, and each 4-byte group is the result of storing the character code as a 4-byte natural number. Both byte orders exist (UTF-32BE and UTF-32LE), but it doesn’t matter for this example and we will suppose it’s big endian. The letter a would be stored as 00 00 00 61.

Among the common Unicode encoding schemes, UTF-32 is the only one to use a fixed number of bytes for every character. UTF-8, for example, uses a variable number of bytes to store each character, depending on the character code. UTF-8 was designed with a purpose in mind: to be backwards compatible with the ASCII character set. This means that the first 128 possible bytes (from 00 to 7F) would be used to store the same characters present in the ASCII character set and, moreover, they would only be used to store those characters. UTF-8 guarantees, for example, that if the byte 00 appears in a sequence, it represents the null character and is not part of any multibyte sequence associated with some other character. For this reason, the letter a would be stored in UTF-8 as a single byte 61.
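These byte sequences are easy to check with Python’s built-in codecs. The helper name below is made up for this example, and the euro sign is just an arbitrary non-ASCII character added for contrast.

```python
def encodings(char):
    """Return the character code plus its UTF-32 (big endian) and UTF-8 bytes in hex."""
    return ord(char), char.encode("utf-32-be").hex(), char.encode("utf-8").hex()


for char in ("a", "\u20ac"):  # the letter a and the euro sign
    code, utf32, utf8 = encodings(char)
    print(f"U+{code:06X}  UTF-32: {utf32}  UTF-8: {utf8}")

# prints:
# U+000061  UTF-32: 00000061  UTF-8: 61
# U+0020AC  UTF-32: 000020ac  UTF-8: e282ac
```

The letter a takes four bytes in UTF-32 and one in UTF-8, while the euro sign still takes four bytes in UTF-32 but three in UTF-8.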

Why is that important? Because much software and many protocols don’t need to be rewritten to be compatible with UTF-8. For example, in the C programming language, every string is terminated by a null character ‘\0′ which marks the end of the string. The description of UTF-32 is incompatible with this definition. The letter a would be stored as 00 00 00 61, as we saw before. You can see the null byte appearing 3 times in the sequence, and in none of them does it mark the end of the string, should the letter a appear in the middle of a string. Another example: in Linux, a file name can contain any octet, except 00 and 2F (the null character and the ASCII code for the directory separator ‘/’). Thanks to the definition of UTF-8, the Linux kernel is prepared to deal with UTF-8 filenames without making any changes. The byte 00 can still be used as the string terminator and the byte 2F can still be used as the directory separator. The definition of UTF-8 guarantees those two bytes won’t appear anywhere except when they represent those exact same characters.
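A short Python sketch can illustrate both points. The c_string helper is a made-up stand-in for how C code treats a buffer, and the sample characters are arbitrary. U+012F is used because its character code happens to end in 2F.

```python
def c_string(data):
    """Mimic C string handling: stop at the first null byte."""
    return data.split(b"\x00", 1)[0]


# In UTF-32 the very first byte of the letter a is a null byte,
# so C-style code would see an empty string.
print(c_string("a".encode("utf-32-be")))        # b''

# In UTF-8 every byte of a non-ASCII character is 80 or higher,
# so neither 00 nor 2F can show up inside a multibyte sequence.
sample = "\u00f1\u012f\u20ac".encode("utf-8")
print(all(byte >= 0x80 for byte in sample))     # True

# U+012F in UTF-32 is 00 00 01 2F: it contains both problematic bytes.
print("\u012f".encode("utf-32-be").hex())       # 0000012f
print("\u012f".encode("utf-8").hex())           # c4af
```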

To sum up, Unicode is a standard table that, among other things, associates each character with a unique character code, and UTF-8, UTF-32 and others are encoding schemes. They describe how to represent those character codes using bytes or octets. Notice how, in the encoding schemes we described, the representation of character number U+000061 contains the byte 61. This is the result of trying to keep things reasonably simple. You could describe another encoding scheme yourself in which, for example, character U+000061 is encoded as the sequence 67 90 F2 if you wish. Any encoding scheme would be valid as long as it lets you transform a sequence of character codes to a sequence of bytes, and the resulting sequence of bytes back to the same sequence of character codes. However, not every encoding scheme is backwards compatible with ASCII.
