Update (2007-09-15): Recent changes to YouTube make this post partly obsolete, especially the references to /player2.swf. With time, the YouTube-specific information here will go further out of date, but I’m keeping the post open because the method it describes is still perfectly valid. For the most recent mechanism, you can download and read the code of youtube-dl.
In this post I will explain how I obtained the information needed to make software like youtube-dl and its cousins. Reverse engineering YouTube is a simple task that can be considered trivial when compared to reverse engineering a complex piece of software, like a device driver. The overall objective is to create a program that follows the same steps the web browser does, including downloading the video. However, instead of displaying it, we want to store the video file on disk.
It looks more difficult the first time, but once you have done it twice you realize the process is always the same, and you can easily reverse engineer other websites and create programs for them. You can base those programs on the source code of youtube-dl, metacafe-dl or whatever you prefer. That’s one of the points of releasing the program source code. To reverse engineer these websites I use three tools.
- A web browser. I use Konqueror, but you could also use Firefox, Safari, Opera, Internet Explorer… Any browser works as long as it lets you read the webpage source. Using a web browser that lets you run the flash plugin on demand is an advantage. Read on for more details.
- A sniffer. That is, a program that lets you capture and view the network traffic that goes in and out of your computer. I use Wireshark. It’s available for several platforms, including Windows.
- A basic URL retriever like wget, curl, telnet or nc. Something that lets you download web resources and see what’s happening from an intermediate level, between a sniffer and a web browser. I use wget.
This is the procedure I follow:
- I go to a video webpage, with something to block the flash plugin until I’m ready.
- I tell the browser to show me the page code, which makes a new window appear.
- I start Wireshark and tell it to capture all web traffic (using the traffic filter tcp port 80).
- I tell the browser to activate the flash plugin so it downloads the swf object and the video file.
- As soon as the video starts to download, I stop capturing and close the browser, keeping the source code window open.
- I filter captured packets with the display filter http, to see the HTTP requests.
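If you copy a captured request out of the sniffer as text, a few lines of Python can turn it back into the URL the browser asked for. This is only a minimal sketch: the function name `request_url` is made up for this example, and it assumes plain HTTP on port 80, as in the capture filter above.

```python
def request_url(raw_request):
    """Rebuild (method, URL) from the text of one captured HTTP request.

    Assumes plain HTTP, so the scheme is hardcoded to http://.
    """
    lines = raw_request.splitlines()
    # The request line looks like: GET /some/path?query HTTP/1.1
    method, target, _version = lines[0].split(" ", 2)
    host = None
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
    if host is None:
        raise ValueError("captured request has no Host header")
    return method, "http://" + host + target
```

Running it over each request in the capture gives the list of resources to study, in the order the browser fetched them.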
The only reason to start capturing after everything in the webpage except the video player has loaded is to eliminate a good number of web requests that are only meant to display the original video webpage. Those may include several images and other resources which are not important.
In the example I’m using, the browser requested the following web resources, apart from the original video webpage which does not appear in the Wireshark capture for the already mentioned reason. The URLs are written in hyperlink form because they are very long.
The first resource is the flash video player that YouTube sends you; the original webpage tells the browser to download it. This is not very useful. It would be if the player had the video URL hardcoded in it, but that’s not the case. We must then assume that the second request corresponds to the video player attempting to get the video file.
All the parameters used to request the second resource are present in the video webpage, so you only need to extract them using a regular expression or some other mechanism that guarantees you are extracting the correct parameters. I used a regular expression based on /player2.swf because it only appears once and is followed by the needed parameters.
Your program should request the video webpage and then imitate what the YouTube video player does: extract those parameters from the page source, as just explained, and use them to retrieve the second resource. After you download that second resource, you have to extract the information needed to request the third and final resource, which appears to be the video file.
So what does that second resource contain, and how do we extract the parameters needed to access the third resource? We’ll use wget to find out, telling it to retrieve that second resource.
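Before looking at what wget reports, here is a rough sketch of the parameter extraction described above. It is not the code from youtube-dl. The pattern assumes the parameters follow /player2.swf as an ordinary query string, and the names `PLAYER_RE` and `extract_player_params` are invented for this example.

```python
import re
from urllib.parse import parse_qs

# Assumption: the page references the player as /player2.swf?name=value&...
# and the query string ends at a quote, whitespace or angle bracket.
PLAYER_RE = re.compile(r"/player2\.swf\?([^\"'\s<>]+)")


def extract_player_params(page_source):
    """Return the parameters that follow /player2.swf as a dict."""
    match = PLAYER_RE.search(page_source)
    if match is None:
        raise ValueError("no /player2.swf reference found in the page")
    # Inside HTML attributes the separator is often written as &amp;
    query = match.group(1).replace("&amp;", "&")
    # parse_qs returns a list per name; keep the first value of each
    return {name: values[0] for name, values in parse_qs(query).items()}
```

Raising an error when the pattern does not match is deliberate: if the site changes its page layout, the program should fail loudly instead of requesting a URL built from missing parameters.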
When wget tries to retrieve the second resource, we can see that the server replies with a “303 See Other”, and wget immediately follows that indication and requests the third resource.
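To observe the same redirect from a program instead of from wget's output, you can make a single request without following it and look at the status code and the Location header. The sketch below uses Python's standard http.client module, which never follows redirects on its own. The helper names `redirect_target` and `first_hop` are made up for this example, and plain HTTP is assumed.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status, location):
    """Return the absolute URL a redirect points to, or None if there is none."""
    if status in REDIRECT_CODES and location:
        # Location may be relative, so resolve it against the request URL
        return urljoin(request_url, location)
    return None


def first_hop(url):
    """Request url once over plain HTTP and return (status, redirect URL or None)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=30)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        location = response.getheader("Location")
        return response.status, redirect_target(url, response.status, location)
    finally:
        conn.close()
```

Calling `first_hop` on the URL of the second resource should show the 303 status together with the address of the third resource, which is the same information wget prints.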
So it was not that hard. Summary of what you need to do:
- Get the video webpage.
- Extract the /player2.swf parameters.
- Use them to request the second resource (/get_video).
- Follow the URL to start downloading the video.
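The last two steps of that summary can be sketched as follows. This is an illustration under assumptions, not the actual youtube-dl code: `build_get_video_url` and `save_video` are names invented here, the site root is passed in by the caller, and the parameters are whatever was extracted from the video webpage. Note that `urllib.request.urlopen` follows the 303 redirect by itself, so no extra code is needed for that step.

```python
import shutil
import urllib.parse
import urllib.request


def build_get_video_url(site_root, params):
    """Compose the /get_video URL from the parameters found in the page."""
    return site_root.rstrip("/") + "/get_video?" + urllib.parse.urlencode(params)


def save_video(url, path, opener=urllib.request.urlopen):
    """Download url and write the body to path instead of displaying it.

    opener is a parameter only so that the function can be tried
    without network access; by default it is urlopen.
    """
    with opener(url) as response, open(path, "wb") as out:
        # Copy in chunks so the whole video is never held in memory
        shutil.copyfileobj(response, out)
```

A caller would fetch the video webpage, extract the parameters, pass them to `build_get_video_url` and hand the result to `save_video` along with an output file name.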
Other sites such as metacafe.com are not that easy. For example, metacafe.com sends you an XML file when you retrieve what would be the equivalent of the second resource. That XML file contains information needed to compose the URL for the third resource. This means you have to use something like regular expressions again to extract more parameters and use them to compose the URL.
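As an illustration of that extra step, here is a sketch that reads an XML reply and composes a URL from it. Everything about the XML layout is hypothetical: the `item` element, its `id`, `base` and `key` attributes and the resulting URL shape are invented for this example and are not metacafe.com's real format. It also uses Python's XML parser instead of regular expressions, which tends to be more robust when the reply really is well-formed XML.

```python
import urllib.parse
import xml.etree.ElementTree as ET


def video_url_from_xml(xml_text):
    """Compose a video URL from a (hypothetical) XML description."""
    root = ET.fromstring(xml_text)
    item = root.find("item")  # hypothetical element name
    if item is None:
        raise ValueError("no <item> element in the XML reply")
    # Hypothetical attributes holding the pieces of the final URL
    base, video_id, key = item.get("base"), item.get("id"), item.get("key")
    if not (base and video_id and key):
        raise ValueError("XML reply is missing a needed attribute")
    return base + "/" + video_id + ".flv?" + urllib.parse.urlencode({"key": key})
```

For a real site you would replace the element and attribute names with the ones you see in the captured reply.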
Other possible requirements include getting a session cookie first and then confirming that you want to view adult material. The process is equivalent and may involve using Wireshark to find the form URL and the needed form parameters.
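A program can handle both requirements with Python's standard library: a cookie jar keeps the session cookie between requests, and the confirmation is an ordinary form submission. The sketch below is only an outline. The function names are invented here, and the form URL and field names are placeholders for whatever you find in the capture.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_confirm_request(form_url, fields):
    """Build the form submission; supplying data makes urllib send a POST."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(form_url, data=data)


def confirm_and_fetch(opener, page_url, form_url, fields):
    """Get a session cookie, submit the confirmation form, then fetch the page."""
    opener.open(page_url).close()  # first visit: the server sets the cookie
    opener.open(build_confirm_request(form_url, fields)).close()
    with opener.open(page_url) as response:
        return response.read()
```

This assumes the form is submitted with POST. If the capture shows a GET request instead, the fields would go into the query string of the form URL.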