When a web browser sends a request to a short URL, it usually receives a redirect response with HTTP status 301. Most browsers automatically load the redirected location, but your own programs may want to handle this themselves (see Fetching_Instagram_Pictures).
Offering a short-URL service is a great way to collect user data. Such sites can not only track visitors to sites that they don't own, but also place cookies in the visiting browsers.
Here's a quick example:
$ curl -v http://t.co/z9j0Drd7
* About to connect() to t.co port 80 (#0)
* Trying 199.16.156.11... connected
* Connected to t.co (199.16.156.11) port 80 (#0)
> GET /z9j0Drd7 HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: t.co
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Wed, 26 Sep 2012 02:17:41 GMT
< Cache-Control: private,max-age=300
< Expires: Wed, 26 Sep 2012 02:22:41 GMT
< Location: http://instagr.am/p/QBBn6jQPBQ/
< Content-Length: 0
< Server: tfe
< Connection: close
<
* Closing connection #0
Sometimes a short URL refers to another short URL, so the process has to be repeated to reach the final destination. The following examples show functions that expand short URLs. The same process also tests whether the URLs are valid.
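The repeat-until-done logic can be sketched on its own, independent of any HTTP library. This is a minimal sketch under stated assumptions: expand_url(), fake_fetch() and the FAKE_WEB table are hypothetical names, the URLs in the table are made up, and the hop limit and loop check are safeguards that the text does not mention. The set of redirect status codes is also wider than the 301/302 discussed here.

```python
# Status codes treated as redirects (assumption: wider than 301/302 only).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def expand_url(url, fetch, max_hops=10):
    """Follow redirects one hop at a time; return (final_url, hops).

    fetch is any callable mapping a URL to (status_code, location).
    """
    seen = {url}
    hops = 0
    while hops < max_hops:
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            break  # final destination reached
        if location in seen:
            raise ValueError("redirect loop at %s" % location)
        seen.add(location)
        url = location
        hops += 1
    return url, hops


# Hypothetical in-memory table standing in for real HTTP requests.
FAKE_WEB = {
    "http://t.co/abc": (301, "http://bit.ly/xyz"),
    "http://bit.ly/xyz": (302, "http://example.com/article"),
    "http://example.com/article": (200, None),
}


def fake_fetch(url):
    return FAKE_WEB[url]


print(expand_url("http://t.co/abc", fake_fetch))
# ('http://example.com/article', 2)
```

Passing the fetch function as a parameter keeps the loop testable without network access; a real fetch would issue one HTTP request with automatic redirects switched off.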
We use the CURL library (http://curl.haxx.se), which has bindings for a large number of programming languages, including PHP and Python.
Using CURL entails three basic steps:
- Initialize with the target URL
- Set options for the HTTP request, such as the GET or POST method, any payload data, and control parameters. Callback functions that produce data to be sent to the URL, or that process data from the server's response, are declared here as well.
- Execute the HTTP request. This function usually blocks until the response is received. (However, this can be changed.)
The example program below follows these three steps by calling the functions curl_init(), curl_setopt(), and curl_exec().
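Separately from the PHP program, the same three-step pattern can be illustrated in Python. The sketch below does not use CURL; it assumes Python's standard http.client module as a stand-in, and it talks to a small local server so that it runs without network access. RedirectHandler, fetch_once() and the target URL in the Location header are hypothetical names chosen for the illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a short-URL service: always answers 301."""

    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", "http://example.com/long/path")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_once(host, port, path):
    # Step 1: initialize with the target
    conn = http.client.HTTPConnection(host, port, timeout=30)
    try:
        # Step 2: set the options for the request
        method = "GET"
        request_headers = {"Accept": "*/*"}
        # Step 3: execute; getresponse() blocks until the response arrives
        conn.request(method, path, headers=request_headers)
        response = conn.getresponse()
        response.read()  # drain the (empty) body
        # http.client never follows redirects by itself
        return response.status, response.getheader("Location")
    finally:
        conn.close()


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, location = fetch_once("127.0.0.1", server.server_port, "/z9j0Drd7")
server.shutdown()
server.server_close()
print(status, location)
# 301 http://example.com/long/path
```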
CURL can follow redirects automatically, but there are situations where one wants to see what happens in between. In the example, the line
$ret = curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
prevents the library from following the chain of redirections. Instead, we resolve the short URL one step at a time.
#!/usr/bin/env php
<?php
$header = array();
$level = 0;

// Header callback: CURL calls this once per header line.
// "Name: value" lines are stored in $header[$level] under a normalized key.
function getHeader($ch, $data) {
    global $header, $level;
    $hh = explode(":", $data, 2);
    if (count($hh) > 1) {
        $header[$level][str_replace("-", "_", strtolower($hh[0]))]
            = trim($hh[1]);
    }
    // CURL expects the number of bytes that were handled
    return strlen($data);
}

// dummy function, unless we need to store the entire page
function getBody($ch, $data) {
    return strlen($data);
}

// $maxRedirects is an added safeguard against endless redirect chains
function expandURL($url, $maxRedirects = 10) {
    global $header, $level;
    $level = 0;
    do {
        $header[$level] = array();
        $follow = false;
        // Create a curl handle
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'getHeader');
        curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'getBody');
        // Execute
        curl_exec($ch);
        // Check if any error occurred
        $error = curl_errno($ch);
        $info = curl_getinfo($ch);
        $hc = $info["http_code"];
        // Close handle
        curl_close($ch);
        if (!$error) {
            if (isset($header[$level]["location"])) {
                $url = $header[$level]["location"];
                // continue only on a redirect that names a new location
                $follow = ($hc == 301 or $hc == 302);
            }
            $level += 1;
        } else {
            echo "Error: $error\n";
        }
    } while ($follow and $level < $maxRedirects);
    return $url;
}

if ($argc < 2) {
    fwrite(STDERR, "Usage: $argv[0] <short-url>\n");
    exit(1);
}
$expandedURL = expandURL($argv[1]);
echo "URL ".$argv[1]." ---> ".$expandedURL."\n";
$parsed = parse_url($expandedURL);
// host and path may be missing from the parsed URL
$host = isset($parsed["host"]) ? $parsed["host"] : "";
$path = isset($parsed["path"]) ? $parsed["path"] : "";
// e.g. "www.example.com" becomes "com/example/www"
$hostpath = implode("/", array_reverse(explode(".", $host)));
$parsed["hostpath"] = $hostpath;
$parsed["iterations"] = $level;
$parsed["shorturl"] = $argv[1];
$parsed["md5tail"] = md5($path);
print_r($parsed);
?>
The example uses a callback function getHeader() to process the header information. The function is called once for each line in the HTTP header. Most header lines consist of a field name, followed by a ':', and the value. The callback function adds these to the associative array $header. We also need to strip the blank after the colon, which trim() takes care of:
$header[$level][str_replace("-", "_", strtolower($hh[0]))] = trim($hh[1]);
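For comparison, the same normalization can be sketched in Python. This mirrors the idea of getHeader() but is not tied to any CURL binding; parse_header_line() is a hypothetical helper name, and the sample lines are taken from the curl transcript above.

```python
def parse_header_line(line, headers):
    """Store one raw header line in headers; return the length handled."""
    # Split on the first colon only, because values may contain colons.
    name, sep, value = line.partition(":")
    if sep:
        key = name.strip().lower().replace("-", "_")
        headers[key] = value.strip()  # strip() removes the blank and CRLF
    return len(line)


headers = {}
raw_lines = [
    "HTTP/1.1 301 Moved Permanently\r\n",  # status line: no colon, skipped
    "Location: http://instagr.am/p/QBBn6jQPBQ/\r\n",
    "Content-Length: 0\r\n",
    "\r\n",
]
for line in raw_lines:
    parse_header_line(line, headers)

print(headers)
# {'location': 'http://instagr.am/p/QBBn6jQPBQ/', 'content_length': '0'}
```

Splitting on the first colon only matters here, because the Location value is itself a URL that contains colons.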
Much the same can be done with pycurl, the CURL binding for Python. However, the package "human_curl" makes it even easier:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

import human_curl as hurl

url = sys.argv[1]
r = hurl.get(url)
sc = r.status_code
level = 0
while sc == 301 or sc == 302:
    # the location entry holds two space-separated parts; use the second
    locations = r.headers['location'].split(' ')
    url = locations[1]
    r = hurl.get(url)
    sc = r.status_code
    level += 1
print("%s ---> %s (iterations: %d)\n" % (sys.argv[1], url, level))
The location entry in the header includes part of the original URL as well as the redirected URL. We need to split the string and use the second part.
Links:
http://www.php.net/manual/en/function.curl-getinfo.php
http://www.php.net/manual/en/function.get-headers.php
http://stackoverflow.com/questions/472179/how-to-read-the-header-with-pycurl