I'd rather be programming ...: February 2013

Sunday, February 24, 2013

Expanding Short URLs

Short URLs are used to save characters in a message, such as a tweet or an email. They may also be used to create a permanent or easy-to-remember URL for a site whose address may change for some reason.

When a web browser sends a request to the short URL, it usually receives a redirect response with HTTP status 301 (or 302). Most browsers automatically load the redirected location, but your own programs may want to take care of this themselves (see Fetching_Instagram_Pictures).

Offering a short URL service is a great way to collect user data. Not only can those services track visitors to sites that they don't own, they can also place cookies in the visiting browsers.

Here's a quick example:

$ curl -v http://t.co/z9j0Drd7

* About to connect() to t.co port 80 (#0)
*   Trying 199.16.156.11... connected
* Connected to t.co (199.16.156.11) port 80 (#0)
> GET /z9j0Drd7 HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: t.co
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Wed, 26 Sep 2012 02:17:41 GMT
< Cache-Control: private,max-age=300
< Expires: Wed, 26 Sep 2012 02:22:41 GMT
< Location: http://instagr.am/p/QBBn6jQPBQ/
< Content-Length: 0
< Server: tfe
< Connection: close
<
* Closing connection #0
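
The same single request can be sketched in Python with the standard library. This is only a minimal illustration, not part of the original example: the function name fetch_once and the User-Agent string are made up, and http.client is used here because it does not follow redirects on its own.

```python
import http.client
from urllib.parse import urlsplit


def fetch_once(url, timeout=30):
    """Send one GET request and return (status, location) without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        # The User-Agent value is an arbitrary placeholder.
        conn.request("GET", path, headers={"User-Agent": "expand-url-sketch"})
        response = conn.getresponse()
        # A 301/302 is handed back as-is, together with its Location header
        # (None if the server did not send one).
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling fetch_once() with a short URL should return the status code and the redirect target, much like the 301 and Location lines in the curl output above.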

Sometimes, a short URL may actually refer to another short URL. The process needs to be repeated in order to get to the final destination. The following examples show functions that expand short URLs. The same process also tests whether they are valid.
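
Before turning to CURL, the repeat-until-done idea can be sketched independently of any HTTP library. In this hypothetical Python sketch, fetch is assumed to be any function that takes a URL and returns a (status, location) pair. The hop limit of 10 and the extra status codes 303, 307, and 308 are my own choices, and the redirect table is made up so that the example runs offline.

```python
from urllib.parse import urljoin

# 301 and 302 are the common cases; the other codes are also redirects.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def expand(url, fetch, max_hops=10):
    """Follow redirects one hop at a time and return (final_url, hops).

    `fetch` is any callable that maps a URL to (status, location).
    """
    hops = 0
    while hops < max_hops:
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        hops += 1
    return url, hops


# Made-up redirect table standing in for real HTTP requests.
FAKE_REDIRECTS = {
    "http://short.example/a": (301, "http://short.example/b"),
    "http://short.example/b": (302, "/landing"),
}


def fake_fetch(url):
    return FAKE_REDIRECTS.get(url, (200, None))


print(expand("http://short.example/a", fake_fetch))
# prints ('http://short.example/landing', 2)
```

The hop limit keeps the loop from running forever when two short URLs point at each other.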

We use the CURL library (http://curl.haxx.se), which has bindings for a large number of programming languages, including PHP and Python.

Using CURL entails three basic steps:

Initialize with the target URL
Set options for the HTTP request, such as GET or POST method, any payload data, and control parameters. Callback functions for producing data that will be sent to the URL, or for processing data from the server's response, are declared here as well.
Execute the HTTP request. This function usually blocks until the response is received. (However, this can be changed.)

The example program below follows these three steps by calling the functions curl_init(), curl_setopt(), and curl_exec().

CURL can actually follow the redirects automatically, but there might be situations where one wants to see what happens in between. In the example, the line

$ret = curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);

prevents the library from following the chain of redirections. Instead, we resolve the short URL one step at a time.

#!/usr/bin/env php
<?php

// Safety limit against redirect loops (arbitrary value)
define("MAX_REDIRECTS", 10);

// Response headers per redirect level: $header[$level]["location"], ...
$header = array();
$level = 0;

// Header callback: CURL calls this once for every header line
function getHeader($ch, $data) {
    global $header, $level;
    $hh = explode(":", $data, 2);
    if (count($hh) > 1) {
        $header[$level][str_replace("-", "_", strtolower($hh[0]))] = trim($hh[1]);
    }
    // CURL expects the number of bytes handled
    return strlen($data);
}

// dummy function, unless we need to store the entire page
function getBody($ch, $data) {
    return strlen($data);
}

function expandURL($url) {
    global $header, $level;

    $level = 0;
    $hc = 0;
    do {
        $header[$level] = array();

        // Create a curl handle
        $ch = curl_init($url);
        $ret = curl_setopt($ch, CURLOPT_HEADER, 1);
        $ret = curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
        $ret = curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
        $ret = curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        $ret = curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'getHeader');
        $ret = curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'getBody');

        // Execute
        curl_exec($ch);

        // Check if any error occurred
        $error = curl_errno($ch);
        $info = curl_getinfo($ch);
        $hc = $info["http_code"];
        // Close handle
        curl_close($ch);

        if (!$error) {
            if (isset($header[$level]["location"])) {
                $url = $header[$level]["location"];
            }
            $level += 1;
        } else {
            echo "Error: $error\n";
            break;
        }
        // Continue only for a redirect that actually named a new location
    } while (($hc == 301 || $hc == 302)
             && isset($header[$level - 1]["location"])
             && $level < MAX_REDIRECTS);

    return $url;
}

if ($argc < 2) {
    fwrite(STDERR, "Usage: " . $argv[0] . " <short-url>\n");
    exit(1);
}

$expandedURL = expandURL($argv[1]);

echo "URL ".$argv[1]." ---> ".$expandedURL."\n";
$parsed = parse_url($expandedURL);
// Host components in reverse order, e.g. "instagr.am" becomes "am/instagr"
$hostpath = implode("/", array_reverse(explode(".", $parsed["host"])));
$parsed["hostpath"] = $hostpath;
$parsed["iterations"] = $level;
$parsed["shorturl"] = $argv[1];
$parsed["md5tail"] = md5(isset($parsed["path"]) ? $parsed["path"] : "");
print_r($parsed);
?>

The example uses a callback function getHeader() to process the header information. The function will be called for each line in the HTTP header. Most lines in the header start with a parameter, followed by a ':', and the value. The callback function adds these to the associative array $header. We also need to take care of the blank after the colon:

$header[$level][str_replace("-", "_", strtolower($hh[0]))] = trim($hh[1]);
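
For comparison, here is a rough Python counterpart of that header handling. It is a sketch rather than a port of the PHP callback: the function names parse_header_line and parse_headers are made up, and the sample text is taken from the curl output shown earlier.

```python
def parse_header_line(line):
    """Split one raw header line into (name, value), or return None."""
    name, sep, value = line.partition(":")
    if not sep:
        # Status line ("HTTP/1.1 301 ...") or the blank line ending the headers.
        return None
    # Split only at the first colon, so values such as dates stay intact.
    return name.strip().lower().replace("-", "_"), value.strip()


def parse_headers(raw):
    """Collect all "name: value" lines of a raw header block into a dict."""
    headers = {}
    for line in raw.splitlines():
        pair = parse_header_line(line)
        if pair:
            headers[pair[0]] = pair[1]
    return headers


sample = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Date: Wed, 26 Sep 2012 02:17:41 GMT\r\n"
    "Location: http://instagr.am/p/QBBn6jQPBQ/\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)
print(parse_headers(sample)["location"])
# prints http://instagr.am/p/QBBn6jQPBQ/
```

Splitting at the first colon only matters for values that contain colons themselves, such as the Date header or the URL in Location.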

Pretty much the same thing can be done with the CURL binding for Python, pycurl. However, the package "human_curl" makes it even easier:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import human_curl as hurl
import sys

url = sys.argv[1]

r = hurl.get(url)
sc = r.status_code
level = 0
while sc == 301 or sc == 302:
    # the location entry holds two URLs; the second one is the redirect target
    locations = r.headers['location'].split(' ')
    url = locations[1]
    r = hurl.get(url)
    sc = r.status_code
    level += 1

print("%s ---> %s (iterations: %d)\n" % (sys.argv[1], url, level))

The location entry in the header includes part of the original URL as well as the redirected URL. We need to split the string up and use the second part.

Links:
http://www.php.net/manual/en/function.curl-getinfo.php
http://www.php.net/manual/en/function.get-headers.php
http://stackoverflow.com/questions/472179/how-to-read-the-header-with-pycurl

Sunday, February 17, 2013

Uploading files via email

The following article describes a method for posting new documents on a website or, more generally, uploading files to a server for further processing. I often create whiteboard pictures and annotated view graphs in class, and need to post them on the class web site. I found the CamScanner iPhone app particularly useful for taking pictures of the whiteboard. The app finds the edges of the whiteboard, crops the image, and runs keystone correction and other image processing algorithms to enhance the picture. The other tool is Notability on the iPad, which I use to annotate my view graphs in class. (I prefer writing on my iPad over using SmartBoard with its tedious notebook software: the handwriting has to be so big that one can hardly get anything on the board.)

The majority of these iOS apps have a number of ways to get your documents off the device, including Dropbox, Google Drive, and even built-in HTTP servers. However, I chose email because it also supports our departmental document scanner. Furthermore, my inbox fills up daily with announcements of workshops, internships, and other opportunities that I would like to post on my site. Nothing could be easier than hitting the forward button.

Technically, I could email my documents directly to the server. However, enabling sendmail brings a whole bag of responsibilities with it, and negotiating an open port 25 with the IT authorities doesn't seem to be worth the trouble. Instead, the described method uses an external, publicly accessible email server, like GMAIL. For my project I set up a dedicated GMAIL account, though one could also use one's regular account and fetch emails from a particular folder (or label).

To get started, one needs a Linux box, fetchmail, procmail, and the nmh package. These should be available in every Linux distribution; in many cases they're already installed.

The basic fetchmail configuration is explained in http://www.daemonforums.org/showthread.php?t=5590, and this blog post, http://badcherry.wordpress.com/2006/03/30/fetchmail-without-sendmail/, shows how to get around the sendmail daemon.

Here's the setup for CentOS 5:

Install the packages:

$ yum -y install fetchmail procmail nmh

Create a user account under which the emails will be processed. I wouldn't use my regular user account, although that is possible. In this example, the user account is "adriaan".

Create a .fetchmailrc file to test the connection to GMAIL:

poll imap.gmail.com protocol IMAP 
   user "xxxxxxx@gmail.com" is adriaan here
   password 'mysecretpassword'
   fetchlimit 1
   keep
   ssl

Test with

$ fetchmail -v -m '/usr/bin/procmail -d adriaan'

When everything works, we change the configuration to:
```
poll imap.gmail.com protocol IMAP 
   user "xxxxxxx@gmail.com" is adriaan here
   password 'mysecretpassword'
   fetchlimit 1000
   ssl
```
The fetchlimit may prevent disaster if too many emails suddenly arrive at this account. We removed the "keep" option, so from now on, mails will be removed from GMAIL once they are fetched. By default, fetchmail only retrieves unread messages.

The next step is creating a script for downloading (and processing) the emails. The script, /home/adriaan/bin/getProcessMail.sh, could look something like this:

#!/bin/bash
#
fetchmail -m '/usr/bin/procmail -d adriaan' 
inc -file /var/spool/mail/adriaan -truncate +inbox
# this is just collecting ... need to process ...

The MH tools will be used to separate email messages into individual files. There are even tools to extract attachments. Having the email messages in separate files makes processing them easier. However, one may consider deleting the files once their content has been processed. In order to use MH for the first time, run the command:
```
$ install-mh
```
We need to run the download script every ten minutes. Use the crontab -e command to edit the user's cron table, and add the following line:
```
*/10 * * * * /home/adriaan/bin/getProcessMail.sh
```

Now the email messages will be automatically saved on our system, and we're ready to process them. Anybody can send emails to the account. If this is not desired, the processing script may first check the sender's address and dismiss all messages that didn't originate from a list of approved senders. Alternatively, one could achieve the same with GMAIL's mail filters.
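
A sender check along those lines might look like the following Python sketch. The addresses and the function name is_approved are placeholders. Keep in mind that a From: header is easy to forge, so this only filters out accidental or casual senders.

```python
from email.utils import parseaddr

# Hypothetical allow-list; replace with the addresses that may post.
APPROVED_SENDERS = {"me@example.edu", "scanner@example.edu"}


def is_approved(from_header, approved=APPROVED_SENDERS):
    """Return True if the address in a From: header is on the approved list."""
    _name, address = parseaddr(from_header)
    return address.lower() in approved
```

For example, is_approved("Some Name <me@example.edu>") is true with the list above, while a message from any other address would be dismissed.
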
The MH package (http://www.nongnu.org/nmh/) has a number of tools to deal with the messages, headers, and attachments.
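
As an alternative to the MH command-line tools, the saved messages can also be read from a script. The following Python sketch uses the standard mailbox and email modules to pull attachments out of an MH folder. The function name save_attachments and the naming scheme for the output files are my own, and the folder location (for example ~/Mail/inbox) depends on what was chosen during install-mh.

```python
import email
import mailbox
import os


def save_attachments(mh_folder, out_dir):
    """Save every attachment found in an MH folder; return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    box = mailbox.MH(mh_folder, create=False)
    for key in box.keys():
        # Each message is a separate file named by its number (1, 2, ...).
        message = email.message_from_bytes(box.get_bytes(key))
        for part in message.walk():
            filename = part.get_filename()
            if not filename or part.is_multipart():
                continue  # body text or container, not a plain attachment
            # Prefix with the message number and drop any directory part,
            # so that an odd filename cannot end up outside out_dir.
            name = "%s-%s" % (key, os.path.basename(filename))
            target = os.path.join(out_dir, name)
            with open(target, "wb") as handle:
                handle.write(part.get_payload(decode=True) or b"")
            saved.append(target)
    return saved
```

A call such as save_attachments(os.path.expanduser("~/Mail/inbox"), "/tmp/attachments") could then run at the end of the download script, with both paths adjusted to the local setup.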