approaching infinity

sporadic writings from aaron maxwell


Friday, November 14, 2008

Does it take money to make money?


Let's talk about a phrase you have heard before:

"It takes money to make money."

Like many aphorisms, it must be handled with great care. It can be amazing that the words you listen to can affect what you believe, and thus how you act.

"It takes money to make money." One could use this idea to keep themselves poor, reasoning that they can never afford to become wealthier than they are now.

I would prefer that this not happen to you. So let me suggest a different wording that may be more profitable.

"It takes SOME money to make MORE money."

And very importantly,

"You can create SOME money out of nothing."

Is this dual aphorism true? Just as true as the original, perhaps more.

If you have money, can you utilize it to create even more? Yes, absolutely. Can you create money out of thin air? Yes, there are many ways. One way is to sell your time and labor to those who will pay for it. There are other ways.



Friday, October 3, 2008

Django and Lighttpd: one niggling fastCGI detail

If you are setting up a Django website using the lighttpd web server, one of the easiest ways to configure it is via fastcgi, which lighttpd has built-in support for. As I write this, the official Django FastCGI docs do a very good job of explaining everything, provided you read the whole document. (Don't just skip to the lighttpd section - relevant info is contained earlier, perhaps even in the Apache sections.)

The docs skimp on one detail that happened to affect me, however, which I'd like to document here in hopes that it will save you some time: the FORCE_SCRIPT_NAME setting. At first, when I called up the admin site and clicked the submit button to log in, I got a 404 error for a URL like "/mysite.fcgi/admin/". Here mysite.fcgi (not its real name :) is the URL prefix I configured for the fastcgi rewrite rule in the lighttpd configuration.

After much research and fruitless tweaking of the config files, I thought to look at the source of the admin login page (which served just fine). Turns out the action attribute of the login form was set to "/mysite.fcgi/admin/", not "/admin/" or ".", like it should be. At the tail end of the above docs, I got a clue as to the cause. Long story short, defining FORCE_SCRIPT_NAME to the empty string in settings.py solved the problem.
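For reference, the fix boils down to one line. This is only a fragment; the rest of settings.py is whatever your project already has.

```python
# settings.py (fragment)
# Tell Django not to prepend the FastCGI script name (e.g. "/mysite.fcgi")
# to the URLs it generates, such as the admin login form's action attribute.
FORCE_SCRIPT_NAME = ""
```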

There... hope this saves someone an hour or two!

Tuesday, July 15, 2008

Crash course in Applied Functional Programming

If you're a software developer, something interesting has been happening in your field lately. Mainstream programming languages are starting to gain features from what used to be a decidedly academic domain: so-called functional programming.

Now, I'm sure you are a programmer just like me, who works in C or C++ or Java or Python or Ruby or Javascript or, God help you, PHP. And also just like me, you probably spend much of your free time reading math books.

What? Oh, right, that's not normal - I keep forgetting. The point is, more and more, languages that you actually use are picking up some rather theoretical ideas from computer science, and incorporating them very directly into the syntax of the language.

The debates about adding closures to Java, the growing popularity of languages like Python, Ruby and Javascript, and even some developments in C and C++ are all part of this. The problem for most of us is that we've been busy for years writing software that controls traffic signals, or calculates your paycheck deductions with all the decimal points in the right places, or otherwise doing important and useful stuff and doing it right, and the grown-up languages we use for that lean heavily on imperative constructs. We get used to those constructs, and they become our only hammer for every nail in our toolbox of stretched analogies. I mean, programming idioms.

And so we code our minds relentlessly into a deeper and deeper rut. Is there no hope? Perhaps not; you might as well hang up your keyboard and get a job stocking groceries or something.

Or, you can keep reading below, increasing your charisma, intelligence, and attractiveness to your choice of gender(s) as you learn some of the basics. It's a bit math heavy, but it'll be worth it! Or maybe not. Only one way to find out.

Map, Filter, Reduce

Three central concepts of functional programming are "map", "filter" and "reduce". Map is an operation that takes a function mapping a domain D to a range R, and a list of elements of D. It then produces a list of elements of R.

map(f, [d0, d1, d2...]) -> [f(d0), f(d1), f(d2)...]
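Here is a small runnable version of that. It is a sketch in Python 3, where map() hands back a lazy iterator rather than a list; square is just a made-up example function.

```python
def square(x):
    return x * x

# In Python 3, map() returns a lazy iterator, so wrap it in list() to see the values.
squares = list(map(square, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]
```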

Filter takes a function mapping some domain D onto the range [True, False], and a list of elements from D, and produces a subset of this input list:

filter(is_even, [1, 2, 7, -4, 3, 12.0001]) -> [2, -4]
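To make that runnable you need an is_even. One plausible definition (my assumption of what it would look like), again in Python 3:

```python
def is_even(n):
    # A number is even when dividing by two leaves no remainder.
    return n % 2 == 0

evens = list(filter(is_even, [1, 2, 7, -4, 3, 12.0001]))
print(evens)  # [2, -4]
```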

Map produces a list with the same number of elements as the input list. (Or same cardinality, for a nonfinite list.) Filter will produce a subset (sublist) of the input set (list). In contrast, the reduce operation always produces a single element.

Reduce will take an operator - i.e., a function of two elements that produces a single element of the same type. It will also take a list of two or more of these elements. (You can extend it to accept empty and single-element lists by specifying a default value; but let's ignore that for now.)

Reduce works as follows: take the first two elements; apply them to the operator, to get a single result; apply this result and the next element to the operator, to get a new result; and repeat, until there are no elements left, and you get a single final result. The time-honored "hello world" version of reduce is applying the addition operator to a list of numbers:

reduce(add, [1, 2, 4, 8]) -> 15
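If you want to see the moving parts, the steps described above can be sketched by hand. my_reduce is a made-up name for illustration; the real thing is a builtin in Python 2 and lives in functools in Python 3. Like the description, this sketch ignores the empty-list case.

```python
from functools import reduce
from operator import add

def my_reduce(op, items):
    """Fold items left to right with a two-argument operator."""
    iterator = iter(items)
    result = next(iterator)      # start with the first element
    for item in iterator:        # combine the running result with each next element
        result = op(result, item)
    return result

print(my_reduce(add, [1, 2, 4, 8]))  # 15
print(reduce(add, [1, 2, 4, 8]))     # 15
```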

One place I've used reduce in production code is when creating SQL queries with a bunch of criteria ANDed together. It surprised the heck out of me when I realized that PHP includes some support for functional programming. Say you define a function join_by_dash() like this:

// in PHP
function join_by_something($a, $b, $what) {
    if('' == $a) {
        return $b;
    }
    if('' == $b) {
        return $a;
    }
    return $a . $what . $b;
}

function join_by_dash($a, $b) {
    return join_by_something($a, $b, "-");
}

So, join_by_dash('sonic', 'youth') would return 'sonic-youth'. Then you can join a bunch together with the array_reduce() function:

function join_by_comma($a, $b) {
    return join_by_something($a, $b, ", ");
}
$items = array('goat', 'termite', 'horse');
$listing = array_reduce($items, "join_by_comma");
// $listing is now "goat, termite, horse"

I've used this in constructing SQL queries like so (for, say, your groundbreaking web-next-dot-oh website):

function join_by_and($a, $b) {
    return join_by_something($a, $b, " and ");
}
$conditions = array(
    "age<$age",
    "gender=$gender",
    "iq<$max_iq",
    );
$where_clause = array_reduce($conditions, "join_by_and");

More commonly, if you code in PHP for any length of time, you have probably used the implode() function, which is a specialized kind of reduction in disguise:

$where_clause = implode(" and ", $conditions);

Reducing with join_by_*() and using implode() give the same result here (they differ only when an element is the empty string, which join_by_something() skips). You might imagine ways that statistics over a population could be calculated using reduce.
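The same comparison can be sketched in Python, where str.join plays the role of implode(). The condition strings below are made-up values for illustration, and the seed of "" stands in for PHP's behavior of always starting from a seed.

```python
from functools import reduce

def join_by_and(a, b):
    # Mirror of the PHP helper: skip empty operands so no stray " and " appears.
    if a == "":
        return b
    if b == "":
        return a
    return a + " and " + b

conditions = ["age<30", "gender='f'", "iq<100"]  # hypothetical values
by_reduce = reduce(join_by_and, conditions, "")
by_join = " and ".join(conditions)
print(by_reduce)             # age<30 and gender='f' and iq<100
print(by_reduce == by_join)  # True
```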

(The above is a heuristic example. So you nice people with the pitchforks there, please forgive that I did not protect against SQL injections, etc. I promise that in real life I did. Thank you.)

Map and filter are, of course, eminently parallelizable. Reduce CAN be, depending on the operator.
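What "depending on the operator" means, roughly, is that the operator has to be associative. A minimal sketch: chunked_reduce (a hypothetical helper) reduces each chunk on its own, which is the part that could be farmed out to separate workers, and then reduces the partial results. No actual parallelism happens here; the point is only whether regrouping changes the answer.

```python
from functools import reduce
from operator import add, sub

def chunked_reduce(op, items, chunk_size):
    """Reduce each chunk independently (these could run in parallel),
    then reduce the partial results."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)

nums = [1, 2, 4, 8, 16, 32]
print(chunked_reduce(add, nums, 2) == reduce(add, nums))  # True: addition is associative
print(chunked_reduce(sub, nums, 2) == reduce(sub, nums))  # False: subtraction is not
```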

(Oh, and by the way, there are slightly different ways to implement reduce. The check for '' == $a is necessary because of PHP's notion of reduce, but not in all other languages. See the appendix for details.)

Iterators and Generators

You probably know what an iterator is; it's an idiom or device many computer languages have for swooping through a list element by element. Generators are very similar. Simply, a generator is a device that generates an iterator. What's the difference? Well, say you have some ordered set of elements you want to process. Maybe finding the maximum of the set. In C, you would cycle through it like this:

extern int scores[HOWMANY]; // initialized elsewhere
int i;
int high_score = 0;
for(i=0; i < HOWMANY; ++i) {
    if(scores[i] > high_score) {
        high_score = scores[i];
    }
}

This demonstrates iteration, but not an iterator, which is meant to provide an abstract interface for cycling through the elements of a collection. Java has built-in iterators, but if it did not, one way to implement one would be like this:

class ScoreSourceIterator {
    private int[] scores = {};
    private int position = 0;

    public ScoreSourceIterator(String file_name) {
        // open the file for reading
        // then, fetch the scores, storing in this.scores
        this.load_all();
    }

    private void load_all() {
        // read every score from the file into this.scores
        // (details omitted)
    }

    public boolean isEmpty() {
        // arrays expose length as a field, not a method
        return position == scores.length;
    }

    public int next() {
        if (isEmpty()) {
            // Signal we're out of scores by returning 0.  (Yes, we're
            // assuming that all scores are positive.)
            return 0;
        }
        return scores[position++];
    }
}

// elsewhere...
public int get_high_score(String scores_file) {
    int high_score = 0;
    int score;

    ScoreSourceIterator ss = new ScoreSourceIterator(scores_file);
    while ((score = ss.next()) > 0) {
        if (score > high_score) {
            high_score = score;
        }
    }
    return high_score;
}

(By the way, please excuse any horrifying flaws in my purported Java code. I'm way out of practice with the language, and have not even verified that my examples compile. If it just hurts too much to ignore, pretend that it's a fictional language named Flava, a proprietary-cum-open platform whose marketing centers around taste puns and emeritus hip-hop artists.)

A problem with iteration is that you have to load the whole array. What happens in the C example if HOWMANY is 10**9, or you're parsing some similarly massive game score database? That's the problem generators solve. A generator behaves just like an iterator, except that rather than slurping in enough data to cause a core dump, it automatically loads reasonable pieces of data at a time, refreshing as needed:

class ScoreSourceIterator {
    // Generator, actually!
    private int[] buffered_scores = {};
    private int position = 0;

    public ScoreSourceIterator(String file_name) {
        // open the file for reading
        // ...
        // then, fetch the next N elements
        this.load();
    }

    protected int load() {
        int num_read = 0;
        position = 0;
        // read in the next set of scores
        // store it in this.buffered_scores
        // ...
        return num_read;
    }

    public boolean isEmpty() {
        // true once the current buffer has been used up
        return position == buffered_scores.length;
    }

    public int next() {
        int n;
        // are we empty? if so, fill up
        if (this.isEmpty()) {
            n = this.load();
            if (0 == n) {
                // we're really all out
                return 0;
            }
        }
        return buffered_scores[position++];
    }
}

Anyway, this is more memory efficient than schlurping in the list all at once. You're probably quite familiar with this pattern. So much, in fact, that now I realize it was silly to spend so many bytes creating this example. Sigh.

Oh well... at least now I can talk about how Python does all this. Python has strong language support for iteration. You can do things like:

for num in list_of_integers:
    x = 1.0 / num
    print "No divide by zero error... YET!"

The token list_of_integers is basically an iterator. It's a list too; lists in python know how to act like an iterator. In the simple C example, you had to specify how you go through the list. In the python example, though, you don't worry about the "how" - just the "what". The "how" of iterating is encapsulated in, and handled by, the iterator itself.

(That's almost the definition of a high level language, by the way: if it lets you program by specifying what instead of how.)

Suppose you have a pressing problem: you need, NEED to know whether there are more zeros or ones on your hard drive. There's just no way around it. You HAVE to know.

And yet, like most modern computers, your main hard drive is about two orders of magnitude larger in capacity than the system's memory.

Stupid RAM cartels! Why can't they constantly be in cutthroat competition, like those hard drive manufacturers!

Fortunately, you have a way out of this conundrum. Wielding Python (version 2.4 or higher), you shrewdly write a generator function that tears off memory-sized chunks of chewy hard drive data:

def allbits():
    # the bit buffer
    bits = []
    while True:
        if len(bits) == 0:
            # read in some reasonable number of bits
            # (read_next_chunk is a placeholder for the actual disk read)
            bits = read_next_chunk()
        yield bits.pop()

The yield keyword is a little interesting. What it does is return the value there... and the next time a value is requested from the generator that allbits() created, instead of starting at the beginning, it picks up at the next statement after yield. Of course, since the yield is in that inescapable while loop, it will return bits.pop(), then bits.pop(), then bits.pop(), ... pausing only to refresh the bit buffer as needed.
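A tiny example makes the pause-and-resume behavior easier to see. Strictly speaking, calling a generator function just builds a generator object; it is each request for the next value that runs the body up to the next yield. countdown is a made-up example.

```python
def countdown(n):
    while n > 0:
        yield n   # hand one value back and pause right here
        n -= 1    # execution resumes on this line at the next request

gen = countdown(3)   # calling the function only creates the generator object
first = next(gen)
second = next(gen)
rest = list(gen)     # drain whatever is left
print(first, second, rest)  # 3 2 [1]
```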

So you can do this:

excess_ones = 0
for bit in allbits():
    if 1 == bit:
       excess_ones += 1
    else:
       excess_ones -= 1

("But wait!" you cry. "The state of a hard drive is in constant flux. This method won't give you a count that is accurate at a particular instant in time." To which I reply, "Shut it before I kick you!!")
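If you want something you can actually run, here is one way the bit generator could be fleshed out. This is a sketch under assumptions: it takes any binary stream with a read() method instead of the raw drive, the chunk_size parameter is my own addition, and an in-memory buffer stands in for the hard drive.

```python
import io

def allbits(stream, chunk_size=4096):
    """Yield every bit of a binary stream, reading chunk_size bytes at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return  # out of data: the generator simply ends
        for byte in chunk:
            for shift in range(7, -1, -1):  # most significant bit first
                yield (byte >> shift) & 1

# An in-memory stand-in for the hard drive: 0xFF is eight ones, 0x01 is a single one.
fake_drive = io.BytesIO(bytes([0xFF, 0x01]))

excess_ones = 0
for bit in allbits(fake_drive, chunk_size=1):
    excess_ones += 1 if bit == 1 else -1
print(excess_ones)  # 2
```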

(Also, maybe you are wondering how the generator signals that it's out of bits. I won't go into it, but just trust me that there is a way that Python has for handling this: the generator will simply terminate, just like when you read the last element of a vector. Do a web search for "python StopIteration" if you're curious about the mechanism.)
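For the curious, the mechanism looks like this when you drive a generator by hand. two_bits is a made-up example; a for loop does the same thing and quietly swallows the exception for you.

```python
def two_bits():
    yield 1
    yield 0

gen = two_bits()
seen = [next(gen), next(gen)]
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True  # a for loop catches this for you and just stops
print(seen, exhausted)  # [1, 0] True
```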

List Comprehensions

Coincidentally, this ties into map, reduce and filter, through something called list comprehensions. Python has pretty strong support for this. (It's not the first language with the feature, of course - that honor, as far as I know, belongs to Icon.) A list comprehension is a device (idiom) for constructing lists. Sounds simple, but it turns out to be pretty deep, because of how powerful it can be for certain programming patterns.

A commonly used built-in Python function is range(). It's called like range([start, ]stop[, step]):

# [0, 1, 2]
x = range(3)
# [2, 3, 4, 5]
y = range(2,6)
# [0, 2, 4, 6, 8]
z = range(0,10,2)

So if you want a list of numbers 1 to 10:

nums = range(1,11)

But what if you want the list 1.5, 2.5, ... 10.5? You can do this:

nums = [x + 0.5 for x in range(1,11)]

That is a list comprehension. It is a mechanism for constructing a list. Let's get fancier:

positions = [4.7*sin(2*PI*x-0.17) + 2.2 for x in angles]

Or if you have a Person class, with a name attribute,

names = [person.name for person in people]

Now, notice that "for person in people" looks somewhat similar to "for bit in allbits()" above. They are in fact quite related. In each case, iteration over a list is taking place. people is a fully formed list, meaning the whole contents have been calculated and stored in memory. It's not the product of a generator, which lazily calculates the values as they are needed. But you can certainly use a generator in list comprehensions.

Say you have, with utmost care, constructed a file containing many integers, as text, one per line. And you want a list of these numbers raised to the fourth power. God knows why; maybe you just like wasting time doing pointless things. In any event, construct your list in one line like so:

  quads = [int(line)**4 for line in open("ints.txt", "r")]

Get that? The built-in python function open(), for file reading, knows how to act like a generator. It will read the lines one at a time from disk (or N at a time - it's optimized appropriately), and return them one by one in sequence. int() obviously converts that to an integer, which is then quadded. (squared, cubed... quadded?)
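One caveat: the square brackets still build the whole result list in memory. If all you need is to pass over the values once, a generator expression (round brackets, available since Python 2.4) keeps things lazy from end to end. A small sketch, with an in-memory StringIO standing in for open("ints.txt"):

```python
import io

# Stand-in for open("ints.txt"); any iterable of lines behaves the same way.
lines = io.StringIO("1\n2\n3\n")

quads_lazy = (int(line) ** 4 for line in lines)  # nothing has been read yet
total = sum(quads_lazy)                          # lines are consumed one at a time here
print(total)  # 98
```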

You can apply map to generated lists:

  first_bits = map(extract_first_bit, open("really_long_lines.txt"))

And filter:

  expensive_cars = filter(is_not_cheap, [Car(car_data) for car_data in open("cars.txt")])

In fact, most languages that do list comprehensions let you do filtering inline. You do it in Python using the trailing "if" clause here:

junk_cars = [Car(car_data)
             for car_data in open("cars.txt")
             if "Ford" in car_data]  # ooh burn!

(Incidentally, Car is a class, and Car(car_data) creates a Car object. Python omits the "new" keyword of C++, Java, etc.)
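To see that the inline "if" really is filter in disguise, here is a side-by-side sketch. The car data and the two helper functions are made up for illustration, and plain strings stand in for Car objects.

```python
car_lines = ["Ford Pinto", "Honda Civic", "Ford Edsel"]  # hypothetical data

def is_ford(line):
    return "Ford" in line

def shout(line):
    return line.upper()

with_comprehension = [shout(line) for line in car_lines if is_ford(line)]
with_map_filter = list(map(shout, filter(is_ford, car_lines)))

print(with_comprehension)                     # ['FORD PINTO', 'FORD EDSEL']
print(with_comprehension == with_map_filter)  # True
```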

And you can use reduce with a generator too:

  # may overflow, but won't run out of memory
  bitsum = reduce(add, allbits())

It gets even more interesting when you start using lambda functions, which adroitly leads us into the section titled...

Lambda Functions

Nah. Actually, I won't write about lambda functions. This is enough for now.

Whew! Let's see, that took... a bit over two hours of my life to write. For some value of "a bit". Not too bad, considering I would have otherwise wasted that time mindlessly flipping between the Discovery and Sci-Fi Channels. So, I hope you have enjoyed reading it, if indeed you have read this far without gouging out either your eyes or your monitor. Let me know if you have questions (and are foolish enough to encourage me) or, more likely, corrections ("Java doesn't have freestanding functions, you git!")

Gotta go. There's an episode starting of "Man Vs. Wild" I've only seen three times.

Appendix: Flavors of Reduce

About ten billion words ago, did you wonder why the PHP function join_by_something could not be shorter, like this:

function join_by_something_whynot($a, $b, $what) {
    return $a . $what . $b;
}

Well, the reduce operation has a fuzzy definition at its edges, and different languages provide reduce functions that behave in slightly different ways. Most let you specify a seed value, which will be used as the first argument in the first step. So in python, for example:

>>> import operator
>>> a = [1, 2, 3]
>>> b = [10]
>>> reduce(operator.add, a)
6
>>> reduce(operator.add, b)
10
>>> # now specify an initial value
... initial_value = 10
>>> reduce(operator.add, a, initial_value)
16
>>> reduce(operator.add, b, initial_value)
20
>>>

Some languages also use the seed if you use an array having fewer than two elements, using a sensible default depending on the array type (0 for numbers, etc.) Let's use this PHP function in an example:

// string join by dash
function sjoind($a, $b) {
    return $a . '-' . $b;
}

Now if you invoke array_reduce(array('Z'), 'sjoind'), the result is '-Z'. Makes sense, right? The array only has one element, but the reducing function takes two, so PHP uses its default seed value (the empty string).

Now try array_reduce(array('Z', 'Z'), 'sjoind'). What would you expect the result to be? Think a second... Ok, time's up. Did you guess 'Z-Z'? WRONG! It's actually '-Z-Z'. Can you think of why?

It's because of how the PHP creators decided to implement array_reduce. In some languages, the reduce operation only uses the initial value when working with an array of fewer than two elements. PHP, however, always uses its seed value, in effect prepending it to the array. That's why join_by_something has to check its arguments.

(It's actually even more arcane than that: the seed value is actually the integer zero, which is just cast to an empty string in this context. Boy, now you are TOTALLY set to win the Geek Edition of Trivial Pursuit.)
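You can imitate the PHP behavior in Python by passing an explicit seed, which Python also treats as if it were prepended to the list. A sketch, with sjoind translated to Python:

```python
from functools import reduce

def sjoind(a, b):
    return a + "-" + b

# Python without a seed: the first two elements are combined directly.
no_seed = reduce(sjoind, ["Z", "Z"])

# With an explicit seed, the seed is used first, mimicking the PHP result above.
seeded_two = reduce(sjoind, ["Z", "Z"], "")
seeded_one = reduce(sjoind, ["Z"], "")
print(no_seed, seeded_two, seeded_one)  # Z-Z -Z-Z -Z
```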

Wednesday, April 23, 2008

Three Things I Love About Git

git-commit --amend

One of the benefits of using version control is organization. When you commit a set of changes, you are implicitly marking that set as defining a certain feature set, or otherwise having something in common.

Have you ever committed something, then realized you forgot to include something with it? Maybe it's a new file, or maybe it's an extra small change you forgot to make.

Enter git-commit --amend. The amend option adds to the tip of the current branch, and actually replaces the last commit with the combined change! It even seeds the message editor with the last commit's description, so you don't have to type it in again.

git-checkout -b

I often like to work in branches, even if what I'm implementing is not large enough to make a branch-commits-merge cycle really necessary. I find that making frequent small commits, without needing to think about whether the product is fully stable each commit, helps me focus and develop more quickly and cleanly.

In most VCSs, branching is something that takes a bit of effort, and for maximum ease, you want to plan it ahead of time. Sometimes I start developing, and suddenly decide what I'm working on is going to be big enough that I want to use a branch. I'm now forced to decide whether I want to make the effort to "backport" the work done so far.

With git-checkout -b, it becomes a very easy decision. This command says "create a new branch, based off the last commit, and transfer my working copy to this branch in-place, so I can commit to the branch as I please." I just execute git-checkout -b new_branch_name and keep going, with no interruption to my train of thought or work flow.

(Bonus: merging is relatively painless in git, especially compared to most centralized version control systems. Which just makes the ease of branching even more valuable.)

Fine-grained commits

Like I mentioned, one valuable benefit of using version control is organization: by making intelligent commits, code content diffs are automatically organized by feature set.

Sometimes, when I'm ready to commit, I have changes I want to include mixed in with some I do not. Or, I decide it would make more sense to split the current set of changes across several commits.

An example from this week: I was editing a Makefile, and realized it contained three sets of changes that really had nothing to do with each other. Had the make file been split up into several physical files, perhaps it would have been trivial to put into separate commits, but in reality it was just one physical file.

Most VC tools I know of can resolve commit changes only down to the individual file level - they don't get fine grained enough that you can tell them, "commit this part of this file, and that part of that file, but not these other parts". You can always hack around it, by manually editing and reverting, etc. But that's often neither easy nor quick.

But lo! I have a friend in git-commit --interactive. This starts up a shell that allows you to selectively add individual changes, what git calls "hunks", to the staging area for committing. In this case, the Makefile had three of these hunks. You can split up any hunk into smaller hunks, if you want to control what you're committing with even finer granularity. You can review the change set as you build it, adding, deleting or refining changes as you like. When satisfied, I exit, and the change set is automatically committed.

That's for the simple case of one file. It can be an even more powerful tool with a larger change set.

(By the way, if you don't want the automatic commit, just use git-add --interactive instead. It otherwise works the same way.)

Wednesday, January 16, 2008

High-level Language Features and Testing

When I first started doing test-driven development as a PHP coder, our development shop used Marcus Baker's excellent SimpleTest framework. I liked it a lot. Since then I've used unit test frameworks in C, Perl, Java, and Python, and SimpleTest is still my overall favorite in any language.

As I became more obsessed with (er, interested in) automated testing, however (reading books and blog articles, experimenting with new testing patterns, getting xUnit tattoos), I sometimes felt frustrated. Often I would want to write some kind of test in the framework and language, but one or both was just not powerful enough to express the idea cleanly.

It wasn't until I started coding a lot in Python that the cause hit me. Most xUnit frameworks, particularly if they provide good mock objects, are more than adequate in themselves to support any testing pattern I could come up with. SimpleTest certainly is. The problems I ran into came from the language itself.

Now, I don't mean to complain about PHP. Well, maybe a little. Okay, fine. I can't stand PHP. But really, it's a solid, practical language for web apps. It's also a high level and dynamically typed one, which makes many tests a lot easier to write and, importantly, parameterize – certainly easier than in a language like C, or even Java.

And I must admit that Python isn't perfect either, though that's hard, being totally infatuated with the language right now like I am. Oh, Python! You're so dreamy.

Ahem. Python has two high level features that PHP does not support well: functions as first-class objects, and closures. Once I started heavily coding in Python, I began to discover ways to use these features to create nicely powerful test cases.

You probably already have an idea of what is meant by "functions as first-class objects". Basically, if a language lets you treat a function as an object that you can manipulate as easily as any boring class instance, it supports this concept well. You can assign the function to variables, pass it as a parameter to a method, or even create a function dynamically and return it from another function. Some languages support all this better than others. Functional languages like Scheme and Haskell do first-class functions about as well as any language today. Javascript too, believe it or not. C does very slightly, in that you can sling around function pointers; also Java, if you are willing to write unholy code.
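A quick Python sketch of those three moves (assigning, passing, and returning functions). All the names here are made up for illustration.

```python
def shout(text):
    return text.upper() + "!"

def apply_twice(func, value):
    # func is an ordinary parameter that happens to hold a function
    return func(func(value))

def make_adder(n):
    # build a new function on the fly and return it
    def adder(x):
        return x + n
    return adder

yell = shout                           # assign a function to a variable
print(yell("hello"))                   # HELLO!
print(apply_twice(make_adder(5), 1))   # 11
```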

You may not know about closures. In essence, a closure is a function that is evaluated in a certain context, with a certain set of local variables; and then remembers that context even if it is invoked in a different environment. It's a function that sort of carries an external memory with it, in a way that seems a little spooky at first. One canonical example is a derivative function:

      # f is some numerical function.
      # dx needs to be close to zero, but not zero.
      def derivative(f, dx):
          def df_dx(x):
              return (f(x+dx) - f(x)) / dx
          return df_dx
    

df_dx is a closure. It's a numerical function of one argument, x, which yields (in a very bad approximation) the derivative value of f(x). And it will continue to work correctly even if used in a wholly different scope of the code where f and dx are not lexically visible.
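Here it is in use, as a small sketch. The square function and the particular dx are my own choices for illustration.

```python
def derivative(f, dx):
    def df_dx(x):
        return (f(x + dx) - f(x)) / dx
    return df_dx

def square(x):
    return x * x

d_square = derivative(square, 1e-6)
# Neither f nor dx is visible at this point, yet d_square still remembers both.
slope = d_square(3.0)
print(round(slope, 3))  # 6.0
```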

(Here's a good article explaining first-class functions and closures more. Favorite quote: "Ruby is a good language for demonstrating features that ought to be in Java.")

Right now I'm the QA engineer at SnapLogic. It's a great place to be if you want to work with a herd of engineers who are all about ten times smarter than you are. Keeps my ego in check. Anyway, a good example of what I'm talking about is in our SnapLogic Python API, which can be conveniently compared to the SnapLogic PHP API. (pop and browse source: Python, PHP) These two client libraries provide a simple interface for accessing SnapLogic resources. They are very similar, both in their method signatures and underlying algorithms. In fact I implemented them in parallel, adding a feature to one, then translating it into the other.

(Switching rapidly between thinking in PHP and thinking in Python is a trip, by the way. Felt dizzy for those few days. All I can say is, good thing we do code reviews.)

My process was, naturally, to usually translate the tests first. And here is where I got a good demonstration of how closures and first-class functions come in handy.

Consider a method that does something like this:

      def get(self, url):
          ...
          response = urllib2.urlopen(url)
    

(This is actually a real example.) urllib2.urlopen is a function that performs an HTTP GET request on a URL. (Python syntax: The function's name is "urlopen", and it is in the "urllib2" module.) Since this is supposed to be a unit test, not an integration test, I don't want it to actually ping anything across the network. The answer is to make get() accept a mock function:

      def get(self, url, urlopen=urllib2.urlopen):
          ...
          response = urlopen(url)
    

(There are other ways to do this. It would probably have been better to make urlopen a class property in this case, rather than polluting get()'s method signature. I blame my code reviewer.)
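For what it's worth, the class-attribute alternative might look something like the sketch below. It is written for Python 3, where urllib2.urlopen has become urllib.request.urlopen, and Agent is a hypothetical class name, not the real SnapLogic client. No network request is made, because the test swaps the opener out before calling get().

```python
import urllib.request  # Python 3 home of what the text calls urllib2.urlopen

class Agent(object):  # hypothetical class name
    # The opener is a class attribute; staticmethod keeps it from being
    # bound to the instance when looked up through self.
    urlopen = staticmethod(urllib.request.urlopen)

    def get(self, url):
        response = self.urlopen(url)
        return response

def mock_urlopen(url):
    return "fake response for " + url

agent = Agent()
agent.urlopen = mock_urlopen  # instance attribute shadows the class default
result = agent.get("http://example.com/")
print(result)  # fake response for http://example.com/
```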

That urlopen parameter accepts a function object. By default, it uses the "real" one, urllib2.urlopen. So when called in real code, it looks like this:

      foo.get(url)
    

But I can also pass in a different, fake function, that only pretends to do what the urllib2 version does. Thus, in the test case, I invoke get() like this:

      foo.get(url, mock_urlopen)
    

where I earlier define that last parameter like this:

      def mock_urlopen(url):
          ...
          return some_mock_response_object
    

Pretty neat, huh? Problem is, you can't do this in PHP! Nor in many other languages. Well, okay, technically you can, if the life of you and everyone you love depends on it. But it's not pretty. In practice, you would take another approach, because the friction coefficient is just too high.

I'm not picking on PHP. It's the same in most languages you've heard of.

Actually, though, this is hardly even a mock function in the purest sense, because there's no business logic, no built-in testing of actual behavior. It does not even validate the uri parameter in any way.

At SnapLogic here, we try to use mocks a lot in our unit tests, mainly because I won't shut up about it. Normally we use the superb PyMock library. I actually didn't use any mock library for the SnapLogic Python client lib, for two reasons. One is that I wanted people to be able to run the tests without having to install another library. (As you know, because you've downloaded it already!) Another reason is that the mocking needs were not overly complicated, and so it was easy to implement something elegant on my own.

So what do I want here? I want to be able to pass a special function into the get() call that will be used as the URL opener. I want it to be able to expect to receive a particular value or values for the uri it is invoked on, and then return a particular value. And I'm allergic to boilerplate, so I want to generate these functions programmatically for different test cases.

Let's say the return value is an instance of a class called MockResponse. (urllib2 actually uses another mechanism, instantiating a class named OpenerDirector. But trust me, it's way more complicated than you'd be interested in hearing about. So I just made a mock response class.)

A mock function that does not check its input may look like this:

      def mock_urlopen(uri):
          return MockResponse()
    

Doesn't do much. But one of my requirements is that it raise a commotion if it doesn't get the uri parameter it should. So we can add that:

      def mock_urlopen(uri):
          if "http://google.com/search?q=why+python+programmers+are+sexy" != uri:
              raise Exception("Your code doesn't work, bonehead")
          return MockResponse()
    

We also want to configure the MockResponse object somehow. Now, since I'm not an idiot, I don't want to code N different mock_urlopen variants for N test cases. So I write a function to generate functions:

      def mk_mockurlopen(expected, req):
          def mockurlopen(uri):
              if expected != uri:
                  raise Exception("Maybe programming isn't your thing.")
              return req
          return mockurlopen
    

So far, so good. I can now do something like this:

      for item in testcase_data:
          mockurlopen = mk_mockurlopen(item['expected_uri'], item['mock_response'])
          result = agent.get(item['input_uri'], urlopen=mockurlopen)
          self.assertEqual(result, item['expected_result'])
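
To see the loop run end to end, here is a self-contained sketch. The get() function, the base URL, the read() method, and the test data are all stand-ins invented for illustration; they are not the SnapLogic client's actual code.

```python
class MockResponse(object):
    """Canned response; the read() method is an assumption of this sketch."""

    def __init__(self, body):
        self.body = body

    def read(self):
        return self.body


def get(rel_uri, urlopen, base="http://foobar.com"):
    # Hypothetical code under test: build the full uri, then fetch it
    # through whatever opener the caller injected.
    return urlopen(base + rel_uri).read()


def mk_mockurlopen(expected, req):
    # Same factory as above: the returned function checks its input.
    def mockurlopen(uri):
        if expected != uri:
            raise Exception("expected %r, got %r" % (expected, uri))
        return req
    return mockurlopen


testcase_data = [
    {"input_uri": "/alpha",
     "expected_uri": "http://foobar.com/alpha",
     "mock_response": MockResponse("first"),
     "expected_result": "first"},
    {"input_uri": "/alpha/beta",
     "expected_uri": "http://foobar.com/alpha/beta",
     "mock_response": MockResponse("second"),
     "expected_result": "second"},
]

results = []
for item in testcase_data:
    mockurlopen = mk_mockurlopen(item["expected_uri"], item["mock_response"])
    results.append(get(item["input_uri"], urlopen=mockurlopen))

print(results)
```

Each dictionary describes one case, so adding a test means adding data rather than writing another mock function.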
    

There is one thing that could be better. The tests above run within unittest, Python's xUnit library, which ships in the standard distribution. The self.assertEqual call is made in a test method of a unittest.TestCase instance. Right now, though, the check on the uri value inside the mock urlopen is done in a very crude way, by just throwing an exception. By inspecting the stack trace, we can figure out what is happening, but it's much more convenient to integrate the check into the unit test framework, so that the harness does the work of tracking down which assertion failed where.

That's easy enough:

      # A test case
      class TestOfGet(unittest.TestCase):
          # Utility to create a mock urlopen function, attached
          # to this case's context
          def mk_mockurlopen(self, expected, req):
              def mockurlopen(uri):
                  # The assertion runs against the enclosing TestCase
                  self.assertEqual(expected, uri)
                  return req
              return mockurlopen

          # Now, use it in one or more actual tests
          def test_of_some_important_thing(self):
              ...
              mockurlopen = self.mk_mockurlopen(expected_uri, mock_request_object)
              ...
    

This is a closure. A function is created in the execution context of the unit test framework. It's invoked in a completely different context. The context from the assertion is still accessible to it, though. In fact, that's essentially where the assertion is made.

See why this is better than just throwing an exception? The unittest module, like all full xUnit libraries, includes an integrated reporting facility. This abstraction layer adds a lot of value, allowing different front ends, IDE integration, and so on. Creating this closure allows us to apply a precisely targeted, specific test at an important place deep in the code, while allowing it to be integrated in the reporting hooks with no effort on our part.

Ain't that sweet?

The stack trace when it fails looks like this:

Traceback (most recent call last):
  File "test/tests.py", line 512, in testmain
    actual = sr.count(td['rel_uri'], _urlopen=my_mock_urlopen)
  File "/home/amax/src/snaplogic/trunk/Packages/SnapLogic-py/lib/snappy/SnapLogicAgent.py", line 141, in count
    req = _urlopen(full_uri)
  File "test/tests.py", line 24, in mockurlopen
    self.assertEqual(expected, uri)
AssertionError: 'http://foobar.com/alpha/beta?sn.count=records' != 'http://foobar.com/alpha/beta'
    

Note that this specifies which test case failed (the one on line 512 of test/tests.py), where in the production code things went wrong (line 141 of SnapLogicAgent.py), and what the precise failure was (the missing sn.count GET parameter).

There's something beautiful about it. It's days like this that I love being an engineer.

Friday, June 15, 2007

Using Gnu Emacs With SCMUtils

If you are interested in both physics and computer science, there is a real treasure of a book you owe it to yourself to check out. It's called Structure and Interpretation of Classical Mechanics. The full text is online at that link, and you can also buy a physical copy. I won't describe it further in this post; good summaries can be found in the reviews on Amazon.

Accompanying the text is an open-source numerical and symbolic math library called "scmutils", meant to be run in MIT's version of Scheme. (download and install instructions) MIT Scheme includes an Emacs-like editor and execution environment called "Edwin", which acts as an interface to the whole system. Edwin has some nice features, including a user-friendly debugger and symbol completion.

Most of the time, however, I prefer to use full-fledged Gnu Emacs as the interface. Setting this up is actually pretty simple, once you have downloaded and installed scmutils. Just include the following in your .emacs file:

      (defun mechanics ()
        (interactive)
        (run-scheme
          "ROOT/mit-scheme/bin/scheme --library ROOT/mit-scheme/lib"))

Replace ROOT with the directory in which you installed the scmutils software. (Remember to replace it in both places. If it is installed differently on your system, just make sure the string has the form "/path/to/mit-scheme --library /path/to/scmutils-library".) Restart Emacs (or use C-x C-e to evaluate the mechanics defun), and launch the environment with the command M-x mechanics.

There are a few nice Emacs and scheme-mode features you can use. A handy way to work in the environment is with two buffers open at once. One buffer will be in the mechanics environment (i.e. the Scheme interpreter) launched with the mechanics command; we'll refer to this as the "mechanics" buffer. The other buffer will be open to a file containing Scheme/Scmutils code.

The idea is that you write your code in the file, save it, then tell the Scheme process to load the file. You do this by switching to the mechanics buffer, then typing C-c C-l. (That's a lowercase "L", not the digit one.) The first time, you will be asked for the file to load and must specify it. It is remembered on subsequent invocations, so after making a change to the file and saving, you can just switch to the mechanics buffer and type C-c C-l [ENTER] to reload. It's very simple and fast.

In the mechanics buffer, which happens to be in the normal Emacs scheme interaction mode, a command history is kept. You can cycle up and down through the command history with C-up and C-down. If you make a mistake and get to the error prompt, or start a calculation that is taking too long, you can abort and return to the normal prompt by typing C-c C-c. Other useful commands can be found with M-x describe-mode.