OSCON 2007, Day 3

Today’s notes will be a bit more free-form. Now that the tutorial days are over and the main conference has begun, there are more sessions – and less time to write!


Tim O’Reilly raised the question of openness beyond source code. This felt a bit amorphous, but he did have a good point that when software is a service, availability of source code is not the whole story – if Google gave you their source, you couldn’t do anything with it. I can’t decide whether this is a real insight or a “duh”.

Spikes in the Google Trends graph for various bleeding-edge keywords seem to correspond with O’Reilly conferences, Tim notes with a smile.

Things Tim mentioned that he thought people should be aware of: Ohloh.net, Foxmarks, Hadoop, StumbleUpon, Intel's Threading Building Blocks, and open source hardware in general.

Nat Torkington pointed out that not all contributions to humanity can be done in the form of source code, and plugged this effort to raise money for charity: http://ossx.org/r.

A guy from Intel announced their Threading Building Blocks open source release, which abstracts parallelism to a higher level than pthreads et al. It's C++ template-based, and works with C too. Not really part of my world.

Simon Peyton Jones, of Haskell fame, talked a bit about parallel programming. "I'm here to sell you a concept: parallel programming is essential." See below for my notes on his afternoon session, which was much more detailed than this short keynote talk.

Mark Shuttleworth was interviewed by Tim O’Reilly. Mostly softballs. Shuttleworth did make an interesting comment to the effect that Launchpad is really a stopgap until we have federated development tracking systems that can coordinate activities across projects. This is something I’ve been thinking about a bit lately.

Below are my notes on various sessions I attended throughout the day. My raw notes were of widely varying quality and depth, and in rewriting from notes to blog-post I'm sure I didn't manage to clean up every rough edge. Also, one or two sessions I attended and enjoyed without managing to take any useful notes at all.

Session: Andy Lester, Managing Technical Debt

This talk was about treating your technical decisions like financial decisions: what's the cost now vs. the cost later? I missed the first few minutes, but I did come in time for an interesting thread on how to sell your boss on technical improvements that don't have a user-facing deliverable attached. Because it just doesn't fly to say, "I'm going to do a bunch of work that you don't understand that won't really have any visible effect." Some solutions: quantify the improvements in turnaround and quality, and quantify the decrease in risk. Relate the work to similar (or cautionary) stories in the organization's or department's past to help make the case concrete.

A problem noted from the audience: it's tempting to be the hero by delivering too fast, or by delivering something incomplete that works now but may cause you trouble later – that's accumulating technical debt, and you shouldn't do it.

Session: Simon Peyton Jones, Nested Data Parallelism in Haskell

Disclaimer: I’m one of those English-majors-gone-programmer, not somebody with a lot of background in mathematics. I loved this talk, but much of the material was a stretch for me, so there may be small or massive errors in my recapitulation. Corrections and clarifications are more than welcome.

Simon Peyton Jones is at the high end of the scale for both smarts and enthusiasm, so the pace of this talk was brisk. Luckily, he was also very encouraging of questions – in fact, he demanded them!

The basic distinction he made at the outset was between flat data parallelism, in which you apply a sequential operation to a flat mass of data (divide a huge array into ten chunks and run the same algorithm on each, then use locks to synchronize the recombination at the end), and nested data parallelism, which is recursive and does not require you to predict where you’re going to need the parallelism.
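The talk's examples were in Haskell, but the flat case is easy to sketch in Python. This is my own toy illustration, not code from the talk: split a flat array into fixed chunks, apply the same sequential operation to each chunk concurrently, and combine the partial results at the end (here `pool.map` handles the join, standing in for the lock-based recombination he described).

```python
from concurrent.futures import ThreadPoolExecutor

def flat_parallel_sum(xs, n_chunks=4):
    """Flat data parallelism: divide a flat mass of data into fixed
    chunks, run the same sequential operation on each chunk in
    parallel, then recombine the partial results at the end."""
    size = max(1, len(xs) // n_chunks)
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum, chunks))  # same op on every chunk
    return sum(partials)  # the recombination step

print(flat_parallel_sum(list(range(100))))  # → 4950
```

The weakness he pointed out is visible even here: the chunking is decided up front, which is exactly what nested data parallelism avoids.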

Some examples he gave of the latter: sparse arrays, quicksort, divide and conquer algorithms, graph algorithms, physics engines for games, machine learning, optimization, constraint solving.

Data Parallel Haskell attacks the problems by a combination of a compact notation for parallel combination and a clever compiler that knows how to parallelize the operations.

An example he walked through: a set of sparse matrices (think of a sparse matrix as a list of index, value pairs) is hard to parallelize manually, because the vectors may be wildly different lengths, you may have more cores than vectors, etc.
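To make the representation concrete, here is a minimal Python sketch (my own, not from the talk) of a sparse vector as a list of (index, value) pairs, with a dot product against a dense vector – the kind of per-row operation whose wildly varying cost makes manual parallelization hard:

```python
def sparse_dot(sv, dense):
    """Dot product of a sparse vector, stored as a list of
    (index, value) pairs, with an ordinary dense vector."""
    return sum(value * dense[index] for index, value in sv)

sv = [(0, 3.0), (4, 2.0)]           # a mostly-zero vector of length 5
dense = [1.0, 1.0, 1.0, 1.0, 10.0]
print(sparse_dot(sv, dense))        # → 23.0
```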

One approach is to use a flattening transformation: string all the vectors together, so you have one long sequence (with some bookkeeping to know where they begin and end), then you can chop this up into equal-size pieces, and then each piece gets farmed out to a core or processor.
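As a rough Python sketch of that flattening idea (in Data Parallel Haskell the compiler does this transformation for you; this just illustrates the data layout): concatenate the ragged vectors into one flat array, keep a segment-length descriptor as the bookkeeping, and chop the flat array into equal-size pieces for the cores.

```python
def flatten(vectors):
    """Flattening transformation: string ragged vectors together into
    one flat array, plus segment lengths as the bookkeeping that
    records where each original vector begins and ends."""
    data = [x for v in vectors for x in v]
    lengths = [len(v) for v in vectors]
    return data, lengths

def chunk(data, n):
    """Chop the flat array into n roughly equal pieces, each of which
    can be farmed out to a core or processor."""
    size = max(1, -(-len(data) // n))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data, lengths = flatten([[1, 2], [3], [4, 5, 6, 7]])
print(data)            # → [1, 2, 3, 4, 5, 6, 7]
print(lengths)         # → [2, 1, 4]
print(chunk(data, 2))  # → [[1, 2, 3, 4], [5, 6, 7]]
```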

He also walked through a data-parallel quicksort example. (This makes the canonical Haskell quicksort look a little more practical – the non-parallel version has been criticized for not being a "true" quicksort because it doesn't sort the items in place.)
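For readers who haven't seen it, the canonical filter-style quicksort translates directly into Python (this is the general shape, not his Haskell code): the two recursive calls are independent of each other, which is what makes the algorithm a natural fit for nested data parallelism – each sub-sort could in principle go to a different core.

```python
def quicksort(xs):
    """Filter-style quicksort: partition around a pivot, then sort
    the two partitions recursively and independently."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # → [1, 1, 2, 3, 4, 5, 6, 9]
```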

Beyond the flattening method, he also talked about "fusion", wherein you interleave the generation and consumption of values. That's hard to do in an imperative language, where you don't necessarily know what kind of side effects may be involved in the generation or consumption loops on their own. This is where "purity pays off", as he says; both flattening and fusion depend on purely-functional semantics.
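A very loose imperative analogue (my comparison, not his) is a Python generator feeding a consumer: values are produced one at a time as they are consumed, so no intermediate list is ever built. The difference is that in Python the programmer opts into this manually, whereas GHC's fusion is a compiler transformation justified by purity.

```python
def squares(n):
    """Producer: yields values lazily instead of building a list."""
    for i in range(n):
        yield i * i

# Consumer: sum() pulls one value at a time, so generation and
# consumption are interleaved and the intermediate list of squares
# never exists in memory.
total = sum(squares(5))
print(total)  # → 30
```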

He went on to say that the data-parallel languages of the future will be functional languages, even if they are embedded in an imperative environment.

The cool thing about the examples he showed is that they are not using a special-purpose data-parallel compiler – it's just the bleeding edge of GHC, and a usable prototype will be out this year.

Though he was cautious, he said that performance seems to scale linearly with the number of processors.

"Data parallelism [is] the only way to harness hundreds of cores."

Session: Joe Gregorio, Atom publishing protocol

The Atom Publishing Protocol (“AtomPub” is the preferred abbreviation now) is a “core protocol”, which (I think) means that it is not intended to endlessly expand.

The pieces of a core Atom entry: title, author, id, date published, content
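As a quick sketch of what those pieces look like in practice, here is a hypothetical Python helper (mine, not from the talk) that assembles a minimal Atom entry with the standard library's ElementTree:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def make_entry(title, author, entry_id, published, content):
    """Build a minimal Atom entry with the core pieces:
    title, author, id, date published, content."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}published").text = published
    ET.SubElement(entry, f"{{{ATOM_NS}}}content").text = content
    return ET.tostring(entry, encoding="unicode")

print(make_entry("Hello OSCON", "Jane Doe", "urn:uuid:1234",
                 "2007-07-25T00:00:00Z", "Notes from day 3."))
```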

As of yesterday, AtomPub is a proposed standard and will have an RFC number. (The Atom syndication format itself is already RFC 4287.)

Very RESTful:

  • POST an entry, you get back a “201 Created”
  • If-Match (etags support)
  • Accept-Encoding: gzip

AtomPub allows you to differentiate between the fetchable media item and the editable one (e.g. if your stuff is distributed across mirrors).

Google is using AtomPub heavily in GData: blogger, calendar, picasa, spreadsheets.

OpenSearch is a RESTful protocol for searching. Joe expects an OpenSearch extension to AtomPub.

Current count of implementations: 15 clients, 16 servers

Session: Luke Kanies, Puppet

Puppet is a tool for specifying and applying system configurations. You create a definition in a Ruby-ish DSL.

It’s loosely coupled – you can swap out specific components if you don’t like/need them.

Configurations are idempotent – you don’t have to keep track of whether you just ran the config or not.
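To illustrate what idempotent means here (a toy Python sketch of my own, not Puppet code): applying a resource twice leaves the system in the same state as applying it once, so re-running the config is always safe.

```python
import os
import tempfile

def ensure_line(path, line):
    """Idempotent resource: make sure a config line is present in a
    file. Only touches the file if the line is actually missing."""
    existing = ""
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read()
    if line not in existing.splitlines():
        with open(path, "a") as f:
            f.write(line + "\n")
        return "changed"
    return "unchanged"

path = os.path.join(tempfile.mkdtemp(), "app.conf")
print(ensure_line(path, "port=8080"))  # first run applies the change
print(ensure_line(path, "port=8080"))  # second run is a no-op
```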

You run the “facter” tool to gather and record relevant info about the system (client) the tool is running on.
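The flavor of what facter collects can be approximated in a few lines of Python (a loose analogue I wrote for illustration; the real tool gathers far more, and these fact names are my own):

```python
import platform
import socket

def gather_facts():
    """Toy analogue of Puppet's facter: collect a few facts about
    the host this is running on, as a name → value dict."""
    return {
        "kernel": platform.system(),
        "hostname": socket.gethostname(),
        "architecture": platform.machine(),
        "python_version": platform.python_version(),
    }

for name, value in sorted(gather_facts().items()):
    print(f"{name} => {value}")
```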

There is not currently a repository of recipes or OS profiles, but it’s desired.

There are plans to move the "what kind of system am I" information (configuration) out into a separate component, e.g. so that Nagios can do different kinds of monitoring depending on what kind of system it is.

On the client you can enable reporting, and messages about events will be sent back to the server.

Question from the audience: What about Capistrano? Capistrano is largely an application management tool, not a system management tool.

I'm definitely going to look at this – I don't manage a lot of servers, but it sounds like it would be a cool tool for keeping the configuration of a development server in sync with the live server.