Limiting concurrent execution using GCD

Soroush Khanlou recently wrote The GCD Handbook, a cookbook for common Grand Central Dispatch patterns. It’s a great set of patterns, with code examples.

While the code in the Limiting the number of concurrent blocks section is a great example of semaphores, I wondered whether it could be improved – I’ll explain why below. Soroush and I emailed back and forth a little, and came up with the following.

In the example, Soroush shows how to use GCD semaphores to limit the number of blocks that are concurrently executing in a dispatch queue. The key part is the enqueueWork function:

func enqueueWork(work: () -> ()) {
    dispatch_async(concurrentQueue) {
        dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER)
        work()
        dispatch_semaphore_signal(semaphore)
    }
}

The problem I saw here, which Soroush also notes, is that this approach starts a potentially unbounded number of threads, which are immediately blocked by waiting on a semaphore. Obviously GCD will limit you at some point, but that’s still a lot of work and a decent chunk of memory. While this code is necessarily simplified to introduce this use of semaphores, the bunch of waiting threads needled me.

To achieve effects like this with queue-based systems, I often find I need to combine more than one queue. Here, in the solution Soroush and I arrived at, we need two queues to reach a more efficient solution which requires only a single blocked thread.

We use a concurrent queue for executing the user’s tasks, allowing as many concurrently executing tasks as GCD will allow us in that queue. The key piece is a second GCD queue. This second queue is a serial queue and acts as a gatekeeper to the concurrent queue. We wait on the semaphore in the serial queue, which means that we’ll have at most one blocked thread when we reach maximum executing blocks on the concurrent queue. Any other tasks the user enqueues will sit inertly on the serial queue waiting to be executed, and won’t cause new threads to be started.

import Cocoa

class MaxConcurrentTasksQueue: NSObject {

    private let serialq: dispatch_queue_t
    private let concurrentq: dispatch_queue_t
    private let sema: dispatch_semaphore_t

    init(withMaxConcurrency maxConcurrency: Int) {
        serialq = dispatch_queue_create("uk.co.dx13.serial", nil)
        concurrentq = dispatch_queue_create(
            "uk.co.dx13.concurrent",
            DISPATCH_QUEUE_CONCURRENT)
        sema = dispatch_semaphore_create(maxConcurrency)
    }

    func enqueue(task: () -> ()) {
        dispatch_async(serialq) {
            dispatch_semaphore_wait(self.sema, DISPATCH_TIME_FOREVER)
            dispatch_async(self.concurrentq) {
                task()
                dispatch_semaphore_signal(self.sema)
            }
        }
    }

}

To test this, I created a Swift command line application with this code in main.swift:

import Foundation

let maxConcurrency = 5
let taskCount = 100
let sleepFor: NSTimeInterval = 2

print("Hello, World!")

let q = MaxConcurrentTasksQueue(withMaxConcurrency: maxConcurrency)
let group = dispatch_group_create()

for i in 1...taskCount {
    dispatch_group_enter(group)
    q.enqueue {
        print("Task:", i)
        if sleepFor > 0 {
            NSThread.sleepForTimeInterval(sleepFor)
        }
        dispatch_group_leave(group)
    }
}

dispatch_group_wait(group, DISPATCH_TIME_FOREVER)

print("Goodbye, World!")

Running this, Hello, World! should be printed, followed by Task: N in batches of five (or whatever you set maxConcurrency to), followed by a final Goodbye, World! before the application succumbs to the inevitability of termination.

Even here there is an interesting use of GCD. In order to stop the app terminating before the tasks have run, I needed to use dispatch groups. I hadn’t really used these before, so I’ll refer you to Soroush’s explanation of dispatch groups at this point; the above is hopefully straightforward once you’ve read that.

I uploaded an example project for this code to my GitHub account. It’s under Apache 2.0; hopefully it comes in handy sometime.

More can be... more

“Less is more”. It’s a frustrating phrase. Less is not more; by definition, it never can be. But sometimes less is better. On the other hand, sometimes more is better. Mostly, from what I can tell, there are some things where less is better and others where less is worse. Often there’s a level which is just right: more is too much, and less is too little. Salt intake falls into that bucket.

“Less is more” is a paraphrasing of a whole book, The Paradox of Choice, written by Barry Schwartz. It’s become a bit of a mantra, one often deployed without thought: “this page is cluttered, we need to remove stuff; less is more, dude”. This closes a conversation without exploring the alternatives.

It’s looking increasingly, however, like the idea is either false or, at the very least, that the book-length discussion is more appropriate than the three-word version.

Iyengar and Lepper’s jam study from back in 2000, where all this stuff came from, is coming under fire: a whole bunch of other studies don’t find the same effect. This is all described in more detail in an article in the Atlantic.

A logical conclusion of less is more is that one is enough. For shopping, at least, one appears to be too few. This relates to the topic of framing. My reading is that if you only have one item in a given category in a store, it’s easy to worry that the price of that item is too high. Introduce even one more, and the pricing is framed. Add a more expensive item and you will increase sales of your previously lonely cheaper item: a framing effect will suddenly make your original item seem better value. The implication being that, without the price framing, customers will feel the urge to look elsewhere to be sure they have a good deal.

Looking online for articles that Schwartz has written, things seem to come back to this one from the HBR in 2006:

Choice can no longer be used to justify a marketing strategy in and of itself. More isn’t always better, either for the customer or for the retailer. Discovering how much assortment is warranted is a considerable empirical challenge.

That is, less can be better, but sometimes more can be better; and it’s expensive, but sometimes you just have to pay the price to find out.

Selecting a HAProxy backend using Lua

Once you’ve learned the basics of using Lua in HAProxy, you start to see a lot of places the scripting language could be useful. At Cloudant, one of the places we saw that we could make use of Lua was when selecting from the various backends to which our frontend load balancers direct traffic. We wrote a simple proof of concept, which I wanted to document here along with some of the problems we hit along the way.

Say we wanted to choose a backend based on the first component of the request path (i.e., a in /a/something/else). We actually don’t do this at Cloudant, but it is a simple, not-quite-totally-trivial demo.

When using HAProxy 1.5, you’d do something like this:

frontend proxy
  ... other settings ...

  # del-header ensures that we're using 'new' headers
  http-request del-header x-backend
  http-request del-header x-path-first

  http-request set-header x-path-first %[path,word(1,/)]

  acl is_backend_set hdr_len(x-backend) gt 0
  acl path_first_a req.hdr(x-path-first) -m str a
  acl path_first_b req.hdr(x-path-first) -m str b

  http-request set-header x-backend a if path_first_a !is_backend_set
  http-request set-header x-backend b if path_first_b !is_backend_set
  http-request set-header x-backend other if !is_backend_set

  http-request del-header x-path-first

  use_backend %[req.hdr(x-backend)]

backend a
  ...

backend b
  ...

backend other
  ...

In outline, this code uses a couple of temporary headers to store the first path component and the backend we choose, combined with ACLs as guards to make sure that the right ordering priority is used for backends. In particular, the is_backend_set ACL prevents us always using the other backend.

This is fairly concise, but in my experience gets complicated quickly. Moreover, it hides the fact that the logic is essentially an imperative if...else if...else statement.

Thankfully, HAProxy 1.6 introduces both variables and Lua scripting, which we can use to make things clearer and safer, if not particularly shorter.

Variables

We can use variables to replace the use of headers for temporary data. Setting and retrieving looks like this:

http-request set-var(req.path_first) path,word(1,/)
acl path_first_a var(req.path_first) -m str a

This isn’t any shorter, but it does reduce the chance of a malicious request slipping in a header that affects processing.

Variables all have a scope: req variables are only available during HAProxy’s request phase; res variables only during the response phase; and txn variables are stored with the transaction and available in both.
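As a sketch of why txn scope is useful (the response header name here is made up for illustration), a variable set during the request phase can still be read while building the response:

```
frontend proxy
  # Request phase: record the first path component in a txn-scoped variable
  http-request set-var(txn.path_first) path,word(1,/)

  # Response phase: the txn variable is still in scope here, so we can
  # expose its value in a response header; a req.-scoped variable could
  # not be read at this point
  http-response set-header x-path-first %[var(txn.path_first)]
```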

Lua

Variables are nice, but are a fairly straightforward feature. Lua allows us to get a bit more interesting. Instead of the header/acl dance, we can now write the backend-switching logic more explicitly.

Assuming that we put the Lua code in a file called select.lua alongside the HAProxy configuration file:

global
  lua-load select.lua
  ... other settings ...

frontend proxy
  ... other settings ...

  # Store the backend to use in a variable, available in both request
  # and response (txn-scope)
  http-request set-var(txn.backend_name) lua.backend_select()

  # Use the backend_name txn variable
  use_backend %[var(txn.backend_name)]

Here, we use a Lua sample fetch function. Sample fetch is an HAProxy term for any function – whether in-built or written in Lua – that processes the HTTP transaction and returns a value calculated using the transaction details; a Lua function registered as a sample fetch is automatically passed those details as an argument.

The backend returned is put into a variable in case it’s needed elsewhere. A txn scoped variable can be used in both request and response phases; using one, you could add a header to the response containing the chosen backend, for example. If this wasn’t needed, you could put the backend_select fetch directly into the use_backend line.

Warning: one thing we found when trying out this code is that we couldn’t do what we used to and store the return value from the Lua code in an HTTP request header. If we did, for some reason HAProxy returned a 503 status code; that is, the use_backend statement appeared to be trying to use a non-existent backend. Swapping to a variable fixed this.

The Lua code contained in select.lua ends up being straightforward:

-- Work out the backend name for a given request's HTTP path
core.register_fetches("backend_select", function(txn)

  -- txn.sf contains HAProxy's in-built sample fetches, like the HTTP path
  local path = txn.sf:path()
  local path_first = string.match(path, '([^/]+)')

  if path_first == 'a' then
    return 'a'
  elseif path_first == 'b' then
    return 'b'
  else
    return 'other'
  end
end)

In outline:

  1. core is a class exposed globally by HAProxy. One of the uses of core is to register Lua functions for use in HAProxy. The register_fetches call registers our sample fetch under the name backend_select. The sample fetch is a Lua function, declared inline in the call.
  2. The first part of the sample fetch function uses the txn argument. HAProxy provides this argument automatically to all Lua functions registered as sample fetches. The txn argument provides access to both the request context and a lot of the in-built HAProxy fetches for accessing data from the request. We use one of the fetches, path, to retrieve the path.
  3. We take the first part of the path using Lua’s match function, which we can make perform a split-like behaviour.
  4. Finally, we can do the if/else statement and return the backend name to use.
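The split-like use of match in step 3 is easier to see in isolation; in any Lua 5.x interpreter, the pattern grabs the first run of non-slash characters:

```lua
-- '([^/]+)' captures the first maximal run of characters that aren't '/',
-- which for an HTTP path is the first path component
print(string.match('/a/something/else', '([^/]+)'))  -- a
print(string.match('/b', '([^/]+)'))                 -- b
print(string.match('/', '([^/]+)'))                  -- nil (no component)
```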

For me, after learning the basics of Lua, the most complicated part was figuring out what’s available on the txn variable. The Lua documentation directs you towards the standard HAProxy documentation, but I found it hard to work out quite the right Lua code to access the fetches HAProxy exposes – probably because I was unfamiliar with terms like sample fetch when I started this proof of concept, and because I’m new to Lua.

And there you have it. Once you get the right code, it’s quite short, but it took a few days to figure out all the moving parts from scratch.

Ruby & Couch

It’s a long weekend this week in the UK. I wanted to learn a bit more Ruby, so I decided to use the time to start writing a client library for CouchDB. Basically my day job at Cloudant, but in Ruby.

I first used Ruby back in about 2005, and this site was powered by a couple of Ruby incarnations: first a Ruby on Rails app for a time; then a fairly hokey static site generator. I think that lasted until around 2009 when I learned Python and switched to Google AppEngine. Even with this experience, I don’t know Ruby particularly well – I have never used it full time – but I think the library has come out okay so far.

The client is fairly low-level, which is my preference for clients, though not everyone’s. One sets up a client, then makes requests with it. Each type of request – GET _all_docs, PUT /database/document and so on – is represented by its own class, an idea Soroush Khanlou calls templating. We also used this approach for Cloudant’s Objective-C client library and it seemed a good approach; this Ruby library builds on lessons learned there.
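The templating idea can be sketched in a few lines. These class and method names are illustrative, not rubycouch’s actual API: each request type is a tiny class that knows its own HTTP method and path, and the client resolves any such template against its root URI before executing it.

```ruby
require 'uri'

# Hypothetical request templates: each knows only its method and path.
class AllDbs
  def http_method; :get; end
  def path; '/_all_dbs'; end
end

class GetDocument
  def initialize(doc_id)
    @doc_id = doc_id
  end
  def http_method; :get; end
  def path; "/#{@doc_id}"; end
end

# A client holds the server's root URI and turns templates into requests.
class TemplateClient
  def initialize(root)
    @root = URI.parse(root)
  end

  # Resolve a template against the root URI. A real client would go on
  # to execute the request and wrap the response.
  def uri_for(template)
    URI.join(@root.to_s, template.path)
  end
end

client = TemplateClient.new('http://localhost:5984')
client.uri_for(AllDbs.new).to_s
# => "http://localhost:5984/_all_dbs"
```

The nice property is that adding a new API endpoint means adding a new small class, without touching the client at all.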

require 'rubycouch'

client = CouchClient.new(URI.parse('http://localhost:5984'))
response = client.make_request(AllDbs.new)
response.json
# => ["_replicator","_users","animaldb",...]

It’s got some neat features. Most things can be streamed rather than read into memory. I tried to pick something useful for each request, but aside from views, I ended up just providing the option to stream the data to a block.

However, some are a bit cleverer. I like the views implementation, which sends each result to a block:

get_view = GetView.new('views101', 'latin_name')
client.database('animaldb').make_request(get_view) do |row, idx|
  # => 0: {"id"=>"kookaburra", "key"=>"Dacelo novaeguineae", "value"=>19}
  # and so on. `row` is always decoded JSON. idx just tends to be useful.
end.json
# => {"total_rows"=>5, "offset"=>0,"rows"=>[]}

I certainly learned a lot about Ruby writing this. Right now the library is pretty incomplete in terms of API coverage, but is quite usable for simple projects – and importantly should be easy to add and contribute to. Perhaps I’ll be able to take the time to polish it up. I hope I can. Meanwhile, it should be fairly simple to get to grips with if you want to try it.

Find it on GitHub.

Using a Cloudant API Key with Multiple Cloudant Databases and Accounts

If you use Cloudant’s dashboard to generate API keys and assign permissions, you’d be forgiven for thinking that an API key can only be used for one database. However, that’s just an unfortunate implication of the UI. In fact, you can give an API key permission to read or write any number of databases – even ones on different Cloudant accounts.

The key to doing this is to use Cloudant’s HTTP API rather than the dashboard. Here’s how.

1. Generate an API key

First, generate an API key using your Cloudant account’s admin credentials:

curl -XPOST -u mikerhodes 'https://mikerhodes.cloudant.com/_api/v2/api_keys'
{
  "password": "1dd24951d0cde9839abc094c1c49d49965908d23", 
  "ok": true, 
  "key": "blemanitillyindstooksong"
}

The key to understanding what we’re about to do is to think of this API key as a lightweight Cloudant user. Unlike an account, this user doesn’t have its own databases, but it can be granted permissions on any other database within Cloudant.

The key and password in this example are, of course, made up.

2. Assign the API key permissions on database(s)

Next, assign the API key permissions on a database. Permissions define what a request bearing that API key can do; for example, the _reader permission allows reading documents, amongst other things.

We’ll start with a database on the mikerhodes account where we generated the API key, animaldb. To do this, you need to retrieve the database’s _security document and modify it. First, get it:

curl -u mikerhodes 'https://mikerhodes.cloudant.com/animaldb/_security'
{
  "cloudant": {
    "bouninamendouldnimendepa": [
      "_reader"
    ],
    "mikerhodes": [
      "_reader",
      "_writer",
      "_admin",
      "_replicator"
    ],
    "nobody": []
  },
  "_id": "_security"
}

Let’s pull apart this security document:

  1. The cloudant field is where you assign permissions to Cloudant API keys and users. Each sub-field maps a username (or API key) to an array of permissions for that user.
  2. The nobody field ensures that anonymous users have no permissions.
  3. The mikerhodes field grants my admin account all permissions. This is a bit redundant as the account admin gains all permissions on all databases by default.
  4. The bouninamendouldnimendepa field is an API key to which I’ve already granted the _reader permission.

To assign new permissions, add the new API key’s username/key to the cloudant field along with the permissions we want to give it. Here’s how to grant it _reader and _replicator:

curl -XPUT -d @- -u mikerhodes 'https://mikerhodes.cloudant.com/animaldb/_security'
{
  "cloudant": {
    "bouninamendouldnimendepa": [
      "_reader"
    ],
    "blemanitillyindstooksong": [
      "_reader", "_replicator"
    ],
    "mikerhodes": [
      "_reader",
      "_writer",
      "_admin",
      "_replicator"
    ],
    "nobody": []
  },
  "_id": "_security"
}

### hit ctrl-d *twice* to terminate and send the input ###

{"ok": true}

Note: here I use the -d @- option to get curl to read from stdin. This means you can just paste the new _security document, then hit ctrl-d twice to terminate and send the request body.

Here we’ve added blemanitillyindstooksong to the user list. Check using a GET to the security document to make sure it worked.

An interesting note: currently, updates to _security don’t require you to supply a _rev with the request, as you would for normal document updates. This is because the _security document is unversioned; it’s never replicated. There have been a few thoughts about requiring a _rev in future releases, so keep alert in case this changes.

3. Make requests using the API key

Now blemanitillyindstooksong can make requests to the database. First, let’s check the anonymous user really can’t access the database:

> curl 'https://mikerhodes.cloudant.com/animaldb'
{"error":"unauthorized","reason":"_reader access is required for this request"}

And now show blemanitillyindstooksong can:

> curl -u blemanitillyindstooksong 'https://mikerhodes.cloudant.com/animaldb' 
{
  "db_name": "animaldb",
  [ ... ]
}

4. Grant the API key permissions on other databases

The key here is that the steps for other databases are exactly the same as above, so go through steps (2) and (3) to grant the API key access to more databases, using whatever combination of permissions you require.

More ways to use API keys and permissions

There are less obvious ways that API keys and permissions can be used. The main two are:

  • You can grant permissions to an API key for databases hosted on accounts other than the one used to generate the API key.
  • You can grant permissions to another Cloudant user for a database on your account.

Granting permissions to an API key generated on a different Cloudant account

Here, I take the API key we generated above using a request to the mikerhodes account and I grant it permissions on a database on another of my accounts, mikerhodesporter:

> curl 'https://mikerhodesporter.cloudant.com/testuserpost/_security' \
    -XPUT -u mikerhodesporter -d @-
{
  "_id": "_security",
  "cloudant": {
    "nobody": [],
    "blemanitillyindstooksong": [
      "_reader",
      "_replicator"
    ]
  }
}{"ok":true}

And now I can use that API key to access the database:

> curl -u blemanitillyindstooksong \
    'https://mikerhodesporter.cloudant.com/testuserpost'
{
  "db_name": "testuserpost",
  [ ... ]
}

Allow different Cloudant accounts access to your databases

Here I grant my account mikerhodesporter access to a database on my mikerhodes account. This allows the owner of the mikerhodesporter account to access data in the animaldb database on my mikerhodes account.

Initially, of course, I cannot use mikerhodesporter to access the database:

> curl -u mikerhodesporter 'https://mikerhodes.cloudant.com/animaldb'
{"error":"forbidden","reason":"_reader access is required for this request"}

So I update the security for the database using my mikerhodes account:

> curl 'https://mikerhodes.cloudant.com/animaldb/_security' \
    -XPUT -u mikerhodes -d @-
{
  "_id": "_security",
  "cloudant": {
    [ ... ]
    "mikerhodesporter": [
      "_reader",
      "_writer"
    ],
    [ ... ]
  }
}{"ok":true}

And now I’m able to access the database using the mikerhodesporter account credentials:

> curl -u mikerhodesporter 'https://mikerhodes.cloudant.com/animaldb'
{
  "db_name": "animaldb",
  [ ... ]
}

Recap

This was a bit of a whirlwind tour through Cloudant’s permissions and user mechanism. What are the key points?

  • An API key can be given permissions on any database within Cloudant, unlike what the dashboard implies.
  • A Cloudant account can be given permissions on any database within Cloudant using the same tools. This underlies Cloudant’s sharing functionality, which layers UI on top of these building blocks.

Broadly, Cloudant account and API key credentials are universal in that they can be used across the service to grant access to databases.