Bumping re-frame CI runners to ubuntu-latest

Quick maintenance task today on re-frame. The GitHub Actions runners were still pinned to an older Ubuntu version, so I updated them to ubuntu-latest along with the actions/cache dependency.

It's a small change but these CI maintenance tasks add up. Newer runners mean faster builds and better caching. The docs workflow also needed updating.

- runs-on: ubuntu-20.04
+ runs-on: ubuntu-latest

Ten years of maintaining re-frame now. Time flies.

Consolidating workers into a unified Docker container

Major refactor this week - consolidating curves_api, optim8_api, and maxim8 into a single unified worker module.

This removed about 700 lines of duplicated code and simplified our deployment significantly.

The tricky bit was implementing graceful shutdown handling. Each worker needs to finish its current job before the container stops:

import logging
import signal

logger = logging.getLogger(__name__)
running = True  # cleared by the signal handler to request shutdown

def handle_sigterm(signum, frame):
    logger.info("Received SIGTERM, finishing current job...")
    global running
    running = False

signal.signal(signal.SIGTERM, handle_sigterm)
signal.signal(signal.SIGINT, handle_sigterm)
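
Each worker's main loop then consumes the flag. A minimal sketch with a hypothetical queue-based job source (names are mine, not the real module's):

```python
import queue

def worker_loop(jobs, handle, keep_running):
    """Drain jobs until keep_running() goes False; always finish the job in hand."""
    while keep_running():
        try:
            job = jobs.get(timeout=1)  # short timeout so the flag is re-checked
        except queue.Empty:
            continue
        handle(job)  # runs to completion even if SIGTERM arrived mid-job
```

Wired up as `worker_loop(q, process_job, lambda: running)` against the flag the handler flips.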

Also added a health check aggregator that reports healthy only when all three workers are ready. Much cleaner architecture now.

Socket timeouts and retry cascades

Spent a few days chasing down a cascading failure in our webhook logging. Turns out when one service times out, the retry logic was hammering it repeatedly, which caused more timeouts, which caused more retries...

The fix was two-fold. First, set explicit socket timeouts to fail fast:

import socket
socket.setdefaulttimeout(30)  # 30 seconds max

Second, add exponential backoff to the retry logic instead of immediate retries. Simple stuff but easy to forget when you're focused on the happy path.
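
A minimal sketch of that backoff logic, with names of my own choosing (capped exponential delay plus full jitter, so failing clients don't retry in lockstep):

```python
import random
import time

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Yield one sleep duration per retry: base * 2^attempt, capped, with jitter."""
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        # full jitter: pick a random point in [0, delay] to spread retries out
        yield random.uniform(0, delay)

def call_with_retries(fn, max_retries=5, base=0.5):
    """Call fn, sleeping with exponential backoff between failed attempts."""
    for delay in backoff_delays(max_retries, base=base):
        try:
            return fn()
        except OSError:
            time.sleep(delay)
    return fn()  # final attempt: let any exception propagate to the caller
```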

Adding BVOD impressions to the reach and frequency API

MediaWise needed BVOD (Broadcast Video-on-Demand) support. The tricky part was handling curves that are BVOD-only versus mixed linear+BVOD.

The reach and frequency API needed updating to call out to the forecasting service for BVOD impressions, then merge those with the linear TV data. Also had to add VOZ market filtering and fix some XPath expressions that assumed linear-only data.

Market-specific logic is always messier than you expect.

ClickHouse and 64-bit integers

Building alloc8, our new allocation optimizer, I chose ClickHouse for the audience data. Fast columnar queries, perfect for aggregating millions of audience records.

One gotcha: ClickHouse returns 64-bit integers as strings in JSON responses to avoid JavaScript precision issues. Needed custom parsing on our end.
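
The parsing fix can be sketched like this (field names are hypothetical; the point is converting known 64-bit columns back from strings after decoding):

```python
import json

INT64_FIELDS = {"audience_id", "impressions"}  # columns known to hold 64-bit integers

def parse_row(raw):
    """Decode one JSON row, converting stringified 64-bit integers back to int."""
    row = json.loads(raw)
    for field in INT64_FIELDS.intersection(row):
        row[field] = int(row[field])
    return row
```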

The other design decision was using hash-based allocation instead of randomization. When you're partitioning audiences, you want deterministic results - run the same query twice, get the same partitions. Random allocation makes debugging impossible.

def allocate_partition(audience_id, num_partitions):
    # note: for string ids use a stable digest (e.g. hashlib) instead,
    # since built-in hash() is randomized per process for strings
    return hash(audience_id) % num_partitions

CBC changed its command-line parsing

Got reports that PuLP was failing with newer CBC versions. The solver would just ignore options like timeMode elapsed.

After some digging, turns out CBC changed its CLI parsing around v2.10.5. Options that used to work bare now need a hyphen prefix: -timeMode elapsed.

The fix in coin_api.py was simple - prepend hyphens to all parameters automatically:

def solve_CBC(self, lp, ...):
    # Old: cmds.append(option)
    # New:
    cmds.append('-' + option)

This maintains backward compatibility while supporting the new CBC convention. PR #637 merged and users can upgrade their solver without breaking their code.
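
Sketched as a standalone helper (a hypothetical name, not PuLP's actual API), the normalisation is just:

```python
def format_cbc_options(options):
    """Prefix each CBC option with a hyphen, leaving already-prefixed ones alone.

    CBC >= 2.10.5 wants a leading hyphen on CLI options; older builds accept
    the hyphen too, so normalising keeps both conventions working.
    """
    return ['-' + opt.lstrip('-') for opt in options]
```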

Floating licenses and solver cleanup in PuLP

Had an interesting issue with Xpress solver integration in PuLP when deploying on a cluster with floating licenses.

The problem: after solving, PuLP calls xp.free() to clean up. But this releases the license before you can access the solution on the problem object. Works fine with node-locked licenses, breaks with floating license servers.

The fix requires decoupling the license lifecycle from the problem object lifecycle. Filed issue #585 to track this - it's a pattern that affects other commercial solvers too (Gurobi, Mosek).

Server-side rendering with re-frame and dynamic app-db

Merged PR #738 to re-frame today. It enables server-side rendering by making app-db a dynamic variable instead of a static atom.

The problem with SSR in re-frame was that all requests shared the same app-db. In ClojureScript that's fine (single-threaded), but on the JVM you need request isolation.

The solution uses Clojure's dynamic binding:

(with-bindings {#'re-frame.db/app-db-id db-id}
  ;; Each request gets its own app-db instance
  (render-to-string [my-component]))

ClojureScript behavior is unchanged - single instance. But now Clojure can handle concurrent requests with isolated state. Clean.

Updating PuLP's CBC solver to 2.10.3

Finally updated the CBC binaries bundled with PuLP. We'd been shipping v2.3 from February 2015 - a six-year-old solver!

The new v2.10.3 (December 2019) comes from AMPL's solver repository with static linking for maximum portability. Should work across Linux, macOS, and Windows without dependency issues.

The tricky part was testing across all platforms. Asked for help in PR #426 to verify it works on systems I don't have access to. Open source maintenance requires community.

Fixing re-frame in Web Workers

re-frame v1.0.0-rc2 broke in Web Workers. Users were getting "window is not defined" errors - a regression from v0.12.0.

The fix was simple once I found it. In interop.cljs, we were using js/window to detect the environment. But Web Workers don't have a window object.

;; Old (broken in workers)
(def ^:private request-animation-frame
  (.-requestAnimationFrame js/window))

;; New (works everywhere)
(def ^:private request-animation-frame
  (.-requestAnimationFrame js/self))

js/self refers to the window in the main thread, and the worker object in a Web Worker. PR #615 merged.

Python 3.8 literal comparison warnings in PuLP

Python 3.8 got stricter about using is for literal comparisons. PuLP was throwing SyntaxWarnings on import:

# Bad (triggers warning)
if x is 0:
    ...

# Good
if x == 0:
    ...

Also had to fix some invalid escape sequences in docstrings. The \s in a docstring is now a warning unless you use a raw string.
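
For example (a hypothetical function, not one of PuLP's), a raw docstring keeps the backslash literal and warning-free:

```python
import re

# In a normal string, "\s" is an invalid escape sequence (a DeprecationWarning
# since Python 3.6); prefixing the docstring with r avoids the warning.
def tokenize(text):
    r"""Split text on runs of whitespace (anything matching \s+)."""
    return re.split(r"\s+", text.strip())
```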

These Python version upgrades always surface old code patterns that used to be fine but are now deprecated. PR #259 fixed it.

Moving PuLP CI to GitHub Actions

Finally migrated PuLP's CI from Travis to GitHub Actions. The new setup tests across Linux, macOS, and Windows with multiple Python versions in parallel.

Also dropped Python 3.4 support and removed flake8 from the test suite. Nobody's running 3.4 anymore and flake8 was failing on code that worked fine.

Had to debug the workflow in November (PR #239) - some edge cases with the test matrix weren't working. CI/CD is never set-and-forget.

Testing PuLP across Python versions

Expanded PuLP's test matrix to cover Python 3.5 through 3.8. With scientific computing libraries, you can't just test on latest - people have production environments pinned to specific versions.

PR #200 added the additional Python version configs. It's more CI time but catches compatibility issues before users hit them.

SafeConfigParser deprecation in Python 3

Python 3.2 deprecated SafeConfigParser in favor of plain ConfigParser. PuLP's solver configuration was using the old name and throwing deprecation warnings on every import.

# Old (deprecated)
from configparser import SafeConfigParser
config = SafeConfigParser()

# New
from configparser import ConfigParser
config = ConfigParser()

Small fix but deprecation warnings make users nervous. They shouldn't have to wonder if the library is maintained.

Adding IAM roles and spot instances to dask-ec2

Contributed a couple of features to dask-ec2, a tool for spinning up Dask clusters on AWS.

First was IAM role support (PR #60). Workers need AWS API access for things like S3, and passing credentials around is a security headache. With IAM roles, the instance just assumes the role automatically.

Second was initial spot instance support (PR #66). Spot instances are 60-90% cheaper than on-demand, which matters when you're running big computation clusters. The tricky part is handling the instance lifecycle - spots can be terminated with 2 minutes notice.
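
One way to watch for that notice is to poll the instance metadata service's spot termination endpoint; a sketch (the injectable fetch hook is my own addition so the check can be exercised off-EC2):

```python
import urllib.request

# AWS's documented spot-interruption endpoint: 404s until a termination is scheduled
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def termination_imminent(fetch=None):
    """Return True if the metadata service reports a scheduled spot termination."""
    if fetch is None:
        fetch = lambda: urllib.request.urlopen(TERMINATION_URL, timeout=1).read()
    try:
        return bool(fetch())
    except Exception:
        return False  # 404 or timeout means no termination is scheduled
```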

Building a visual debugger for re-frame

Been working on re-frisk, a debugging tool that shows your re-frame app-db as an interactive tree. When you're tracking down why a component isn't updating, being able to see the actual state structure is invaluable.

It hooks into data-frisk-reagent for the tree visualization. You can expand/collapse nodes, see what changed after each event, and trace how state flows through your subscriptions.

Debugging tools are unglamorous but they save hours of println debugging.

Reagent 0.6.0 broke nil handling in inputs

Reagent 0.6.0 introduced a subtle breaking change. Input components that used to accept nil values now require explicit booleans.

In 0.6.0-rc, you could write:

[:input {:type "checkbox"
         :checked item-id}]  ; nil or truthy

But 0.6.0 final is stricter - it wants actual true/false:

[:input {:type "checkbox"
         :checked (some? item-id)}]

Created a bug demo repo to document this, then fixed re-com's checkbox components (PR #111). When frameworks change their contracts, downstream libraries need updating.

Caching in transit-cljs

I've been doing a lot of ClojureScript coding lately, and some of that work involved transit-cljs.

One of the things I had to test was whether I could do comprehensive caching and de-duping of objects serialized with transit-cljs. I based my tests on transit-js-caching, but that article's example is in JavaScript rather than ClojureScript, so I needed to convert it. Here is how I did it; you will need to read the original article for the motivation behind each step.

1) Include the transit-cljs library in your code

(ns main.transit
   (:require [cognitect.transit :as t]))

2) Define a record in ClojureScript

(defrecord Point [x y])

You can also use deftype, but defrecord gives you more for free, such as value-based equality.

3) Define a custom write handler with transit-cljs, and a function that uses it

(def PointHandler
  (t/write-handler (fn [v, h] "point") (fn [v, h] [v.x v.y])))

(def writer (t/writer "json" {:handlers {Point PointHandler}}))

(defn write [x]
  (t/write writer x))

4) Use the writing function

(def p (Point. 1 2))

(print (write p))
(print (write [p p p p]))

gives

"[\"~#point\",[1,2]]"
"[[\"~#point\",[1,2]],[\"^0\",[1,2]],
 [\"^0\",[1,2]],[\"^0\",[1,2]]]"

Note that by default transit-cljs caches repeated keys in the JSON output.

5) Create a custom reader

(defn read [x]
   (t/read (t/reader "json"
                     {:handlers {"point"
                                (fn [v]
                                    (Point. (aget v 0)
                                            (aget v 1)))}})
           x))

6) Now we can do a roundtrip!

(print (= p (read (write p))))
(print (= [p p p p] (read (write [p p p p]))))

true
true

7) How do we write a caching writer?

(defn caching-point-handler
  ([] (caching-point-handler (atom {})))
  ([cache]
   (t/write-handler
     (fn [v, h] (if (get @cache v)
                  "cache"
                  "point"))
     (fn [v, h] (let [id (get @cache v)]
                  (if (nil? id)
                    (do
                      (swap! cache
                             #(assoc % v (count %)))
                      [v.x v.y])
                    id))))))

(defn c-writer
  []
  (t/writer "json" {:handlers {Point (caching-point-handler)}}))

(defn c-write [x]
  (t/write (c-writer) x))

8) What does this produce?

(print (c-write p))
(print (c-write [p p p p]))

"[\"~#point\",[1,2]]"
"[[\"~#point\",[1,2]],[\"~#cache\",0],
 [\"^1\",0],[\"^1\",0]]"

Note that the second item is replaced by a cache reference, and from the third item on the "cache" tag itself is key-cached as ^1.

9) What does the reader look like?

(defn c-read [x]
  (let [cache (atom {})]
    (t/read (t/reader "json"
                      {:handlers {"point"
                                  (fn [v]
                                    (let [point (Point.
                                                   (aget v 0)
                                                   (aget v 1))]
                                      (swap! cache
                                             #(assoc %
                                               (count %) point))
                                      point))
                                  "cache"
                                  (fn [v]
                                    (get @cache v))}})
            x)))

10) Now we can roundtrip:

(print (= p (c-read (c-write p))))
(print (= [p p p p] (c-read (c-write [p p p p]))))

true
true

My top n tips for python coding in Optimisation

  1. Always write code for readability first. Premature optimisation is the root of a lot of evil and often makes the code harder to improve later. See http://chrisarndt.de/talks/rupy/2008/output/slides.html for what is Pythonic. The Python language is designed so that the clearest code is usually also the most efficient; see Ted's example, where

    >>> for key in my_dict:

    looks better and is more efficient than

    >>> for key in my_dict.keys():

    which is inefficient as it creates a new list instead of returning an iterator

  2. If your code is slow, use a profiler to find out why before you make changes. Either use the logging module and print timestamps, or use the cProfile module. I can't tell you the number of times I have assumed the slow code was slow for one reason and then found it was another. Also, http://code.google.com/p/jrfonseca/wiki/Gprof2Dot is an excellent tool for making pretty call graphs.

  3. Rigorously validate any optimisation. If your changes don't speed up the code, revert them. This is a sliding scale against readability, as above: if your change improves readability but does not change the execution time, leave it in. If your change makes the code hard to read and understand, use lots of comments and only keep it if the speed increase is more than 30% or one minute.

  4. (Actual Python hints start here.) Too much filtering. I have seen this a number of times: when using list comprehensions, the programmer filters the same list multiple times. For instance, in a machine scheduling problem (written in PuLP, where product_mapping maps each product to the list of machines it can be made on and allocate is a dictionary of variables allocating products to machines):

    for m in machines:
        prob += lpSum(allocate[p, m] for p in products if m in product_mapping[p]) <= 10

    If there are a large number of products, it is much better to iterate through that mapping once and build a reverse mapping of machines to products:

    machine_mapping = {}
    for p, m_list in product_mapping.iteritems():
        for m in m_list:
            machine_mapping.setdefault(m, []).append(p)
    for m in machines:
        prob += lpSum(allocate[p, m] for p in machine_mapping[m]) <= 10
  5. For large lists, try to use generator expressions instead of list comprehensions; in other words, return iterators instead of full lists. In Ted's example, dict.keys() returns a new list, which can hurt performance if the dictionary is big, while dict.iterkeys() returns an iterator and does not have to build a new list each time.

    Note

    >>> for key in my_dict:

    and

    >>> for key in my_dict.iterkeys():

    are equivalent

    Example 2 (summing the even numbers below 1000). Bad:

    >>> total = 0
    >>> for i in [j for j in range(1000) if j % 2 == 0]:
    ...        total += i

    Better (generator expression)

    >>> total = 0
    >>> for i in (j for j in range(1000) if j % 2 == 0):
    ...        total += i

    Even Better (xrange instead of range)

    >>> total = 0
    >>> for i in (j for j in xrange(1000) if j % 2 == 0):
    ...        total += i

    Best

    >>> total = sum(j for j in xrange(1000) if j % 2 == 0)

Quick note about stress testing on Google Appengine

As a follow-up to my last post on GAE, here is a script that hammers a GAE application to make sure it does not fail under load. Note the following:

  1. Change the url for the endpoint to your app.
  2. Change the test to be appropriate for what should be returned from your app.
  3. The test endpoint should be unique (see the fake query string) or Google will cache the results and your app will only see about two requests a second.
"""
Stress tests an endpoint of a GAE application
"""
# rate is in requests per second
rate = 5
# test time is in seconds
test_time = 120
# url to hit has to be unique so that the request hits
# the app not the cache
url = 'http://your-app.appspot.com/endpoint?test=%s'

test_string = "This string should be in the html response"

import time
import multiprocessing
import urllib
import random


def test():
    """
    The test function
    """
    url_open = urllib.urlopen(url%random.random())
    if test_string in url_open.read():
        pass
    else:
        print 'Failed'

if __name__ == '__main__':
    processes = []
    start = time.time()
    while time.time() <= start + test_time:
        p = multiprocessing.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(1.0 / rate)
    for p in processes:
        p.join()
    print 'Tested url %s times' % (test_time * rate)

Auckland NZPUG presentation

Last night I gave a short presentation introducing Google App Engine (GAE). The presentation is found here, and the files associated with it are here, including a sample GAE application which is set up for testing.

You can not pickle an xml.etree Element

Yes, that's right: you will get an error that looks like this:

PicklingError: Can't pickle <function copyelement at 0x226d140>: it's not found as
__main__.copyelement

Great how pickle doesn't tell you that it is an xml.etree.Element causing the problem, or what the element is attached to.

I have been using the multiprocessing module, which internally uses pickle to transfer inputs and outputs between processes; it fails with an even more cryptic message if there is an Element object somewhere in the inputs or outputs.

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 310, in _handle_tasks
    put(task)
TypeError: expected string or Unicode object, NoneType found

To find this problem I replaced pool.map() with map() and pool.apply_async() with apply(). Then I used pickle.dumps() on the inputs and outputs to find what was causing the problem:

pickle.dumps(thingee)
pickle.dumps(a)
pickle.dumps(b)
result = apply(my_func, (thingee, a, b))

Then, once I had found the offending object, I subclassed pickle.Pickler to make the debugger jump in at the right spot.

import pickle
class MyPickler (pickle.Pickler):
    def save(self, obj):
        try:
            pickle.Pickler.save(self, obj)
        except Exception, e:
            import pdb;pdb.set_trace()

Then call it like so

import StringIO
output = StringIO.StringIO()
MyPickler(output).dump(thingee)

And finally I got my answer

Notes on building C extensions with python

Previously, when I needed to access C libraries from Python, I used the ctypes library.

However, while working on the dippy module, I have needed to link into some fairly complicated C code (the DIP library).

Dippy is a Python extension module that directly uses the Python C API. As Qi-Shan Lim and Michael O'Sullivan wrote dippy, I will not go into the details of its implementation, but will instead discuss a toy example using Cython.

So, taken from the basic Cython tutorial (altered to use setuptools), we start with a hello world example.

Preliminaries

  1. Install Cython (on Ubuntu: $ sudo apt-get install cython)
  2. Setup a virtual environment to play with
$ mkdir cython-example
$ cd cython-example
$ virtualenv .  #note you can't use --no-site-packages as you need cython
$ source bin/activate
(cython-example)$

The Hello World example

Create these files:

helloworld.pyx

print "Hello World"

and setup.py:

#!/usr/bin/env python

from setuptools import setup
from distutils.extension import Extension

import sys
if 'setuptools.extension' in sys.modules:
    m = sys.modules['setuptools.extension']
    m.Extension.__dict__ = m._Extension.__dict__

setup(
    setup_requires=['setuptools_cython'],
    ext_modules = [Extension("helloworld", ["helloworld.pyx"],
                             language="c++")]
)

Then do the following:

(cython-example)$ python setup.py build_ext -i
(cython-example)$ python
>>> import helloworld
Hello World

You will also see a helloworld.c generated by Cython.

Finding python memory leaks with objgraph

I have had a rogue memory leak in one of my programs for a while, but I have now been able to track it down.

It hadn't been an issue until recently, when I started trying to solve a large number of problems. I did some googling and came across objgraph, a module that lets you graphically explore the objects that remain in Python's memory.

As Python is a garbage-collected language, memory leaks tend to be caused by one of these reasons:

  • Accidentally adding a reference to objects to something in the global scope so they are never garbage collected
  • Circular references that contain an object with a custom __del__() method
  • Memory leakage in a C extension module
  • some other reasons that I have not encountered :-)

To install objgraph for interactive use in Ubuntu:

$ sudo apt-get install python-pygraphviz
$ sudo pip install xdot
$ sudo pip install objgraph

Here is my example using the pulp library:

import objgraph
for i in range(10):
    objgraph.show_growth(limit=3)
    create_and_solve_model()
objgraph.show_growth()
import pdb; pdb.set_trace()

If all was working well, the model would have gone out of scope and disappeared by the second call to objgraph.show_growth(). However, I get the following:

dict                3951      +301
list                2091      +170
LpVariable          1200      +120
...
(Pdb)

As you can see, something has gone wrong and objects are staying in memory. After looking through my code I found a circular reference and deleted the offending line. Memory leak disappeared!

skeleton a template for python projects

I presented a talk on Sphinx at the Auckland Python users group. Instead of trying to present all the various interactions between Sphinx and setup.py, I used skeleton, a tool that can be used to create Python projects.

Skeleton uses templates to create a ready-made project, similar to pastescript.

I made a fork of this project on github and added a template where sphinx is integrated into the project. This fork is currently available as skeleton_stu on pypi until the changes are merged to the original project.

To create a basic sphinx package:

$pip install skeleton_stu
$skeleton_package_sphinx [your_directory]

Answer a few questions and then you are done :-)

Squarespace website

Well I'm converting my website to use squarespace.

Seems okay, but I would like WYSIWYG table construction.

random.seed() and python module imports

Interesting factoid on the random library.

I use random.seed() in my tests to get reproducible numbers (to test graphing and stats functions) and I have found the following unexpected behavior.

>>> import random

imports the random module as a singleton. Therefore, if you write:

>>> random.seed(0)

it will actually set a seed for all code (including library code) that uses the random module. Worse than that, if your test code calls a library that uses random, that call will consume a number from the sequence, and the random numbers in your tests will no longer be in the same sequence as they would have been without the library call.

Therefore my blanket recommendation for all code that uses random.seed() is that the import line be changed to the following:

>>> from random import Random
>>> random = Random()

This will give you a new instance of random that will only be used within your module scope, and all previous calls to random.seed() etc will continue to work.
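
To see the isolation in action:

```python
import random as shared_random
from random import Random

rng = Random(0)        # module-local generator with its own state
shared_random.seed(0)  # the process-wide singleton every importer shares

a = rng.random()            # first draw from the local stream
b = shared_random.random()  # first draw from the shared stream: same value
shared_random.random()      # advance ONLY the shared stream
c = rng.random()            # local stream undisturbed: still its second draw
```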

How to root and update your htc hero

I just did this because I was having problems mounting and tethering my Hero.

  1. Download and install universal android root - remember to allow non-market applications, and use a file manager to install the apk
  2. Download and install Titanium Backup (from the market) - use this to backup your apps and data
  3. Download and install ROM manager (from the market) - this will install a recovery image
  4. Download a ROM from http://android-roms.net/hero/ - put this on the root of your SD card and rename it update.zip
  5. Use ROM manager to boot to recovery image
  6. Make a Nandroid backup of your current system
  7. Choose a factory reset and wipe dalvik cache in the recovery menu
  8. Install the new rom from update.zip

If anything goes wrong with the update, remove the battery (if your screen is frozen) and restore your original settings with Nandroid from the recovery menu.

nosetests function names

Be careful with function names in tests when using nosetests and Python.

Nosetests automatically runs all functions whose names contain the word test in discovered files. This can bite you if, for instance, you put a function build_test_smelter, used to build test data for a TestCase subclass, in an __init__.py file.

As nosetests will run this function as well (without the setUp and tearDown methods that clean your database), you will end up with two instances of your smelter instead of one. This will not show up if you run the tests in a single file, but will show up when you use:

$ bin/nosetests

from the command line, as it will then discover that function in your __init__.py file.
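
If renaming the helper is not an option, nose also honours an explicit `__test__ = False` marker on the callable; a sketch with a made-up fixture:

```python
def build_test_smelter():
    """Builds fixture data for the real tests; not itself a test."""
    return {"smelter": "fixture"}

# nose skips callables explicitly marked as non-tests, even if the
# name contains the word "test"
build_test_smelter.__test__ = False
```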