Tuesday
Mar 17 2015

Caching in transit-cljs

So I've been doing a lot of coding in ClojureScript lately and I had to do some work with transit-cljs.

One of the things I had to test was whether I could do comprehensive caching and de-duping of objects serialized with transit-cljs. I based my tests on transit-js-caching, but that article's example is in JavaScript rather than ClojureScript, so I needed to convert it. Here is how I did it; you will need to read the original article for the motivation behind each step.

1) Include the transit-cljs library in your code

 (ns main.transit
    (:require [cognitect.transit :as t]))

2) Define a Point type in ClojureScript

 (defrecord Point [x y])

You can also use deftype here, but defrecord gives you more for free, such as value-based equality.

3) Define a custom write handler with transit-cljs and a function that uses it

 (def PointHandler 
   (t/write-handler (fn [v, h] "point") (fn [v, h] [(:x v) (:y v)])))

 (def writer (t/writer "json" {:handlers {Point PointHandler}}))

 (defn write [x]
   (t/write writer x))

4) Use this writing function

 (def p (Point. 1 2))

 (print (write p))
 (print (write [p p p p]))

gives

 "[\"~#point\",[1,2]]"
 "[[\"~#point\",[1,2]],[\"^0\",[1,2]],
  [\"^0\",[1,2]],[\"^0\",[1,2]]]"

See how, by default, transit-cljs already caches the repeated tag ("~#point" becomes "^0") but still writes out each point's values in full.

5) Create a custom reader

 (defn read [x]
   (t/read (t/reader "json"
                     {:handlers {"point"
                                 (fn [v]
                                   (Point. (nth v 0)
                                           (nth v 1)))}})
           x))

6) Now we can do a round trip!

 (print (= p (read (write p))))
 (print (= [p p p p] (read (write [p p p p]))))

 true
 true

7) How do we write a caching writer?

 (defn caching-point-handler
   ([] (caching-point-handler (atom {})))
   ([cache]
    (t/write-handler 
      (fn [v, h] (if (get @cache v) 
                   "cache" 
                   "point")) 
      (fn [v, h] (let [id (get @cache v)] 
                   (if (nil? id)
                     (do 
                       (swap! cache 
                              #(assoc % v (count %)))
                       [(:x v) (:y v)])
                     id))))))

 (defn c-writer 
   []
   (t/writer "json" {:handlers {Point (caching-point-handler)}}))

 (defn c-write [x]
   (t/write (c-writer) x))

8) What does this produce?

 (print (c-write p))
 (print (c-write [p p p p]))

 "[\"~#point\",[1,2]]"
 "[[\"~#point\",[1,2]],[\"~#cache\",0],
  [\"^1\",0],[\"^1\",0]]"

Note that the second point is written as a cache reference, and the "cache" tag itself is replaced by "^1" in the third and fourth items.

9) What does the reader look like?

 (defn c-read [x]
   (let [cache (atom {})]
     (t/read (t/reader "json"
                       {:handlers {"point"
                                   (fn [v]
                                     (let [point (Point. (nth v 0)
                                                         (nth v 1))]
                                       (swap! cache
                                              #(assoc % (count %) point))
                                       point))
                                   "cache"
                                   (fn [v]
                                     (get @cache v))}})
             x)))

10) Now we can round trip

 (print (= p (c-read (c-write p))))
 (print (= [p p p p] (c-read (c-write [p p p p]))))

 true
 true
Friday
Feb 03 2012

My top n tips for Python coding in optimisation

  1. Always write code for readability first. Premature optimisation is the root of a lot of evil and often makes the code more difficult to improve later. See http://chrisarndt.de/talks/rupy/2008/output/slides.html for what is Pythonic. The Python language has really been written to make the clearest code the most efficient; see Ted's example, where

       >>> for key in my_dict:

    looks better and is more efficient than

       >>> for key in my_dict.keys():

    which is inefficient as it creates a new list instead of returning an iterator.

  2. If your code is slow, use a profiler to find out why before you make changes: either use the logging module and print timestamps, or use the cProfile module. I can't tell you the number of times I have assumed the code was slow for one reason and then found it was another. http://code.google.com/p/jrfonseca/wiki/Gprof2Dot is also an excellent tool for making pretty graphs. There is a short profiling sketch at the end of this list.

  3. Rigorously validate any optimisation (the sketch at the end of this list shows one way with timeit). If your changes don't speed up the code, revert them. This is a sliding scale with readability, as above: if your change improves readability but does not change the execution time, leave it in; if your change makes the code impossible to read and understand, use lots of comments and only keep it if the speed increase is more than 30% or one minute.

  4. (Actual Python hints start here.) Too much filtering: I have seen this a number of times; when using list comprehensions, the programmer filters the same list multiple times. For instance, in a machine scheduling problem (written in PuLP, where product_mapping is a dictionary mapping each product to the machines it can be made on and allocate is a dictionary of variables allocating products to machines):

       
       for m in machines:
           prob += lpSum(allocate[p, m] for p in products if m in product_mapping[p]) <= 10
    

    If there are a large number of products it is much better to iterate through that mapping once and compile a mapping of machines to products:

       
       machine_mapping = {}
       for p, m_list in product_mapping.iteritems():
           for m in m_list:
               machine_mapping.setdefault(m, []).append(p)
       for m in machines:
           prob += lpSum(allocate[p, m] for p in machine_mapping[m]) <= 10

  5. For large lists try to use generator expressions instead of list comprehensions, or in other words return iterators instead of full lists. In Ted's example dict.keys() returns a new list that, if the dictionary is big, can hurt performance, while dict.iterkeys() returns an iterator and does not have to build a new list each time. Note that

       >>> for key in my_dict:
    
    and
       >>> for key in my_dict.iterkeys():
    
    are equivalent

    Example 2 (summing the even numbers less than 1000). Bad:

          >>> total = 0
          >>> for i in [j for j in range(1000) if j % 2 == 0]:
          ...        total += i
    
    Better (generator expression):
          >>> total = 0
          >>> for i in (j for j in range(1000) if j % 2 == 0):
          ...        total += i
    
    Even better (xrange instead of range):
          >>> total = 0
          >>> for i in (j for j in xrange(1000) if j % 2 == 0):
          ...        total += i
    
    Best:
          >>> total = sum(j for j in xrange(1000) if j % 2 == 0)
    

    This is only an issue with large lists; n = 1000000 in this case:

          >>> from timeit import timeit
          >>> timeit('sum([j for j in range(1000000) if j % 2 == 0])', number=100)
          27.90494394302368
          >>> timeit('sum(j for j in range(1000000) if j % 2 == 0)', number=100)
          13.040030002593994
          >>> timeit('sum([j for j in xrange(1000000) if j % 2 == 0])', number=100)
          15.114178895950317
          >>> timeit('sum(j for j in xrange(1000000) if j % 2 == 0)', number=100)
          10.272619009017944
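
To make tips 2 and 3 concrete, here is a minimal sketch of how I would profile a slow function with cProfile and then check a candidate optimisation with timeit before deciding whether to keep it. The function names build_schedule and build_schedule_fast are made up for illustration; substitute whatever you are actually tuning.

import cProfile
import pstats
import timeit


def build_schedule():
    # stand-in for the slow code you want to investigate
    return sum(j for j in xrange(1000000) if j % 2 == 0)


def build_schedule_fast():
    # candidate optimisation of the same calculation
    return sum(xrange(0, 1000000, 2))

# tip 2: profile first, so you know where the time actually goes
profiler = cProfile.Profile()
profiler.runcall(build_schedule)
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)

# tip 3: validate the change with timeit before keeping it
slow = timeit.timeit(build_schedule, number=10)
fast = timeit.timeit(build_schedule_fast, number=10)
print 'before %.3fs, after %.3fs' % (slow, fast)
# if it is not clearly faster (and it is harder to read), revert it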

Monday
Jan 23 2012

Quick note about stress testing on Google App Engine

As a follow-up to my last post on GAE, here is a script that hammers a GAE application to make sure it does not fail under load. Note the following:

  1. Change the url for the endpoint to your app.
  2. Change the test to be appropriate for what should be returned from your app.
  3. The test endpoint should be unique (see the fake query string) or Google will cache the results and your app will only see about two requests a second.
"""
Stress tests an endpoint of a GAE application
"""
# rate is in requests per second
rate = 5
# test time is in seconds
test_time = 120
# url to hit has to be unique so that the request hits 
# the app not the cache
url = 'http://your-app.appspot.com/endpoint?test=%s'

test_string = "This string should be in the html response"

import time
import multiprocessing
import urllib
import random


def test():
    """
    Fetch the endpoint once and print 'Failed' if the expected
    string is not in the response.
    """
    url_open = urllib.urlopen(url % random.random())
    if test_string not in url_open.read():
        print 'Failed'

if __name__ == '__main__':
    processes = []
    start = time.time()
    while time.time() <= start + test_time:
        p = multiprocessing.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(1.0 / rate)
    for p in processes:
        p.join()
    print 'Tested url %s times' % len(processes)

Thursday
Jan 19 2012

Auckland NZPUG presentation

Last night I did a short presentation introducing Google App Engine (GAE). The presentation is found here and the files associated with it are here; these include a sample GAE application which is set up for testing.

Monday
Oct 31 2011

You cannot pickle an xml.etree Element

Yes, that is right, you will get an error that looks like this:

PicklingError: Can't pickle <function copyelement at 0x226d140>: it's not found as
__main__.copyelement

Great how pickle doesn't tell you that it is an xml.etree Element that is causing the problem, or what this element is attached to.

I have been using the multiprocessing module, which internally uses pickle to transfer inputs and outputs between processes, and it will fail with an even more cryptic message if there is an Element object somewhere in the inputs or outputs.

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 310, in _handle_tasks
    put(task)
TypeError: expected string or Unicode object, NoneType found

To find this problem I replaced pool.map() with map() and pool.apply_async() with apply(). Then I used pickle.dumps() on the inputs and outputs to find what was causing the problem.

pickle.dumps(thingee)
pickle.dumps(a)
pickle.dumps(b)
result = apply(my_func, (thingee, a, b))

Then, once I had found the offending object, I had to subclass pickle.Pickler to get the debugger to jump in at the right spot.

import pickle
class MyPickler (pickle.Pickler):
    def save(self, obj):
        try:
            pickle.Pickler.save(self, obj)
        except Exception, e:
            import pdb;pdb.set_trace()

Then call it like so:

import StringIO
output = StringIO.StringIO()
MyPickler(output).dump(thingee)

And finally I got my answer