String theory, and practice.

Saturday, July 16, 2011

Harvesting Python strings

This article explains how the strings used in the C++ String Performance Benchmark were obtained.

First, Python 2.7.1 was built without threading support.

Then, the string object was hacked to log the method names and operands of the most common str operations, appending them to a single file.

The operations were printed in raw format (not escaped) and separated with a highly unlikely end-of-record marker (~ĥùëÔr~, to fit with Perl's $/). The strings themselves were quoted {[(<like this>)]}.
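To make the record format concrete, here is a minimal Python sketch of what writing one log record could look like. The real hack lived inside the interpreter's string object, so this is only an illustration: the function names, the space between fields, and the UTF-8 encoding of the separator are all assumptions of mine, not details from the actual patch.

```python
# Illustrative sketch only: the actual logging was done inside CPython's
# string object. Field layout and byte encoding below are assumptions.

EOR = "~ĥùëÔr~".encode("utf-8")   # end-of-record marker (encoding assumed)
OPEN, CLOSE = b"{[(<", b">)]}"    # quoting used around each string operand


def format_record(method, *operands):
    """Build one raw (unescaped) record: method name, then quoted operands."""
    parts = [method.encode("ascii")]
    for operand in operands:
        parts.append(OPEN + operand + CLOSE)
    # Fields are joined with a single space (hypothetical layout).
    return b" ".join(parts) + EOR


def log_operation(path, method, *operands):
    """Append one record to the single shared log file."""
    with open(path, "ab") as log:
        log.write(format_record(method, *operands))
```

For example, `format_record("find", b"hello world", b"wor")` produces the method name followed by the two quoted operands and the end-of-record marker. Nothing is escaped, which is why the separator has to be a byte sequence that is very unlikely to occur inside a real string.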

Then, the test suite was run. The log file quickly grew to 3GB, at which point I decided to stop. Here are the results:
This file was in turn parsed with a little Perl script that turned \0 characters into \1's and the end-of-record separators into \0's, gathering some statistics along the way and picking 1% of the records at random for the actual benchmark (this biases equality testing a bit...).
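The original post-processing was a Perl script; the sketch below redoes the same steps in Python purely for illustration. It assumes the whole log fits in memory (the real 3GB file would need streaming), that the method name is the first space-delimited field of a record, and that "1%" means each record is kept independently with probability 0.01. All function names are hypothetical.

```python
import random
from collections import Counter

EOR = "~ĥùëÔr~".encode("utf-8")  # end-of-record marker (encoding assumed)


def convert(raw, eor=EOR):
    """Turn \\0 bytes into \\1, then end-of-record separators into \\0."""
    # Order matters: existing NULs must be remapped before NUL becomes
    # the record terminator.
    return raw.replace(b"\0", b"\1").replace(eor, b"\0")


def split_records(converted):
    """Split NUL-terminated records, dropping the empty trailing piece."""
    return [record for record in converted.split(b"\0") if record]


def method_counts(records):
    """Count records per method name (first field; layout is an assumption)."""
    return Counter(record.split(b" ", 1)[0] for record in records)


def sample(records, fraction=0.01, seed=None):
    """Keep each record independently with the given probability."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < fraction]
```

After the conversion, a NUL byte can only mean "end of record", so the C++ side can read the records as ordinary NUL-terminated strings. The cost is that the few strings that originally contained \0 are slightly altered.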

I guess I would have to run this on Django or something to have real real-world strings, but well... that's enough for now ;).

You can find more details on the benchmark page.
