
Wednesday, September 29, 2010

Elementary Analysis of TinyPython Virtual Machine v1.1

OVERVIEW
TinyPython (TinyPy) is a minimalist implementation of Python in 64K of
code, originally created by Phil Hassey. He blogs at:

Its parser and byte-code compiler are written in TinyPython itself, which
makes it fully bootstrapped: TinyPy first converts a Python script (.py)
into a special TinyPy byte-code format (.tpc), and this generated
byte-code is then passed to a subset of the TinyPython source code called
the Virtual Machine (VM), where the actual execution takes place.

One can even extend this idea: if the VM is compiled into a low-level
format suited to a particular microcontroller, the VM can reside on that
chip, and any .tpc file can then be downloaded to the chip as its input.

A basic analysis of source code for TinyPy v1.1, excluding the
byte-code compilation phase, was performed. The focus was
on the working of the TinyPy Virtual Machine, which actually
executes the byte-code.

The TinyPython source code used was downloaded from:

BUILDING TINYPY
Initially, the file listing will look like the following figure:
Fig 1. TinyPy source file contents

   Initially, the 'build' folder will be empty. The 'doc',
'examples' and 'modules' folders may or may not contain
documents, Python scripts and batteries (modules)
respectively, depending on the downloaded package.
LICENSE.txt, README.txt, CHANGES.txt and
ROADMAP.txt describe the essential details of TinyPy.
The 'setup.py' script contains the code to build TinyPy
from scratch. At the terminal, type:
    python setup.py linux

This implies that a Python interpreter must already be available on
your system. The 'linux' option specifies that the code should be
compiled to work in a Linux environment.

Now, a new executable 'tinypy' will appear in the 'build' folder.
To fully bootstrap and test TinyPy, give the command:
    python setup.py linux boot

Now, in the 'tinypy' folder shown above, two new
executables, 'tinypy' and 'vm', will appear; these are the
TinyPy parser and Virtual Machine respectively. Notice
that new .pyc files for the corresponding .py files have
also been generated by the Python interpreter.

The general usage and some options available are listed below:
    python setup.py command [option] [module]

64k - build a 64k version of the TinyPy source code
blob - build a single 'tinypy.c' and 'tinypy.h'



The 'py2bc.py' script is used to convert a user-generated
Python script into its corresponding .tpc file:
    tinypy_path py2bc.py sample.py sample.tpc


Here, tinypy_path is the path (relative to the current directory)
of either the tinypy executable in the 'build' folder or the one
in the 'tinypy' folder. 'sample.py' is the name of the user script,
and 'sample.tpc' is the name given to the byte-code output file.


Finally, the generated byte-code (.tpc) is passed to the VM for
execution. Assuming the current directory is the 'tinypy' folder,
this is done as:
    ./vm sample.tpc

Or, equivalently, compiling and running the VM by hand:

    gcc vmmain.c -lm
    ./a.out sample.tpc

The file 'vmmain.c', present in the 'tinypy' folder, holds the main
function of the VM and automatically links in all the
other files necessary for the VM. The output is then
obtained and displayed. For a better picture, the files actually
needed for the VM are:
Fig 2. The files needed for the VM

For debugging and understanding the flow of control within
the source code, the 'gdb' and 'ctags' tools were used.
ANALYZING THE VM
A sample program 'sample.py':
    def add(a, b):
        c = a + b
        return(c)
    print(add(1, 9))
gives you the output:
      10

Each time the VM is invoked, a new instance of the VM is
created:
    tp_vm *tp = tp_init(argc, argv);

"Each object in tinypy is stored as a union in the C API."
tp_obj is tinypy's object representation.
typedef union tp_obj
{
    int type;
    tp_number_ number;
    struct{int type;int *data;}gci;
    tp_string_ string;
    tp_dict_ dict;
    tp_list_ list;
    tp_fnc_ fnc;
    tp_data_ data;
}tp_obj;

The field 'type' of the union indicates the type of the object.
A type value of 1 indicates that the object is a number,
type = 2 indicates a string, and type = 4 indicates a list.

In the sample program, a function in the VM 'tp_add' does the
job of adding two arguments (numbers, strings, lists) that are
passed to the function as input.
     tp_obj tp_add(TP, tp_obj a, tp_obj b)

The function definition of 'tp_add' contains three arguments:
   • TP -> the VM instance
   • tp_obj a -> the union variable representing the object 'a' in
                        the sample program
   • tp_obj b -> the union variable representing the object 'b' in
                        the sample program

'a' and 'b' are stored as unions and contain fields for types
such as numbers, strings, lists, etc.

Consider two simple lists in Python:
    a = [1,2,3]
    b = [4,5,6]

The VM uses a function 'tp_extend' which returns a union that
contains the new (extended) list. Its 'type' would be 4 indicating
a list. The 'list' structure variable includes a pointer *val that
points to another structure '_tp_list'.
    typedef struct _tp_list
    {
        int gci;
        tp_obj *items;
        int len;
        int alloc;
    }_tp_list;

The pointer *items, as you can see, is of type tp_obj;
dereferencing it gives you the tp_obj union. This
union contains a single element of the final list.

To obtain the next element of the final list:
    r.list.val->items
gives you the address of the current union, and

    address of the union holding the next element =
        address of the current union + size of one union

Here:
    r.list.val->items + 0x10 = 0x96ca048 + 0x10
gives you the union that stores the next element of the
list. The size of the union is 0x10, i.e. 16 bytes.
Fig 3. The representation of a list

THE PATH OF 'tp_add'
Fig 4. The backtraced path of 'tp_add'

   While debugging, our sample program and one of its
inputs can be spotted as:
Fig 5. Spotting the sample program
Fig 6. Spotting one of the inputs

   The switch() inside 'vm.c' is invoked nine times, one of
which results in the selection of case 'TP_IADD' and the
subsequent execution of 'tp_add'.
Fig 7. The case: 'TP_IADD'

Another trial debugging was done with input file 'testing.tpc', 
whose source script was 'testing.py' which just added two 
numbers. Only the filename 'testing.tpc' was given to the 
debugger.

The VM snapshot taken while debugging (showing the
name of the source script 'testing.py' and the statement
'print(a + b)' stored in the fields of the VM data structures),
the conclusion of this analysis, and more details can be
found in:

Wednesday, September 15, 2010

Memoization

Computers are very fast machines, but computation time
also depends on the algorithm we choose. If a proper
algorithm is not chosen, execution can take a very long
time.

If we write a simple recursive Fibonacci function for the
nth element, fib(n), we get fine results, but only for low
values of n. This is because the naive fib(n) has exponential
time complexity: its running time grows so fast that even for
modest values of n (where n is the size of the input), our
program will take a very long time to finish.

Memoization is the technique by which we speed this up.
The underlying principle is to create an associative array, say
dict, whose keys are n and whose values are fib(n). Each time
fib(n) is called, the associative array is checked for the key n.
If it is present, just return dict[n]; otherwise, compute the value
and store a new key: value pair in dict. The number of recursive
calls is thus drastically reduced by this look-in-the-lookup-table policy.

Of course, memoization improves time complexity, but it
consumes extra space.
The implementation in Python looked like this:

def memoized_fib(x):
    if dict.has_key(x):
        return dict[x]
    else:
        dict[x] = fib(x)
        return dict[x]

def fib(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return memoized_fib(n-2) + memoized_fib(n-1)

dict = {0: 0, 1: 1}
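
An equivalent way to package the same idea (not from the original post; just a sketch) is a small memoizing decorator:

def memoize(func):
    cache = {}                     # maps n -> already-computed fib(n)
    def wrapper(n):
        if n not in cache:
            cache[n] = func(n)     # compute once, reuse afterwards
        return cache[n]
    return wrapper

@memoize
def fib(n):
    if n < 2:
        return n
    return fib(n - 2) + fib(n - 1)

print(fib(100))                    # returns practically instantly

Because the decorated name 'fib' is the caching wrapper, every recursive call also goes through the cache.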


In Python, the recursion limit is fixed by the interpreter itself. It can
be queried with sys.getrecursionlimit() and changed with:
sys.setrecursionlimit(<limit>)

Now, set a very high limit and run the code. If it is high enough, you will
exhaust the stack (typically 8 MB) of the C backend of your Python
virtual machine, leading to a segmentation fault there, which is
propagated back to your Python interpreter, crashing Python itself!
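
A small sketch of inspecting and raising the limit (the default value and the exact crash point vary with the interpreter build, so treat the numbers as illustrative):

import sys

print(sys.getrecursionlimit())   # commonly 1000 by default

def countdown(n):
    if n == 0:
        return 0
    return countdown(n - 1)

sys.setrecursionlimit(3000)      # raise the limit; use with care
print(countdown(2500))           # fine now; a huge limit plus equally deep
                                 # recursion can still exhaust the C stack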

Monday, September 6, 2010

Classes in Python

In Python, "classes are objects too". Let me clarify.

Let me define a sample class.
>>> class abc: pass
>>> abc
<class __main__.abc at 0xb76e341c>
It indicates that in Python, classes are not abstract!
They too take memory space.


Let's explore further.
>>> m = abc
>>> m
<class __main__.abc at 0xb76e341c>
>>> m = abc()
>>> m
<__main__.abc instance at 0xb76e298c>
i.e., in Python, both classes and their objects are concrete.

Now,
>>> abc.p = 10
>>> abc.p
10
Classes can act as objects, as noted above!


Still further,
>>> m = abc()
>>> m.p
10
>>> m.q = 20
>>> m.q
20
>>> abc.q
AttributeError: class abc has no attribute 'q'
What exactly happened here?


Things to understand:
1. Here, "." works as a search operator.
2. On querying m.p, Python first checks whether there is a p attribute on m itself.
3. If not, it automatically finds the class of m and searches for p there.
4. The p attribute here actually belongs to abc, not m.
5. Assignment is different: m.q = 20 creates a q attribute on the instance m only, so abc never gets one.
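
One way to watch this lookup in action (a sketch; the bookkeeping entries and addresses will differ on your machine) is to inspect the attribute dictionaries involved:

class abc: pass

abc.p = 10            # stored on the class
m = abc()
m.q = 20              # stored on the instance only

print(abc.__dict__)   # contains 'p' (plus a few bookkeeping keys)
print(m.__dict__)     # contains only 'q'
print(m.p)            # 10: not found on m, so the search moves on to abc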

Sunday, September 5, 2010

C and Python: A Language Faceoff

1. Data Types
    Python is a good example of a dynamically typed programming
    language. This means there is no need for type declarations: any data
    can be initialized directly, and the corresponding type is automatically
    inferred by the Python interpreter.
    >>> a = 12
    >>> c = 'hello'

    C is a statically typed language. We need to declare a variable first;
    only then are we allowed to assign to it. The C compiler understands
    the type of the data and assigns suitable storage space only when we
    declare it.
    int a, b = 2;
    char c = 'h';
    a = 10;

2. Array Indexing
    The mechanism of array indexing in C is as follows: an element's
    address is given by 'base address + offset'.
    Therefore, however large the array is, the C compiler only needs to add
    the offset to the base address to access the required array location.

    In Python, the situation is different. The interpreter is obliged to
    follow 2 rules:
    i. check whether the current array index is out of bounds
    ii. access that index location

    In short, it is even possible to go out of bounds in C; garbage values
    at those locations will simply be returned. That's all. But Python is
    very conservative: you specify an array length, you had better stay
    inside it. No bending the rules (see the sketch below).

    Finally, the outcome. As you can guess, array indexing in C is blazing
    fast. If we work on the same array in Python, we may have to wait,
    maybe really long.
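
    A tiny sketch of that bound check:

    a = [10, 20, 30]
    print(a[2])          # fine: the last valid index
    try:
        print(a[3])      # out of bounds: Python refuses and raises IndexError
    except IndexError as e:
        print(e)         # "list index out of range"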

3. Dynamic Allocations
    At the outset, I take the liberty of assuming you know what dynamic
    allocation is.

    It is very troublesome in C. Simply speaking, if you do a malloc(),
    then reuse the same pointer for another malloc() without saving or
    freeing the first block, you lose track of that block forever. You
    can't use it anymore. So be wary; that's all you can do.

    For this matter, Python is awesome. It takes care of such
    unhandled allocations of memory automatically. This is called garbage
    collection, or automatic memory management. We can say that each
    allocated block has a reference count, which holds the number of
    references currently pointing to that particular block. This count is
    incremented and decremented automatically and accurately (see the
    sketch below). There is one consequence, though: the running time
    becomes somewhat non-deterministic, because we can't predict when and
    how Python attempts this.
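
    The reference count mentioned above can even be peeked at with
    sys.getrefcount (a sketch; the exact numbers are implementation
    details, and the call itself adds one temporary reference):

    import sys

    block = [1, 2, 3]
    print(sys.getrefcount(block))   # at least 2: 'block' plus the call's own reference

    alias = block                   # a second name now refers to the same list
    print(sys.getrefcount(block))   # one higher than before

    del alias                       # dropping a name decrements the count again
    print(sys.getrefcount(block))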

4. Indentation
    In short, indentation is compulsory in Python. The tutorial's
    'Intermezzo: Coding Style' proposes 4 spaces, while the Google coders
    are seen using only 2.

    In C, it is just the other way around. There, indentation is only a
    matter of readability; the C compiler doesn't care.
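
    For instance, the interpreter enforces the layout itself (a small sketch):

    def greet(name):
        return "hello " + name    # 4-space indentation, as the style guide suggests

    print(greet("world"))

    # Stripping the indentation from the 'return' line would not merely look
    # untidy: Python would reject the file with an IndentationError.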

5. Functions
    In Python, all functions, both built-in and user-defined, are first
    class. A function is said to be first class if it can be treated just
    like a variable. This means that, in Python, functions can be passed as
    parameters, and returned too. Cool, eh? If so, how about knowing
    that this concept extends to creating higher-order functions?
    These are functions which operate on other functions, e.g. integration,
    differentiation. Three built-in higher-order functions in Python are
    map, filter and reduce (see the sketch below).

    In C, it's just the opposite. Functions are functions, variables are
    variables. No mixing up.
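
    A small sketch of the three built-ins mentioned above, plus a function
    being returned (Python 2 shown, where map and filter give plain lists):

    def square(x):
        return x * x

    nums = [1, 2, 3, 4, 5]

    print(map(square, nums))                 # [1, 4, 9, 16, 25]
    print(filter(lambda x: x % 2, nums))     # [1, 3, 5]: keep the odd numbers
    print(reduce(lambda a, b: a + b, nums))  # 15: fold the list into a sum

    def make_adder(n):         # builds and returns a new function
        def add(x):
            return x + n
        return add

    add5 = make_adder(5)
    print(add5(10))            # 15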

6. Pointers
    They are the most awesome weapon in C. You can do powerful things
    with them that can only be dreamed of, or are very hard to
    accomplish, in other languages. But as powerful as they are, be very
    careful with the memory manipulation you do through them, or else
    you are a goner.

    They are not implemented as such in Python. But some similarity can be
    felt in some operations that can be performed. For a brief idea, refer
    to "Pointers" in Python.

7. Assignments in Expressions
    In C it's possible to write:
    int a, b = 1;
    if ((a = 12) > b)
            printf(" %d \n", a + b);

    Well, it's not possible in Python: assignment cannot appear inside an
    expression there; only tests such as logical equality (==) can.
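
    In Python the same logic has to be spelled out as separate statements;
    a quick sketch:

    b = 1
    a = 12            # assignment is a statement of its own, not an expression
    if a > b:
        print(a + b)  # 13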

Saturday, September 4, 2010

The Shebang Line

Well, this was an Easter Egg for me, so I will directly move on to what I
mean, without keeping you bewildered.

In computing, a shebang refers to the character sequence '#!'. If it is used
in a script, it must come at the very beginning of the first line. Hence the
common term shebang line.

The shebang line contains the absolute path of the interpreter binary for
the particular script you are running. On running executable script files,
the shebang will be automatically looked for and used, resulting in a 'smart
script', or so we can say.

The shebang line for Python scripts is:

#! /usr/bin/python

As you might be wondering, this weird name comes from an inexact
combination of sharp (#) and bang (!), which are the Unix names of
the two characters. The other names are hashbang, hashpling,
poundbang, and crunchbang. Don't ask me... !!!

Interestingly, the shebang was introduced by Dennis Ritchie at Bell
Laboratories (Edition 8 of Research Unix). It was then also added to BSD
releases from Berkeley's Computer Systems Research Group.

If the absolute path of the interpreter binary is the same on all the
machines the script is run on, then portability is achieved too.
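
For example, a minimal 'smart script' (file name hypothetical) could look like
this; after a chmod +x hello.py it can be launched directly as ./hello.py:

#! /usr/bin/python
# hello.py -- the kernel reads the first line and hands this file to the
# interpreter named there

print("hello from a smart script")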

More Easter Eggs:
1. When portability issues are encountered, the shebang line
    #! /usr/bin/env python
   may provide a solution.
2. The shebang actually represents a magic number in the executable
    file. This magic string is 0x23 0x21. Executable files that do not require
    an interpreter start with other magic combinations.
3. The shebang line for Perl is,
    #! /usr/bin/perl

Wednesday, September 1, 2010

Python Tutorial, Release 2.7: Abridged (heavily!)

This tutorial was originally written by Guido van Rossum and edited by
Fred L. Drake, Jr. It is an excellent one, excluding some specific portions.

I felt the urge to keep a record of all the key points regarding Python that
I grasped from this tutorial. So here goes...

1. The Python Interpreter
    Interactive mode: Ctrl + D, primary prompt, secondary prompt
    Non-interactive mode: 'python' command, sys.argv[ ]
    /usr/lib/python2.6 or /usr/lib/python3.1
2. Executable Python Script
    #! /usr/bin/python
    $ chmod +x 
3. The  Startup File
    PYTHONSTARTUP environment variable 
    os.path.isfile(), os.environ.get(), execfile(), .join()
    sys.ps1, sys.ps2
4. Basics 
    #, numbers, docstrings (''' or """),
    sequence data types: str, unicode, list, tuple, buffer, xrange
    string (effects of ' and "), list  (don't ever forget slicing!)
    Mechanism of slicing: (figure omitted; courtesy: Python Tutorial, Release 2.7)
    be wary of short circuit nature (e.g. and)
    if, elif, else, for in , range, break, continue, pass
5. Functions In Python
    def <function_name>(<argument(s)>): body
    call by value, first class, scope
    built-in higher order functions: map, filter, reduce, etc
    default argument values (evaluated only once!)
    keyword arguments: function (arg1, *arg2, **arg3)
                   *arg2  -> receives a tuple, **arg3 -> receives a dictionary
                    positional arguments before keyword arguments
    unpacking argument lists: *list_object or **dictionary_object
    lambda
6. Intermezzo Coding Style
    a, b = 10, 20 is valid (swapping is unimaginably easy!) 
    4-space indentation
    blanklines, comments, docstrings 
    spacing: a = f(1, 2) + g(3, 4) 
    self as name for first method argument
7. Functional Programming Tools
    list comprehension: >>> vec = [2, 4, 6]
                                        >>> [3*x for x in vec if x > 3]
                                        [12, 18]
    del (for lists), append() and pop() (lists as stacks), collections.deque (lists as queues)
    set (use it to eliminate duplicates on the fly !, thanks to Sumith) 
    list methods (.append(), .extend()); dictionaries (.keys(), .values(), .items())
8. Modules 
    import <module_name>, 
    from <module> import <method>
    from <module> import * (not advised)
    the [if __name__ == "__main__":] clause
    PYTHONPATH, sys.path, dir() function, packages 
9. Input and Output
    regular expressions (regex)
    sys.stdin, sys.stdout
    repr(), str(), str.format()
    open(), close(), read(), readline(), readlines(), seek(), write()
    pickling: pickle.dump(), pickle.load()  
10. Error Handling
    try: ... except: ..., try: ... except: ... else: ..., try: except: finally: ...,
    try clause: if functions present, tries inside it too
    else clause: to prevent except from catching the error caused
                          by some other piece of code
    finally clause: always executed
                   1. on leaving the try: except:, 
                   2. when error raised by try: not handled by any except:
                   3. when there is an error in the except: itself       
    class MyClass (Exception): pass
    raise Exception (arguments)
    except Exception as inst: (arguments stored in instance.args),
    with open(<filename>, <mode>) as <file_object>:
11. Classes
     class <class_name>[(<baseclass(es)>)]:pass
     Python classes are objects
     both class and its objects are concrete
     no sense in concept of 'declaration'
     each function attribute has at least one argument named 'self'
                                                       by convention 
     __init__, __str__, __eq__, __iter__
     isinstance()
12. Mechanism of for
     this idea can be applied to any iterable in Python
     (figure omitted; courtesy: Python Tutorial, Release 2.7)

13. Concept of Generators
     yield -> return the value, then freeze there until next()
     (figure omitted; courtesy: Python Tutorial, Release 2.7)
     "genex" -> generator expression
                            (e.g. write a list comprehension,
                                then replace [] with ())
                            >>> sum(i * i for i in range(0, 10))
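
As a quick illustration of items 12 and 13 (a sketch, not taken from the
tutorial itself):

def squares(n):
    for i in range(n):
        yield i * i          # hand back a value, then freeze here until next()

gen = squares(4)
print(next(gen))             # 0
print(next(gen))             # 1

for value in squares(4):     # the for statement drives the same protocol
    print(value)             # 0, 1, 4, 9

print(sum(i * i for i in range(0, 10)))   # 285, via a generator expression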

Monday, August 30, 2010

"Pointers" in Python

Firstly, I want you to understand that the term "pointers" used here is a
misnomer.

In Python, there is no particular data type called a pointer, as there is in
C. Even so, some operations that we can perform in Python are very similar
to the behaviour of a pointer.

OK, I've been going on about pointers; let me tell you what this is about.

Visualize that a new data object you create has to be stored at some memory
location. This memory location has an address.
In C, a pointer can be defined whose value is such a memory
address. Any valid operation done through this pointer will be
automatically reflected in the data present at that address.

Now lets see, where this happens in Python.

Let's create a list and copy it.
>>> a = [1,2,3]
>>> b = a
Now let's change an element of a.
>>> a[1] = 10
>>> a
[1, 10, 3]
Now check the value of b.
>>> b
[1, 10, 3]

We can see that b has also changed. How did this happen? 

When the list a was created, it was stored at a memory location, say 2000,
and the value 2000 was put in a. When a was assigned to b, it meant that
b now also contains the value 2000. Then we changed the second element of
a, i.e. the data at address 2000. Therefore, if we now print the value of
a or b, we get only the changed list, since both a and b are actually
"pointers" to the memory address 2000.

N. B.
1. Be careful with list assignments!
2. What I discussed above does not hold for slicing operations on a list.
    When any list is sliced, a copy of it is first created at another memory
    location, and then the operations are performed. Thus, the changes won't
    be reflected back in the original list (a short sketch follows).
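
A short sketch tying both notes together (the id() values themselves will
differ on every run):

a = [1, 2, 3]
b = a             # b is just another name for the same list object
c = a[:]          # slicing copies the list into a brand-new object first

print(id(a) == id(b))   # True: one object, two names
print(id(a) == id(c))   # False: the copy lives at a different address

a[1] = 10
print(b)          # [1, 10, 3] -- the change is visible through b as well
print(c)          # [1, 2, 3]  -- the sliced copy is unaffected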