Wednesday, September 29, 2010

Elementary Analysis of TinyPython Virtual Machine v1.1

OVERVIEW
TinyPython is a minimalist implementation of Python in 64K
code, originally created by Phil Hassey. He blogs at:

It is a parser and byte-code compiler written in
TinyPython itself. It is also fully bootstrapped in the sense that
initially, TinyPy converts a Python script (.py) into a special
TinyPy byte-code format (.tpc), and this generated code is then
passed into a subset of the TinyPython source code called the
Virtual Machine, where the actual execution takes place.

One can even extend the idea that if the VM is compiled into a
low-level format adaptable to a particular microcontroller,
then the VM will reside inside that chip, and any .tpc file can
be downloaded into the chip as its input.

A basic analysis of source code for TinyPy v1.1, excluding the
byte-code compilation phase, was performed. The focus was
on the working of the TinyPy Virtual Machine, which actually
executes the byte-code.

The TinyPython source code used was downloaded from:

BUILDING TINYPY
Initially, the file listing will look like the following figure:
Fig 1. TinyPy source file contents

   Initially, the 'build' folder will be empty. The 'doc',
'examples' and 'modules' folder may or may not contain
any documents, Python scripts and batteries (or modules)
respectively, depending on the downloaded package.
The LICENSE.txt, README.txt, CHANGES.txt and
ROADMAP.txt describe the essential details of TinyPy.
The 'setup.py' contain the initial code to build the TinyPy
from scratch. At the terminal, type as follows:
    python setup.py linux

It is implied that you need a Python interpreter available in 
your system. The 'linux' option is to specify that the code will
be compiled so as to make it work in a Linux environment. 

Now, a new executable 'tinypy' will appear in the 'build' folder.
To fully bootstrap and test TinyPy, give the command as:
    python setup.py linux boot

Now, in the 'tinypy' folder shown above, two new 
executables, 'tinypy' and 'vm' will appear, which are the 
TinyPy parser, and Virtual Machine respectively. It can be 
noticed that new .pyc files for the corresponding .py 
files have also been generated by the Python interpreter.

The general usage and some options available are listed below:
    python setup.py command [option] [module]

64 k - build a a 64k version of TinyPy source code
blob - build a single 'tinypy.c' and 'tinypy.h'



The 'py2bc.py' script is used to convert a user-generated 
Python script into its corresponding .tpc file.
    python py2bc.py sample.py sample.tpc


Here, tinypy_path is the path (relative to current position) 
of either the tinypy executable in the 'build' folder, or the one
in the 'tinypy' folder. 'sample.py' is the name of the user-script.
'sample.tpc' is the name given for the byte-code converted file.


Finally, the generated byte-code (.tpc) is to be passed into the
VM for compilation and execution. Assuming the current 
directory as 'tinypy' folder, it is done as:
    vm sample.tpc

Or logically,

    gcc vmmain.c -lm ./a.out sample.tpc

The 'vmmain.c', present in the 'tinypy' folder, is the main 

function of the VM which runs and automatically links all the
other files necessary for the VM. And now the output is
obtained and displayed. For a better picture, the files actually
needed for VM are:
Fig 2. The files needed for the VM

For debugging and understanding the flow of control within
the source code, the 'gdb' and 'ctags' tools were used.
ANALYZING THE VM
A sample program 'sample.py':
    def add(a, b):
        c = a + b
        return(c)
    print(add(1, 9))
gives you the output:
      10

Each time the VM is invoked, a new instance of the VM is
created .
    tp_vm *tp = tp_init(argc, argv);

"Each object in tinypy is stored as a union in the C API."
tp_obj is tinypy's object representation.
typedef union tp_obj
{
    int type;
    tp_number_ number;
    struct{int type;int *data;}gci;
    tp_string_ string;
    tp_dict_ dict;
    tp_list_ list;
    tp_fnc_ fnc;
    tp_data_ data;
}tp_obj;

The field 'type' of the union indicates the type, of the object.
A type value of 1 indicates that a number is being used,
type = 2 indicates object is a string and type = 4 indicates a list.

In the sample program, a function in the VM 'tp_add' does the
job of adding two arguments (numbers, strings, lists) that are
passed to the function as input.
     tp_obj tp_add(TP, tp_obj a, tp_obj b)

The function definition of 'tp_add' contains three arguments:
   • TP -> the VM instance
   • tp_obj a -> the union variable representing the object 'a' in
                        the sample program
   • tp_obj b -> the union variable representing the object 'b' in
                        the sample program

'a' and 'b' are stored as unions and contains fields for types
such as number, strings, lists, etc.

Consider two simple lists in Python:
    a = [1,2,3]
    b = [4,5,6]

The VM uses a function 'tp_extend' which returns a union that
contains the new (extended) list. Its 'type' would be 4 indicating
a list. The 'list' structure variable includes a pointer *val that
points to another structure '_tp_list'.
    typedef struct _tp_list
    {
        int gci;
        tp_obj *items;
        int len;
        int alloc;
    }_tp_list;

The pointer *items as you could see is of type tp_obj and
de-referencing it would give you the union tp_obj. This
union contains a single element of the final list.

To obtain the next element of the final list:
    r.list.val -> items

would give you the address of the union.
Union containing the next element of the list = Address of 
the current union + size of each union.
   Here:
    r.list.val -> items = 0x96ca048 + 10

would give you the union that stores the next element of the
list . The size of the union is 10, in hexadecimal = 16 bytes.
Fig 3. The representation of a list

THE PATH OF 'tp_add'
Fig 4. The backtraced path of 'tp_add'

   While debugging, our sample program and one of the 
input data too, can be spotted as:
Fig 5. Spotting the sample program
Fig 6. Spotting one of the inputs

   The switch() inside 'vm.c' is invoked nine times, one of
which will be the selection of case: 'TP_IADD' and 
subsequent execution of 'tp_add'.
Fig 7. The case: 'TP_IADD'

Another trial debugging was done with input file 'testing.tpc', 
whose source script was 'testing.py' which just added two 
numbers. Only the filename 'testing.tpc' was given to the 
debugger.

The VM snapshot taken while debugging, showing the 
name of the source script 'testing.py' and the function used
'print(a + b)' stored in the fields of the VM data structures,
and the conclusion to this analysis, and more details can be
found in:

1 comment: