PyPy's number crunching optimization : just-in-time superword parallelism

Plangger, Richard

doi:10.34726/hss.2015.29423

Record link:

https://doi.org/10.34726/hss.2015.29423
http://hdl.handle.net/20.500.12708/4101

Title:

PyPy's number crunching optimization : just-in-time superword parallelism

Citation:

Plangger, R. (2015). PyPy’s number crunching optimization : just-in-time superword parallelism [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.29423

reposiTUm DOI:

10.34726/hss.2015.29423

CatalogPlus:

AC12685752

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Plangger, Richard

Advisor:

Krall, Andreas

Organisational Unit:

E185 - Institut für Computersprachen

Date (published):

2015

Number of Pages:

Keywords:

SIMD; Vectorization; PyPy; Python; Trace; SSE; Optimization; Compiler; JIT; Just-In-Time; Virtual Machine; VM; Scripting language; NumPy

Abstract:

PyPy ist eine weitbekannte virtuelle Machine für die Programmiersprache Python. Anstatt ausschließlich den Quellcode im Bytecode Format zu interpretieren, stellt PyPy einen Just-In-Time JIT Compiler zur Verfügung. Außergewöhnlich ist auch die Implementierungssprache RPython der virtuellen Maschine (VM), ein statisch getyptes Subset von Python. Sowohl der Garbage Collector als auch der JIT compiler können zu jeder VM, die in RPython geschreiben ist, generiert werden. Das macht PyPy sowohl zu einer attraktiven Platform für dynamische, funktionale und logikorientierte Sprachen als auch für Instructionset Simulatoren. In den letzten Jahren wurden Befehlsätze wie SIMD (Single Instruction Multiple Data) in Prozessoren eingebaut. Diese sind nicht nur für multimedia Applikationen nützlich, sondern auch für wissenschaftliche Berechnungen und Simulationen. In der Theorie bieten diese Instruktionen einen Geschwindigkeitsvorteil, der sich linear zur Größe des Vektorregisters verhält. Operationen auf einfachen Fließkommazahlen (32-bit) können dadurch im besten Fall (128-bit große Vector Register) vier mal so schnell ausgeführt werden. Da Python einfach zu erlernen ist und sich ideal für Datenverarbeitung eignet, wurden nach und nach Bibliotheken gebaut, die wissenschaftliche Berechnungen erleichtern und beschleunigen. NumPy ist ein Beispiel einer weit verwendeten Bibliothek. Während der Ausführung einer numerischen Berechnung müssen viele Schritte abgearbeitet werden. Dieser interpretative Mehraufwand verlängert die numerische Berechung unnötig. Um diese extra Schritte zu beseitigen wird die Berechnung in einer statisch getypten System-Programmiersprache geschrieben. Das Program wird vor der Ausführung übersetzt und während der Laufzeit der VM aufgerufen. In dieser Diplomarbeit wird der neue Auto-Vektorisier von PyPy vorgestellt, der Programfragmente automatisch transformiert um SIMD Instruktionen nützen. Die Optimierung ist weder an die NumPy Bibliothek gebunden, noch an eine spezifische Hardwareplattform. Außerdem kann jede VM, die in RPython geschrieben ist, diese Optimierung nützen. Eine empirische Evaulation zeigt wie gut der Vektorisierer auf der x86 Hardwarearchitektur Programgfragmente transformieren kann. Die Ergebnisse zeigen, dass es möglich ist, SIMD Instruktionen auch in einer Dynamischen Sprache zu nützen um Berechnungen zu beschleunigen. Die Komplexität der Transformation ist handhabbar und schnell genug um während der Laufzeit Schleifen zu optimieren.

PyPy is a widely known virtual machine for the Python programming language. Opposed to the standard implementation (CPython), it includes a tracing just-in-time (JIT) compiler. The implementation language is a statically typed subset of Python called Restricted Python (RPython). RPython is an abstraction for byte code interpreters and is able to automatically generate a garbage collector and a tracing JIT compiler. Thus it is not only used for PyPy but, many other dynamic/functional/logic oriented interpreters or instruction set simulators. In the last decade new Single Instruction Multiple Data (SIMD) instruction sets where built into processors to speed up multimedia applications. They are not only useful for multimedia applications but also for scientific applications. The speedup is linearly correlated to the size of the vector register. In theory, given a single precision floating point (32-bit) operation in a loop, in the best case the loop executes the body four times faster using 128-bit vector registers. Recent developments in scientific computing have drawn attention to libraries (e.g NumPy) for numerical computations. NumPy and others currently remove the interpreter overhead of numerical computations by writing the critical routine in a low level language. They are compiled to the host computers architecture ahead of time. At runtime the language interpreter invokes the foreign function compiled earlier. Since NumPy is a commonly used library, PyPy rewrote parts of NumPy and included it in the standard library. This setup renders most of the critical loops as normal program loops instead of foreign functions. In this thesis the new auto vectorizer built in to the RPython optimizing backend is presented. It uses only the linear sequence of instructions of a trace to find parallelism. Dependency information is gathered and used to reschedule some of the instructions as vector statements. This optimization is neither tailored for the NumPy library nor for a specific hardware architecture. Every interpreter written in RPython can benefit from the new optimization. To empirically evaluate the optimization, the x86 assembler backend has been extended to emit SSE4 vector instructions for the optimized traces. The evaluation shows that it is indeed possible to leverage the speed gain SIMD instruction sets offer and that vectorizing traces at runtime is a feasible optimization technique. This does not only prove to work on NumPy traces, but also for Python loops. The implementation is not very complex and the optimization time is reasonably fast.

Additional information:

Zusammenfassung in deutscher Sprache

License:

In Copyright

Appears in Collections:

Thesis