Is Python really the best language for data science in finance?
Programming languages… How many of us haven’t witnessed debates on the advantages of one programming language over another? These debates are at least as common as those on the relative merits of Emacs versus Vim or tabs versus spaces (the author has even witnessed a physical fight that tried – but failed – to resolve this age-old question).
Still, the question, “Which programming language shall I use?” is not just about aesthetics. Make a bad choice, and it will come back to haunt you at later stages of the project.
Many programmers, especially the very smart ones, having sampled the programming languages created by humanity to date, may come to the conclusion that none of them suits their needs. They then decide to embark on an adventure: write a new programming language. These geniuses often hide from themselves the true reason for doing so: writing a programming language is fun. Programming languages are usually initiated by individuals: APL by Kenneth E. Iverson, C by Dennis Ritchie, C++ by Bjarne Stroustrup, Java by James Gosling, kdb+/q by Arthur Whitney, LISP by John McCarthy, Perl by Larry Wall, Python by Guido van Rossum… And yet much of their success is determined by concerted efforts of the respective programming communities.
We now live in the age of data science and machine learning. The data scientist’s primary goal is to discover hidden relationships in a dataset – a collection of observations or readings – be it stock prices, medical records, or lists of insurance claims. Speed of development and convenience are of the essence. Python’s syntax is very terse (just think of list comprehensions!), yet natural and readable. It’s hardly surprising that Python is often the data scientist’s weapon of choice.
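To make the point about list comprehensions concrete, here is a minimal sketch. The prices are invented purely for illustration, and the variable names are our own; the point is only that a loop and a comprehension express the same computation.

```python
# Hypothetical closing prices, invented purely for illustration.
prices = [100.0, 102.0, 101.0, 105.0]

# Explicit loop: simple return between each pair of consecutive prices.
returns_loop = []
for previous, current in zip(prices, prices[1:]):
    returns_loop.append(current / previous - 1.0)

# The same computation written as a single list comprehension.
returns = [current / previous - 1.0
           for previous, current in zip(prices, prices[1:])]

# An optional filter clause keeps only the days on which the price rose.
up_days = [r for r in returns if r > 0]
```

The comprehension reads almost like the sentence that describes it, which is much of what makes Python attractive for exploratory work.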
Many machine learning algorithms are easy to use but difficult to implement. It would be naïve (and wasteful) for the data scientist to implement them themselves: some things are best left to experts. Usually these algorithms come packaged in reusable libraries. Python is known for the abundance of excellent libraries backed by large communities of programmers: NumPy for dealing with multidimensional arrays, SciPy for linear algebra and scientific computing, Matplotlib for visualisation, Pandas for time series data (and most of the data in finance comes in the form of time series), Keras for neural networks, to name but a few. In data science Python has few competitors except, perhaps, R, which is known for its excellent statistical libraries.
Software engineers (rather than data scientists), who develop large, robust, industrial-grade software systems, will probably exclaim at this stage: but Python is slow and unsafe! Slow, because the reference implementation, CPython, interprets bytecode rather than compiling it to machine code, and because its Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode at once. Unsafe, because Python is dynamically, rather than statically, typed, and lacks the compile-time type checks that prevent users from running nonsensical code – the type checks that would be afforded by the stipulation of data types in function signatures. In Python, you can pass just about anything to a function: the code will run as long as the object passed to the function supports all the methods and attributes expected of it at run time. This laissez-faire approach is known as duck typing in honour of a phrase attributed to the Indiana poet James Whitcomb Riley: “When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”
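Duck typing is easiest to appreciate in a few lines of code. The sketch below is ours, not taken from any real system: the classes `Duck`, `Robot`, and `Dog` and the helper functions are hypothetical names chosen for illustration. It shows that an unrelated class is accepted as long as it has the expected method, that a class lacking the method fails only when the call is actually made, and that optional type hints (here via `typing.Protocol`) let an external static checker such as mypy catch the mistake earlier, although Python itself ignores them at run time.

```python
from typing import Protocol


class Duck:
    def quack(self) -> str:
        return "Quack!"


class Robot:
    # Unrelated to Duck by inheritance, but it offers a compatible quack().
    def quack(self) -> str:
        return "Beep-quack!"


class Dog:
    # No quack() method at all.
    def bark(self) -> str:
        return "Woof!"


def make_it_quack(bird):
    # No declared parameter type: any object with a quack() method will do.
    return bird.quack()


class Quacker(Protocol):
    # Structural type: "anything that has a quack() method returning str".
    def quack(self) -> str: ...


def make_it_quack_checked(bird: Quacker) -> str:
    # The annotation is not enforced at run time; a static checker can use it
    # to reject make_it_quack_checked(Dog()) before the program is run.
    return bird.quack()


def try_quack(obj) -> str:
    # Demonstrates that the error surfaces only when the call is attempted.
    try:
        return make_it_quack(obj)
    except AttributeError as error:
        return f"failed at run time: {error}"
```

Calling `try_quack(Duck())` and `try_quack(Robot())` both succeed, while `try_quack(Dog())` reports an `AttributeError`: nothing stopped us from passing a `Dog`, and the problem came to light only during execution.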
Pythonistas will reply: true, but Python is a perfect language for writing wrappers around libraries written in other, safer and more performant languages, so it is often used as a kind of programming “glue”.
Indeed, Python’s strengths are, dialectically, also its weaknesses. The obsessive type-safety of members of the C-family programming languages, such as C++, Java, and C#, makes them more cumbersome for data science and quick prototyping, but makes it easier to write boringly robust (and sometimes even beautiful) systems that function well under stress in production.
Nothing beats the speed of C++ (apart from perhaps raw C, which is even closer to the metal), but its speed comes at a price: the need for complex, labour-intensive debugging of memory management. While the author himself is a C++ programmer, he would probably choose Java or C# when not writing a low-latency trading system.
In our trading systems, we usually use Python for data science. The models are prototyped, calibrated, and tested in Python; the results are then passed on to a production system, which is implemented in Java. This division of labour between Python or R and a C-family language is common among quant teams. The creators of the Julia programming language are attempting to combine the merits of Python/R for data science and prototyping with those of Java-like languages for production. This is a noble and challenging effort, and we are watching it with interest.
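As a rough illustration of this kind of hand-off, and not a description of our actual pipeline, the sketch below calibrates a toy linear model in plain Python and serialises its parameters to JSON, a language-neutral format that a Java service could parse. The data, the `fit_line` helper, and the field names in the payload are all hypothetical.

```python
import json

# Hypothetical observations, invented for illustration: y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Calibration" step, done on the Python side.
slope, intercept = fit_line(xs, ys)

# Serialise the calibrated parameters; the field names are assumptions
# that the (hypothetical) production system would have to agree on.
payload = json.dumps({"model": "linear", "slope": slope, "intercept": intercept})
```

The production side then needs only the agreed field names and a JSON parser; it never has to run, or even know about, the Python code that produced the numbers.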
There are other programming languages that we think a good data scientist should know. One of them is kdb+/q. To be more precise, q is the programming language and kdb+ is the column-oriented time-series database that it is built into. kdb+/q is irreplaceable when a data scientist is dealing with huge – tens of millions of rows upwards – datasets, and needs to make sense of them quickly. It is also used to power data captures in environments where data arrives in real time, such as algorithmic trading.
There are practical considerations to take into account when choosing a programming language, not only aesthetics. And while it takes relatively little time to learn the syntax of the language, time and exercise are required to become fluent in it. In this sense, programming is a bit like playing chess: it is easy to learn the rules of the game, but difficult to become a master. Until then, your best bet is to learn Python, and to keep repeating: “The rain in Spain stays mainly in the plain” – or hope for a miracle.