If you want to start an argument between two financial data scientists, ask them which coding language they prefer to use: R or Python? If they have a difference of opinion, then a heated and emotional debate will inevitably follow. But who is right?
R is no longer the most used language for data science
Traditionally R was more common in the data science community, due to its popularity on university campuses. Many neophyte data scientists will therefore have already used R during their MSc or PhD. Indeed, I first used the language myself when completing my Masters dissertation.
However, Python has now caught up. In 2016 R was the most popular language amongst data scientists, with Python coming a close second. By 2018 their positions had been reversed, with two thirds of survey respondents preferring Python versus 49% for R.
R always had the best packages
Historically R has had a wider variety of packages for statistical analysis and visualisation. Popular libraries include dplyr, zoo and ggplot2; and there are dozens more. Python has been slow to catch up, but there are now plenty of available packages for budding data scientists, such as pandas, scipy, and matplotlib.
It's easy to code badly in R
R is widely cited as being difficult to learn if you are used to more mainstream languages. Also, in my experience it is very easy to write bad code in R, and somewhat easier to write good Python. Object-oriented programming (OOP) in R is particularly ugly. The OOP in R is bolted on as an afterthought, rather than being an integral part of the language as in Python.
Both R and Python are dynamically typed languages. This makes them very flexible, but also potentially error-prone. However, the weak typing in R is particularly dangerous. R functions have a nasty habit of returning unexpected types of object, and are then too relaxed about accepting the wrong type as an argument. This makes code difficult to debug, as the program will often crash thousands of lines after the actual error has occurred.
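One common defence against this kind of delayed failure, in either language, is to check types at the point where data enters a function, so that the error is raised where the mistake was made. The following is only a minimal Python sketch of that idea; the function name `mean_return` and its checks are hypothetical and chosen purely for illustration.

```python
from collections.abc import Sequence


def mean_return(returns):
    """Average a sequence of returns, rejecting bad input up front."""
    # Fail fast: a string is technically a sequence, so exclude it explicitly.
    if isinstance(returns, (str, bytes)) or not isinstance(returns, Sequence):
        raise TypeError(
            f"expected a sequence of numbers, got {type(returns).__name__}"
        )
    if len(returns) == 0:
        raise ValueError("returns must not be empty")
    for value in returns:
        # bool is a subclass of int in Python, so exclude it explicitly too.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"non-numeric return: {value!r}")
    return sum(returns) / len(returns)


print(mean_return([1.0, 3.0]))

try:
    # A number that arrived as text is caught here, not thousands of lines later.
    mean_return([1.0, "3.0"])
except TypeError as error:
    print(f"rejected: {error}")
```

The checks cost a little time on every call, so in practice they tend to be placed at the boundaries of a system (data loading, user input) rather than inside every inner loop.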
R is slower than Python
Java programmers are always sneering about how slow Python is. But Python is still significantly faster than R, by roughly a factor of four. Both languages can be sped up to a degree by embedding C or C++ code, but the interface for doing this in R is much clunkier than for Python.
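The benefit of pushing work into compiled code can be seen without writing any C at all, because in the standard CPython interpreter the built-in `sum` is itself implemented in C. The sketch below times it against an equivalent interpreted loop; `python_loop_sum` is a hypothetical helper written for this comparison, and the timings will vary by machine and Python version.

```python
import timeit


def python_loop_sum(values):
    """Sum values with an explicit loop run by the interpreter."""
    total = 0
    for value in values:
        total += value
    return total


values = list(range(100_000))

# Both approaches must agree before the timings mean anything.
assert python_loop_sum(values) == sum(values)

# Run each version 20 times and report the total elapsed time.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

print(f"interpreted loop: {loop_time:.4f}s")
print(f"built-in sum:     {builtin_time:.4f}s")
```

This says nothing about R versus Python. It only illustrates why moving a hot loop into C or C++ is a popular optimisation in both languages.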
Python is better for trading systems
Being able to use the same language in research and production environments is a major advantage for rapid deployment. I have used R for automated live trading systems in the past, but I would not do so again. The memory management in R is poor, and the typing issues mentioned above can lead to weird errors that are hard to debug. Python is easily up to the job of running live trading strategies, as long as latency is not critical (in which case C++ or Java might be better options). It does have some well-known issues, such as the Global Interpreter Lock, but in general it is a pretty robust platform for running production code.
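For readers who have not met the Global Interpreter Lock (GIL): in standard CPython builds only one thread executes Python bytecode at a time, so splitting CPU-bound work across threads usually gives little or no speed-up. The sketch below is a rough illustration under that assumption; `count_down` is a hypothetical busy-work function, and the exact timings depend on the machine and interpreter.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound busy work with no I/O, so threads compete for the GIL."""
    while n > 0:
        n -= 1
    return n


N = 2_000_000

# Run the work twice, one call after the other.
start = time.perf_counter()
count_down(N)
count_down(N)
serial = time.perf_counter() - start

# Run the same two calls in two threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(count_down, [N, N]))
threaded = time.perf_counter() - start

print(f"serial:   {serial:.3f}s")
print(f"threaded: {threaded:.3f}s")
```

Threads remain useful for I/O-bound tasks such as waiting on a network connection. For CPU-bound work, separate processes or compiled extensions are the usual workarounds.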
For pure data science R still has a slight edge over Python, although the gap has closed significantly. Nevertheless, the wider applications of Python make it the better all-round choice. If you’re at the start of your career then learning Python will also give you more options in the future.
Robert Carver is the former head of fixed income at quantitative hedge fund AHL. He began using R in 2005, and switched to Python in 2011. Robert is the author of 'Systematic Trading' and 'Smart Portfolios'.