
Data science on Windows: things to consider

I somehow got pulled into a heated discussion on LinkedIn on whether the choice of operating system affects your ability to do data science. Predictably, it became a Linux vs Microsoft Windows debate.

First, some background - so you don't think this is some uninformed Microsoft bashing:

I definitely have more experience with Windows than I have with Linux. I spent 15+ years using, and developing for, various flavors of Windows and Microsoft dev and power tools. I've written desktop apps, server software, embedded applications, and various server plugins for IIS, Exchange, and Media Server. At my previous companies, we even co-developed tools with Microsoft. I've installed, configured, optimized, secured, and diagnosed hundreds of Windows machines (possibly every version of the OS that Microsoft ever rolled out). Our servers were either FreeBSD or Windows. I did run Linux on my PC from the Slackware days, but that would typically be a dual-boot system, and Linux wasn't the first menu item.

So, Windows was my primary OS. I became very good at fixing things, my Windows machines always ran well, and I loved that OS - or so I thought.

Look familiar?

Eventually, Linux became more mature, and I found myself using it more and more for dev and server admin tasks. Having an OS that "just works", and being able to take stability and uptime for granted, was incredibly liberating. Productivity went through the roof.

And, once you have fully experienced the two worlds - that's when you get mad. Mad, because of all the countless hours (years) you and your colleagues spent learning and "mastering" absolutely meaningless, ridiculous tasks just to make things "Okay":

  • doing "clean" re-installs (it's the only way to be sure)
  • Registry hacking
  • hardening everything, always, because the default options would get you owned
  • rebooting
  • deleting various temp files
  • removing apps and utils that came with the OS
  • rebooting
  • MSDN and product keys
  • killing processes in Task Manager
  • staring at Perf Mon to get a sense of what's going on
  • PowerShell
  • progress bars
  • OS activation
  • VBScript (plus an occasional GUI in VB to track the killer's IP)

Now, if you are the designated "computer guy" among your friends and family, don't forget the hundreds of hours you probably spent making Malwarebytes bootable media, and also attempting to remove browser toolbars, or "programs" that refuse to uninstall.

... add to that, if you ran a sizeable operation, the hundreds of thousands of dollars you spent on various OS and related software licenses.

So - whenever I see aspiring data scientists wondering whether their choice of operating system matters, I feel it is my duty to point out the following:

  1. System stability matters more than you think - an application that hangs mid-process can be a real productivity killer. Linux will be respectful of your time.
  2. Before you get to any real data science "tools", quick data manipulation and pre-processing (text files, images, logs) is very important. You will be amazed by what you will be able to accomplish just with awk, grep, and sed on Linux.
  3. If you design your workflow for Linux, it will run anywhere, on any cloud provider. Linux also dominates supercomputers, if you ever decide to go big.
  4. Linux will make you a better developer. It won't let you get sloppy with data. This topic deserves its own blog post.
  5. Windows file systems are case-insensitive by default, and that alone will eventually get you in trouble. If you don't believe me, ask Microsoft.
  6. Caffe, an awesome deep learning framework from Berkeley, installs easily on Linux (many options!). Windows support is experimental.
  7. No latency. Things are generally fast (again, this probably requires its own article)
  8. The Terminal. You will be surprised just how aware and productive you are once it becomes your primary environment. See #2 above.
  9. Most libraries used in data research are primarily Unix-based.
  10. Python, with all its dependencies, is much easier to install and run on Linux.
  11. When you start playing with CNNs, you will discover that some models may take hours or even days to train - so being able to monitor and control your hardware resources is very important. Linux makes that easy.
  12. If you do run into a problem on Linux, the solution is likely to be the top search result on Google, and the community is very supportive.
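
To make #5 a bit more concrete, here is a minimal Python sketch of the kind of check I'd run before moving a dataset from Linux to a Windows machine. It groups file names that differ only by letter case, since those are distinct files on a typical Linux file system but would clash on a case-insensitive one. The function name `find_case_collisions` and the sample file names are made up for illustration, and `str.casefold()` is only an approximation of how Windows actually compares names.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Group names that differ only by letter case.

    Hypothetical helper: casefold() approximates case-insensitive
    comparison, it does not reproduce the exact rules Windows uses.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups that contain more than one distinct spelling.
    return {
        key: sorted(set(spellings))
        for key, spellings in groups.items()
        if len(set(spellings)) > 1
    }


if __name__ == "__main__":
    # Made-up file names, as they might sit side by side in a Linux folder.
    files = ["Train.csv", "train.csv", "labels.txt", "README.md"]
    for key, spellings in find_case_collisions(files).items():
        print(key, "->", spellings)
```

On a real folder you could feed it `os.listdir(path)` instead of the hard-coded list. Any group it reports is a pair of files that cannot both exist under their current names once case stops mattering.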

So, what if you don't have any experience with Linux - can't you just forget Linux and the terminal, set up a Windows machine, install Python, R, and a bunch of GUI tools - and get into data science? You can, but you will regret that decision later.