I decided to teach myself how to work with big data and came across Apache Spark. While I had heard of Apache Hadoop, using Hadoop for big data would have meant writing code in Java, which I was not really looking forward to, as I love to write code in Python. Spark supports a Python programming API called PySpark that is actively maintained, and that was enough to convince me to start learning PySpark for working with big data.

In this post, I describe how I got started with PySpark on Windows. Open source projects often do not have good Windows support, so I first had to figure out whether Spark and PySpark would work well on Windows; the official Spark documentation does mention support for Windows, and the screenshots in this post are specific to Windows 10.

I am also assuming that you are comfortable working with the Command Prompt on Windows. You do not have to be an expert, but you need to know how to start a Command Prompt and run commands such as those that help you move around your computer's file system. In case you need a refresher, a quick introduction might be handy.

PySpark requires Java version 7 or later and Python version 2.6 or later. Let's first check whether they are already installed, install them if necessary, and make sure that PySpark can work with these two components. It is quite possible that a required version of Java (in our case version 7 or later) is already available on your computer. To check if Java is available and to find its version, open a Command Prompt and type the following command.
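The command in question is presumably the standard Java version check, which prints the installed version (the exact output varies by installation):

    java -version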
If Java is not installed (or is not on your PATH), the Command Prompt will instead respond with an error such as:

    'java' is not recognized as an internal or external command, operable program or batch file.
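The Python requirement can be checked the same way. Assuming the interpreter is on your PATH as python, the installed version is reported by:

    python --version

If either command is not recognized, the corresponding component still needs to be installed (or added to the PATH) before setting up PySpark.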