Wordcount project in Spark with Java using Maven and Gradle

Follow the steps below to build a word-count project in Spark with Java, using either Maven or Gradle.

Download winutils.exe from https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin (pick the subdirectory that matches your Hadoop build; for spark-3.0.1-bin-hadoop2.7, the hadoop-2.7.1 binaries are the closer match)

Move it to F:\BigData\hadoop\bin

Download pre-built Spark from http://spark.apache.org/downloads.html

Extract and move to F:\BigData

Set Environment Variables:

1. SPARK_HOME - F:\BigData\spark-3.0.1-bin-hadoop2.7

2. HADOOP_HOME - F:\BigData\hadoop

Path:

1. %SPARK_HOME%\bin

2. %HADOOP_HOME%\bin

Confirm installation via CMD

Enter: spark-shell

It should print the Spark version and start the interactive shell.

Set up the Hadoop Scratch directory

Create the following folder: C:\tmp\hive


Navigate to F:\BigData\hadoop\bin

Set permissions by typing 

winutils.exe chmod -R 777 C:\tmp\hive


For a Maven project:

Create new project: D:\Codes\Spark\First project

Create class SimpleApp.java in D:\Codes\Spark\First project\src\main\java


/* SimpleApp.java */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "F:\\BigData\\spark-3.0.1-bin-hadoop2.7\\README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    sc.stop(); // release the SparkContext when done
  }
}
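SimpleApp above only counts lines containing "a" and "b"; the word count that gives this project its name follows the same split-then-aggregate pattern. As a minimal local sketch (the class name LocalWordCount is illustrative, not part of the project), the logic can be expressed in plain Java 8 streams; the comments note the Spark operation each step corresponds to:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LocalWordCount {

    // Split each line into words, then count occurrences per word.
    // In Spark the same steps become: flatMap (split lines into words),
    // mapToPair (word -> 1), reduceByKey (sum the 1s per word).
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // lines -> words
                .filter(word -> !word.isEmpty())                    // drop empty tokens
                .collect(Collectors.groupingBy(Function.identity(), // group by word
                        Collectors.counting()));                    // count per group
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(Arrays.asList("to be or", "not to be"));
        System.out.println(counts); // counts.get("to") == 2, counts.get("be") == 2
    }
}
```

Once this logic is clear, the Spark version simply swaps the stream calls for JavaRDD's flatMap, mapToPair, and reduceByKey applied to sc.textFile(...).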


Create file pom.xml in D:\Codes\Spark\First project

<!-- pom.xml -->
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.0.1</version>
    </dependency>
  </dependencies>
</project>


Navigate to D:\Codes\Spark\First project in CMD

Execute 'mvn package' to download the dependencies and build the project JAR. Then run SimpleApp (for example from the IDE, using the run configuration described at the end of this post).

The output will be 

Lines with a: 64, lines with b: 32



For a Gradle project using Eclipse

Create a Spring Starter Project targeting Java 8.

/* SparkFirstProject3Application.java */
package com.example.demo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

// https://telegraphhillsoftware.com/starting-the-spark/
public class SparkFirstProject3Application {
  public static void main(String[] args) {
    String logFile = "F:\\BigData\\spark-3.0.1-bin-hadoop2.7\\README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    sc.stop(); // release the SparkContext when done
  }
}


/* build.gradle */
plugins {
    id 'org.springframework.boot' version '2.4.0'
    id 'io.spring.dependency-management' version '1.0.10.RELEASE'
    id 'java'
}

group = 'com.example'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '1.8'

repositories {
    mavenCentral()
}

dependencies {
    // https://mvnrepository.com/artifact/org.apache.spark/spark-core
    // 'implementation' replaces the deprecated 'compile' configuration
    implementation group: 'org.apache.spark', name: 'spark-core_2.12', version: '3.0.1'
}

test {
    useJUnitPlatform()
}


Set the run configuration (this applies to the Maven project too):

VM Arguments: -Dspark.master=local
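Alternatively, the master can be set in code rather than as a VM argument. A sketch (hard-coding the master is convenient for local runs, but remove it before submitting to a real cluster; local[*] uses all available cores, while the plain local above uses one):

```java
// Roughly equivalent to passing -Dspark.master=local as a VM argument
SparkConf conf = new SparkConf()
    .setAppName("Simple Application")
    .setMaster("local[*]"); // run Spark locally on all available cores
```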



If errors persist:

Download SBT from https://www.scala-sbt.org/download.html

Install it and set SBT_HOME as an environment variable pointing to <<SBT PATH>>.


Additional Resources:

1. https://telegraphhillsoftware.com/starting-the-spark/

2. https://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows

3. https://www.baeldung.com/apache-spark

4. https://github.com/apache/spark/blob/v2.3.1/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java

5. https://programmersought.com/article/63834592230/

6. https://spark.apache.org/docs/2.3.1/streaming-programming-guide.html

7. https://spark.apache.org/docs/2.3.1/rdd-programming-guide.html

