Wordcount project in Spark with JAVA using Maven and Gradle
Follow the steps below to set up a word-count project in Spark with Java, using either Maven or Gradle.
Download winutils.exe from https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin
Move it to F:\BigData\hadoop\bin
Download pre-built Spark from http://spark.apache.org/downloads.html
Extract and move to F:\BigData
Set Environment Variables:
1. SPARK_HOME - F:\BigData\spark-3.0.1-bin-hadoop2.7
2. HADOOP_HOME - F:\BigData\hadoop
Path:
1. %SPARK_HOME%\bin
2. %HADOOP_HOME%\bin
Confirm installation via CMD
Enter: spark-shell
It should print the Spark version and start the interactive shell.
Set up the Hadoop Scratch directory
Create the following folder: C:\tmp\hive
Navigate to F:\BigData\hadoop\bin
Set permissions by typing:
winutils.exe chmod -R 777 C:\tmp\hive
For Maven project-
Create new project: D:\Codes\Spark\First project
Create class SimpleApp.java in D:\Codes\Spark\First project\src\main\java
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "F:\\BigData\\spark-3.0.1-bin-hadoop2.7\\README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
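The anonymous Function classes above predate Java 8 lambdas; with Spark 3.x you can write logData.filter(s -> s.contains("a")).count() directly. To see what the job computes without a Spark installation, the same filter-and-count logic can be sketched with plain Java streams (the sample lines below are hypothetical stand-ins for README.md, not its real contents):

```java
import java.util.Arrays;
import java.util.List;

public class FilterCountSketch {
    public static void main(String[] args) {
        // Hypothetical stand-in for the lines of README.md
        List<String> lines = Arrays.asList(
            "Apache Spark",
            "a fast and general engine",
            "for big data processing");

        // Stream equivalent of logData.filter(s -> s.contains("a")).count()
        long numAs = lines.stream().filter(s -> s.contains("a")).count();
        long numBs = lines.stream().filter(s -> s.contains("b")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        // → Lines with a: 3, lines with b: 1
    }
}
```

Spark's JavaRDD.filter follows the same shape, but evaluates lazily across the cluster and only materializes a result when an action such as count() is called.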
Create file pom.xml in D:\Codes\Spark\First project
<!-- pom.xml -->
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.0.1</version>
    </dependency>
  </dependencies>
</project>
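If mvn package fails with a source-level or target-level compiler error, pinning the Java version in pom.xml usually helps. A minimal sketch, assuming you compile for Java 8 (Spark 3.0.1 supports Java 8 and 11) — add this inside the <project> element:

```xml
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>
```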
Navigate to D:\Codes\Spark\First project in CMD
Execute 'mvn package' to download the dependencies and build the project JAR (it lands in the target folder; packaging alone does not run the application).
When the application runs, the output should contain a line similar to (exact counts depend on your README.md version):
Lines with a: 64, lines with b: 32
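A typical way to run the packaged JAR is spark-submit; the artifact name below is an assumption that follows from the artifactId and version in pom.xml above:

```shell
cd "D:\Codes\Spark\First project"
spark-submit --class SimpleApp --master local target\simple-project-1.0.jar
```

The --master local flag runs Spark in a single local JVM, which matches the -Dspark.master=local run configuration described later.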
For Gradle Project using Eclipse-
Create a new Spring Starter Project targeting Java 8.
/* SparkFirstProject3Application.java */
package com.example.demo;

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

// https://telegraphhillsoftware.com/starting-the-spark/
public class SparkFirstProject3Application {
    public static void main(String[] args) {
        String logFile = "F:\\BigData\\spark-3.0.1-bin-hadoop2.7\\README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
/* build.gradle */
plugins {
    id 'org.springframework.boot' version '2.4.0'
    id 'io.spring.dependency-management' version '1.0.10.RELEASE'
    id 'java'
}

group = 'com.example'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '1.8'

repositories {
    mavenCentral()
}

dependencies {
    // https://mvnrepository.com/artifact/org.apache.spark/spark-core
    // 'implementation' replaces the deprecated 'compile' configuration
    implementation group: 'org.apache.spark', name: 'spark-core_2.12', version: '3.0.1'
}

test {
    useJUnitPlatform()
}
Set the run configuration (applicable to the Maven project too):
VM Arguments: -Dspark.master=local
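Alternatively, the master URL can be set in code rather than as a VM argument. A sketch of the change to the SparkConf line above (local[*] uses all available local cores):

```java
SparkConf conf = new SparkConf()
        .setAppName("Simple Application")
        .setMaster("local[*]"); // same effect as -Dspark.master=local
```

Setting the master in code is convenient for IDE runs, but hard-codes the deployment mode; the VM argument (or spark-submit's --master flag) keeps the JAR portable.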
If there is any error:
Download SBT from https://www.scala-sbt.org/download.html
Install it and set SBT_HOME as an environment variable with the value <<SBT PATH>>.
Additional Resources:
1. https://telegraphhillsoftware.com/starting-the-spark/
2. https://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows
3. https://www.baeldung.com/apache-spark
4. https://github.com/apache/spark/blob/v2.3.1/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java
5. https://programmersought.com/article/63834592230/
6. https://spark.apache.org/docs/2.3.1/streaming-programming-guide.html
7. https://spark.apache.org/docs/2.3.1/rdd-programming-guide.html