Spark-semantic

A simple Spark architecture for computing batches over an HDFS file source, providing classification of incomplete data in DCAT format.

Architecture

(architecture diagram: semantic-architecture)

Stack

Flume

Flume consumes data from Kafka and feeds it into the HDFS cluster.
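A Flume agent along these lines could do that bridging; the broker address, topic name, and HDFS path below are placeholders, not taken from this repo:

    # Kafka source -> in-memory channel -> HDFS sink
    agent.sources = kafka-src
    agent.channels = mem-ch
    agent.sinks = hdfs-sink

    agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafka-src.kafka.bootstrap.servers = kafka:9092
    agent.sources.kafka-src.kafka.topics = dcat-datasets
    agent.sources.kafka-src.channels = mem-ch

    agent.channels.mem-ch.type = memory

    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/data/dcat
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.hdfs-sink.channel = mem-ch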

HDFS

Distributed file system that stores all the data the system ingests. Files written to HDFS stay immutable, which helps avoid data corruption.

Spark

Spark computes batches in memory, which gives it strong performance on iterative machine-learning algorithms.
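As a sketch of what such a batch could look like, the snippet below reads DCAT JSON lines from HDFS and trains a simple MLlib text classifier. The HDFS path, column names, and the choice of Naive Bayes are illustrative assumptions, not this project's actual pipeline:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.classification.NaiveBayes

    object DcatBatch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dcat-batch").getOrCreate()

        // Hypothetical layout: one DCAT record per JSON line, with a free-text
        // description and a numeric category label.
        val datasets = spark.read.json("hdfs://namenode:8020/data/dcat")
          .select("description", "label")

        // Bag-of-words features from the description, fed to a Naive Bayes classifier.
        val pipeline = new Pipeline().setStages(Array(
          new Tokenizer().setInputCol("description").setOutputCol("words"),
          new HashingTF().setInputCol("words").setOutputCol("features"),
          new NaiveBayes().setLabelCol("label").setFeaturesCol("features")
        ))

        val model = pipeline.fit(datasets)
        model.transform(datasets).select("description", "prediction").show()
        spark.stop()
      }
    }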

Start the project

  1. Package

The Spark job isn't packaged automatically, so build it first:

    cd spark && sbt package

  2. Start the project with:

    docker-compose up

Then you can feed Kafka with JSON in DCAT format. Note that DCAT is the most common format for open data, so you'll be able to classify the data coming from your city's open-data API.
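For instance, a minimal DCAT-style record (all field values invented) could look like this:

    {
      "@type": "dcat:Dataset",
      "dct:title": "Public bike stations",
      "dct:description": "Location and capacity of the city's bike-share stations.",
      "dcat:keyword": ["transport", "bike"],
      "dcat:distribution": [{"dcat:downloadURL": "https://example.org/bikes.csv"}]
    }

In practice each record would be published to the Kafka topic as a single JSON line, so that it lands in HDFS in a form Spark can read line by line.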
