diff --git a/docs/modules/hdfs/images/hdfs_overview.drawio.svg b/docs/modules/hdfs/images/hdfs_overview.drawio.svg
new file mode 100644
index 00000000..e0ba5c59
--- /dev/null
+++ b/docs/modules/hdfs/images/hdfs_overview.drawio.svg
@@ -0,0 +1,4 @@
+
+
+
+
[SVG diagram text, markup not shown. Nodes: HDFS Operator; HdfsCluster `<name>` (custom resource); discovery ConfigMap `<name>`; for each role (dataNode, journalNode, nameNode): role Service `<name>-<role>`; for each role group (`<rg1>`, `<rg2>`): StatefulSet, Service and ConfigMap `<name>-<role>-<rg>`, with Pods `<name>-<role>-<rg1>-0`, `<name>-<role>-<rg1>-1` and `<name>-<role>-<rg2>-0`. Edge labels: create, read, references. Legend: Operator, Resource, Custom Resource.]
\ No newline at end of file
diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index 34d4f18c..a33eac4c 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -1,18 +1,25 @@
 = Stackable Operator for Apache HDFS
+:description: The Stackable Operator for Apache HDFS is a Kubernetes operator that can manage Apache HDFS clusters. Learn about its features, resources, dependencies and demos, and see the list of supported HDFS versions.
+:keywords: Stackable Operator, Hadoop, Apache HDFS, Kubernetes, k8s, operator, engineer, big data, metadata, storage, cluster, distributed storage
 
-The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] is used to set up HFDS in high-availability mode. It depends on the xref:zookeeper:ROOT:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.
+The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] (Hadoop Distributed File System) is used to set up HDFS in high-availability mode. HDFS is a distributed file system designed to store and manage massive amounts of data across multiple machines in a fault-tolerant manner. The Operator depends on the xref:zookeeper:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.
 
-NOTE: This operator only works with images from the https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fhadoop[Stackable] repository
+== Getting started
 
-== Roles
+Follow the xref:getting_started/index.adoc[Getting started guide] which will guide you through installing the Stackable HDFS and ZooKeeper Operators, setting up ZooKeeper and HDFS and writing a file to HDFS to verify that everything is set up correctly.
 
-Three xref:home:concepts:roles-and-role-groups.adoc[roles] of the HDFS cluster are implemented:
+Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your HDFS configuration to your needs, or have a look at the <<demos>> for some example setups.
+
+== Operator model
+
+The Operator manages the _HdfsCluster_ custom resource. The cluster implements three xref:home:concepts:roles-and-role-groups.adoc[roles]:
 
 * DataNode - responsible for storing the actual data.
 * JournalNode - responsible for keeping track of HDFS blocks and used to perform failovers in case the active NameNode fails. For details see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
 * NameNode - responsible for keeping track of HDFS blocks and providing access to the data.
 
-== Kubernetes objects
+
+image::hdfs_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the Stackable Operator for Apache HDFS]
 
 The operator creates the following K8S objects per role group defined in the custom resource.
 
@@ -28,15 +35,22 @@ In the custom resource you can specify the number of replicas per role group (Na
 * 1 JournalNode
 * 1 DataNode (should match at least the `clusterConfig.dfsReplication` factor)
 
+The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the HDFS instance. The discovery ConfigMap contains the `core-site.xml` file and the `hdfs-site.xml` file.
+
+== Dependencies
+
+HDFS depends on ZooKeeper for coordination between nodes. You can run a ZooKeeper cluster with the xref:zookeeper:index.adoc[]. Additionally, the xref:commons-operator:index.adoc[] and xref:secret-operator:index.adoc[] are needed.
+
+== [[demos]]Demos
+
+Two demos that use HDFS are available.
+
+**xref:stackablectl::demos/hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase to analyze the data.
+
+**xref:stackablectl::demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]** showcases the integration between HDFS and Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.
+
 == Supported Versions
 
 The Stackable Operator for Apache HDFS currently supports the following versions of HDFS:
 
 include::partial$supported-versions.adoc[]
-
-== Docker image
-
-[source]
-----
-docker pull docker.stackable.tech/stackable/hadoop:
-----
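
Review note: the naming scheme in the new diagram (`<name>-<role>-<rg>` for role group objects, with a trailing ordinal for Pods) may be easier to check with a small sketch. The Python below is illustrative only and is not part of the operator. The function name `resource_names`, its parameters and the dictionary keys are hypothetical, and the example role string `namenode` is an assumption about how a role appears in object names.

```python
def resource_names(cluster: str, role: str, role_group: str, replicas: int) -> dict:
    """Derive object names from the <name>-<role>-<rg> pattern in the diagram.

    Hypothetical helper for illustration; the operator's real logic may differ.
    """
    # StatefulSet, Service and ConfigMap of a role group share one name.
    group = f"{cluster}-{role}-{role_group}"
    return {
        # The diagram labels the discovery ConfigMap with the cluster name.
        "discovery_configmap": cluster,
        # One Service per role, spanning its role groups.
        "role_service": f"{cluster}-{role}",
        "statefulset": group,
        "rolegroup_service": group,
        "rolegroup_configmap": group,
        # StatefulSet Pods get zero-based ordinal suffixes.
        "pods": [f"{group}-{i}" for i in range(replicas)],
    }


if __name__ == "__main__":
    # "simple-hdfs", "namenode" and "default" are made-up example values.
    for kind, name in resource_names("simple-hdfs", "namenode", "default", 2).items():
        print(kind, name)
```

If the operator derives names differently in some cases, the diagram labels are the place to correct it.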