Amazon Web Scraper for "Recuperación da Información e Web Semántica"

Apcozar/AmazonWebScraper

AmazonWebScraper

Amazon Web Scraper for "Recuperación da Información e Web Semántica"

Scrapy

  1. Create the venv: python3 -m venv venv
  2. Run the spider: scrapy crawl amazon

If the venv is created automatically, as PyCharm does, only step 2 is needed once the interpreter is configured correctly and the project dependencies have been downloaded.

Item parser in function process_item() in file pipelines.py

# Example item:
# {'brand': 'Saiet', 'cellular_technology': '4G', 'color': 'Blanco perla (ral 1013)', 'connectivity': 'Bluetooth, Wi-Fi',
#  'image': 'https://m.media-amazon.com/images/I/71EhDktbSRL.__AC_SX300_SY300_QL70_ML2_.jpg', 'memory_storage': '8 GB',
#  'model_name': None, 'os': 'Android 10.0', 'price': '119,90€', 'rating': '3,5 de 5 estrellas',
#  'screen_size': '5 Pulgadas', 'views': '46 valoraciones', 'wireless_net_tech': 'Wi-Fi'}

# more imports
import re  # regexp package
import locale  # set locale for parsing currency values

locale.setlocale(locale.LC_NUMERIC, "es_ES")


def process_item(self, item, spider):
    # rest of code
    doc = {
        # ...
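        # "[$€]" strips either currency symbol; "|" would be a literal inside a character class
        "price": locale.atof(re.sub(r"[$€]", "", str(item['price']))),
        "rating": locale.atof(str(item['rating']).split(' de 5 estrellas')[0]),
        # locale.atof (unlike float) also accepts a decimal comma in the size
        "screen_size": locale.atof(str(item['screen_size']).split(' Pulgadas')[0]),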
        "views": int(str(item['views']).split(' valoraciones')[0]),
        # ...
    }

    es.index(index="riws_amazon_scraper", document=doc)
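The field conversions above can also be sketched without depending on the system locale. This is a minimal illustration, not the project's code: `parse_es_number` and `parse_item` are hypothetical helper names, and the sketch assumes that every numeric field starts with a number written with a decimal comma, as in the example item.

```python
import re


def parse_es_number(text):
    """Return the first number in `text` as a float, reading ',' as the decimal separator."""
    match = re.search(r"\d+(?:\.\d{3})*(?:,\d+)?", str(text))
    if match is None:
        return None  # the field was None or contained no digits
    # Drop '.' thousands separators, then turn the decimal comma into a point.
    return float(match.group(0).replace(".", "").replace(",", "."))


def parse_item(item):
    """Convert the scraped string fields into numbers (hypothetical helper)."""
    views = parse_es_number(item.get("views"))
    return {
        "price": parse_es_number(item.get("price")),
        "rating": parse_es_number(item.get("rating")),
        "screen_size": parse_es_number(item.get("screen_size")),
        "views": int(views) if views is not None else None,
    }


example = {
    "price": "119,90€",
    "rating": "3,5 de 5 estrellas",
    "screen_size": "5 Pulgadas",
    "views": "46 valoraciones",
}
doc = parse_item(example)
```

Returning None for a missing field, instead of raising, is a design choice of this sketch; the example item shows that fields such as model_name can be None.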

Elasticsearch

Documentation link

1. Configure Docker memory

Raise the vm.max_map_count kernel limit inside the Docker Desktop WSL distribution so that Elasticsearch can start when the compose file runs:

wsl -d docker-desktop
sysctl -w vm.max_map_count=262144

2. Create the docker-compose file

Create the .env file:

#.env

# Password for the 'elastic' user (at least 6 characters)
ELASTIC_PASSWORD=mueiriws22

# Password for the 'kibana_system' user (at least 6 characters)
KIBANA_PASSWORD=mueiriws22

# Version of Elastic products
STACK_VERSION=8.4.3

# Set the cluster name
CLUSTER_NAME=docker-cluster

# Set to 'basic' or 'trial' to automatically start the 30-day trial
LICENSE=basic
#LICENSE=trial

# Port to expose Elasticsearch HTTP API to the host
ES_PORT=9200
#ES_PORT=127.0.0.1:9200

# Port to expose Kibana to the host
KIBANA_PORT=5601
#KIBANA_PORT=80

# Increase or decrease based on the available host memory (in bytes)
MEM_LIMIT=1073741824

# Project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME=riws
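Before starting the stack, the .env values can be sanity-checked with a short script. The sketch below is only an illustration: `parse_env` and `check_env` are hypothetical helpers, and the only rule they enforce is the one stated in the comments of the file above (passwords of at least 6 characters).

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_env(values):
    """Return a list of problems found in the parsed .env values."""
    problems = []
    for key in ("ELASTIC_PASSWORD", "KIBANA_PASSWORD"):
        if len(values.get(key, "")) < 6:
            problems.append(f"{key} must have at least 6 characters")
    return problems


sample = """
# Password for the 'elastic' user (at least 6 characters)
ELASTIC_PASSWORD=mueiriws22
KIBANA_PASSWORD=mueiriws22
STACK_VERSION=8.4.3
#LICENSE=trial
"""
env = parse_env(sample)
problems = check_env(env)
```

To check the real file, read it with `open(".env").read()` and pass the text to `parse_env`.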

Create the docker-compose.yaml file:

# docker-compose.yaml

version: "2.2"

services:
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f config/certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f config/certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: es01\n"\
          "    dns:\n"\
          "      - es01\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://es01:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u "elastic:${ELASTIC_PASSWORD}" -H "Content-Type: application/json" https://es01:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: [ "CMD-SHELL", "[ -f config/certs/es01/es01.crt ]" ]
      interval: 1s
      timeout: 5s
      retries: 120

  es01:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata01:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=es01
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=es01
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/es01/es01.key
      - xpack.security.http.ssl.certificate=certs/es01/es01.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/es01/es01.key
      - xpack.security.transport.ssl.certificate=certs/es01/es01.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120
  kibana:
    depends_on:
      es01:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://es01:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120

volumes:
  certs:
    driver: local
  esdata01:
    driver: local
  kibanadata:
    driver: local

Use the keyword type for the fields that will be used as facets; if a field is text, give it a sub-field of type keyword.

Use keyword for brand, cellular_technology, memory_storage and os.
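As an illustration of the note above, the mapping for the index could be built as a plain Python dictionary. This is a sketch under assumptions: `build_mappings` is a hypothetical helper, and treating model_name as a text field with a keyword sub-field is only an example, not something the project states.

```python
# Facet fields listed above, mapped as exact-match keywords.
KEYWORD_FIELDS = ["brand", "cellular_technology", "memory_storage", "os"]


def build_mappings(keyword_fields, text_fields=()):
    """Build an Elasticsearch mappings body (hypothetical helper)."""
    properties = {name: {"type": "keyword"} for name in keyword_fields}
    for name in text_fields:
        # Full-text field with a 'keyword' sub-field for facets and aggregations.
        properties[name] = {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}},
        }
    return {"properties": properties}


# 'model_name' as a text field is an assumption made for this example.
mappings = build_mappings(KEYWORD_FIELDS, text_fields=["model_name"])

# With the 8.x Python client, the index could then be created along these lines:
# es.indices.create(index="riws_amazon_scraper", mappings=mappings)
```

A facet on a text field would then aggregate on the sub-field, for example model_name.keyword.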

3. Run the service

Run docker-compose up -d in the directory where the .env and docker-compose.yaml files were created.

Open http://localhost:5601/ in the browser and enter the credentials.

  • login: elastic
  • password: the $ELASTIC_PASSWORD value written in the .env file (in this case mueiriws22)

4. HTTP requests to the service to manage indices

Use basic authentication with login elastic and password $ELASTIC_PASSWORD against the URL https://localhost:9200
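A request with basic authentication can be sketched with the Python standard library alone. The helper names `build_request` and `fetch` are hypothetical, and the sketch assumes that the CA certificate generated by the setup service (ca.crt) has been copied out of the certs volume to a local path; `_cat/indices` is used as an example endpoint for listing indices.

```python
import base64
import ssl
import urllib.request


def build_request(url, user, password):
    """Build a request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch(request, ca_cert_path):
    """Send the request, trusting the CA certificate at ca_cert_path (assumed local copy)."""
    context = ssl.create_default_context(cafile=ca_cert_path)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read().decode("utf-8")


# Building the request does not contact the server; fetch() does.
request = build_request(
    "https://localhost:9200/_cat/indices?v", "elastic", "mueiriws22"
)
# print(fetch(request, "ca.crt"))  # requires the running stack and the CA file
```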
