Amazon Web Scraper for "Recuperación da Información e Web Semántica"
- Create the venv:
python3 -m venv venv
- Run the spider:
scrapy crawl amazon
If the venv is created automatically, as happens in PyCharm, just do step 2 once the interpreter is configured correctly and the project dependencies have been installed. When creating the venv by hand, activate it and install the dependencies first (e.g. pip install -r requirements.txt, assuming the repository ships such a file).
Items are parsed in the process_item() function of pipelines.py:
# Example item:
# {'brand': 'Saiet', 'cellular_technology': '4G', 'color': 'Blanco perla (ral 1013)', 'connectivity': 'Bluetooth, Wi-Fi',
# 'image': 'https://m.media-amazon.com/images/I/71EhDktbSRL.__AC_SX300_SY300_QL70_ML2_.jpg', 'memory_storage': '8 GB',
# 'model_name': None, 'os': 'Android 10.0', 'price': '119,90€', 'rating': '3,5 de 5 estrellas',
# 'screen_size': '5 Pulgadas', 'views': '46 valoraciones', 'wireless_net_tech': 'Wi-Fi'}
# more imports
import re      # regular expressions, used to strip currency symbols
import locale  # locale-aware number parsing ("119,90" -> 119.9)

# Spanish locale so locale.atof() treats "," as the decimal separator
# (on some systems the locale name may need to be "es_ES.UTF-8").
locale.setlocale(locale.LC_NUMERIC, "es_ES")

def process_item(self, item, spider):
    # rest of code
    doc = {
        # ...
        # "119,90€" -> 119.9 (strip the currency symbol, then parse with the Spanish locale)
        "price": locale.atof(re.sub("[$€]", "", str(item['price']))),
        # "3,5 de 5 estrellas" -> 3.5
        "rating": locale.atof(str(item['rating']).split(' de 5 estrellas')[0]),
        # "5 Pulgadas" -> 5.0
        "screen_size": float(str(item['screen_size']).split(' Pulgadas')[0]),
        # "46 valoraciones" -> 46
        "views": int(str(item['views']).split(' valoraciones')[0]),
        # ...
    }
    es.index(index="riws_amazon_scraper", document=doc)
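The es client used by process_item() is not shown in the snippet above; a minimal sketch of one way to build it, assuming the local Docker cluster configured below and the elasticsearch Python package (host, user and password are taken from the .env and compose files in this README, not from the project's actual code):

from elasticsearch import Elasticsearch

# Module-level client referenced by process_item() above.
es = Elasticsearch(
    "https://localhost:9200",              # ES_PORT exposed by docker-compose
    basic_auth=("elastic", "mueiriws22"),  # the 'elastic' user and ELASTIC_PASSWORD from .env
    verify_certs=False,                    # local testing only; see the TLS note at the end
)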
Configure the Docker VM so that the compose stack can run (Elasticsearch requires vm.max_map_count to be at least 262144). With Docker Desktop on WSL2:
wsl -d docker-desktop
sysctl -w vm.max_map_count=262144
Create the .env file:
#.env
# Password for the 'elastic' user (at least 6 characters)
ELASTIC_PASSWORD=mueiriws22
# Password for the 'kibana_system' user (at least 6 characters)
KIBANA_PASSWORD=mueiriws22
# Version of Elastic products
STACK_VERSION=8.4.3
# Set the cluster name
CLUSTER_NAME=docker-cluster
# Set to 'basic' or 'trial' to automatically start the 30-day trial
LICENSE=basic
#LICENSE=trial
# Port to expose Elasticsearch HTTP API to the host
ES_PORT=9200
#ES_PORT=127.0.0.1:9200
# Port to expose Kibana to the host
KIBANA_PORT=5601
#KIBANA_PORT=80
# Increase or decrease based on the available host memory (in bytes)
MEM_LIMIT=1073741824
# Project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME=riws
Create the docker-compose.yaml file:
# docker-compose.yaml
version: "2.2"
services:
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        elif [ x${KIBANA_PASSWORD} == x ]; then
          echo "Set the KIBANA_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f config/certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f config/certs/certs.zip ]; then
          echo "Creating certs";
          echo -ne \
          "instances:\n"\
          "  - name: es01\n"\
          "    dns:\n"\
          "      - es01\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://es01:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Setting kibana_system password";
        until curl -s -X POST --cacert config/certs/ca/ca.crt -u "elastic:${ELASTIC_PASSWORD}" -H "Content-Type: application/json" https://es01:9200/_security/user/kibana_system/_password -d "{\"password\":\"${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
        echo "All done!";
      '
    healthcheck:
      test: [ "CMD-SHELL", "[ -f config/certs/es01/es01.crt ]" ]
      interval: 1s
      timeout: 5s
      retries: 120
  es01:
    depends_on:
      setup:
        condition: service_healthy
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - esdata01:/usr/share/elasticsearch/data
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=es01
      - cluster.name=${CLUSTER_NAME}
      - cluster.initial_master_nodes=es01
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/es01/es01.key
      - xpack.security.http.ssl.certificate=certs/es01/es01.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.http.ssl.verification_mode=certificate
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/es01/es01.key
      - xpack.security.transport.ssl.certificate=certs/es01/es01.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca.crt https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120
  kibana:
    depends_on:
      es01:
        condition: service_healthy
    image: docker.elastic.co/kibana/kibana:${STACK_VERSION}
    volumes:
      - certs:/usr/share/kibana/config/certs
      - kibanadata:/usr/share/kibana/data
    ports:
      - ${KIBANA_PORT}:5601
    environment:
      - SERVERNAME=kibana
      - ELASTICSEARCH_HOSTS=https://es01:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt
    mem_limit: ${MEM_LIMIT}
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -I http://localhost:5601 | grep -q 'HTTP/1.1 302 Found'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120
volumes:
  certs:
    driver: local
  esdata01:
    driver: local
  kibanadata:
    driver: local
Use the keyword type for fields that will be used as facets; if a field is mapped as text, add a keyword sub-field.
Use keyword for brand, cellular_technology, memory_storage and os.
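A minimal sketch of such a mapping with the Python Elasticsearch client, using the index name from process_item(); the full field list and the raw sub-field name are illustrative assumptions, not the project's actual mapping:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200",
                   basic_auth=("elastic", "mueiriws22"),
                   verify_certs=False)

es.indices.create(
    index="riws_amazon_scraper",
    mappings={
        "properties": {
            # pure keyword fields, used directly as facets
            "brand": {"type": "keyword"},
            "cellular_technology": {"type": "keyword"},
            "memory_storage": {"type": "keyword"},
            "os": {"type": "keyword"},
            # full-text field with a keyword sub-field for faceting
            "model_name": {"type": "text",
                           "fields": {"raw": {"type": "keyword"}}},
            # numeric fields produced by process_item()
            "price": {"type": "float"},
            "rating": {"type": "float"},
            "screen_size": {"type": "float"},
            "views": {"type": "integer"},
        }
    },
)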
Run docker-compose up -d
in the directory where the .env and docker-compose.yaml files were created.
Open http://localhost:5601/ in a browser
and enter the credentials:
- login: elastic
- password: the $ELASTIC_PASSWORD set in the .env file (in this case mueiriws22)
Use basic authentication with the login elastic and the password $ELASTIC_PASSWORD against the URL https://localhost:9200.
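A sketch of that connection from Python, useful as a quick smoke test of the credentials; the ca.crt path is an assumption (the CA generated by the setup service has to be copied out of the certs volume, e.g. with docker cp, or verification disabled for local tests):

import os
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", os.environ.get("ELASTIC_PASSWORD", "mueiriws22")),
    ca_certs="ca.crt",  # hypothetical local copy of the cluster CA; or use verify_certs=False
)
print(es.info())  # prints cluster metadata if authentication and TLS are set up correctly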