Modify Airflow DAG #44

Open
wants to merge 6 commits into
base: main
Conversation

heehehe
Owner

@heehehe heehehe commented Mar 8, 2024

Resolve #13

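  • Implemented the DAG so that it runs as shown in the image below :)
    1. get_url_list: load the list of URLs to crawl for each site
    2. get_recruit_content_info: crawl each URL and extract its page source
    3. postprocess: parse the HTML to extract the data
    4. upload_to_bigquery: upload the data to BigQuery
    5. run_dbt: build the final tables with dbt
    image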
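The ordering of the five steps above can be sketched in plain Python as a list of upstream/downstream task-id pairs. This is only an illustration, not the code in this PR: the `build_edges` helper is hypothetical, the `wanted.` prefix is taken from the task id in the log, and running `run_dbt` once after all sites finish is an assumption, since the DAG image is not reproduced here.

```python
from typing import List, Tuple

# Per-site steps, in the order listed in the PR description.
SITE_STEPS = [
    "get_url_list",
    "get_recruit_content_info",
    "postprocess",
    "upload_to_bigquery",
]
# Assumption: run_dbt runs once, after every site has uploaded its data.
FINAL_STEP = "run_dbt"


def build_edges(sites: List[str]) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) task-id pairs for the given sites."""
    edges: List[Tuple[str, str]] = []
    for site in sites:
        task_ids = [f"{site}.{step}" for step in SITE_STEPS]
        # Chain the per-site steps one after another.
        edges.extend(zip(task_ids, task_ids[1:]))
        # The last per-site step feeds the shared dbt run.
        edges.append((task_ids[-1], FINAL_STEP))
    return edges


if __name__ == "__main__":
    for upstream, downstream in build_edges(["wanted"]):
        print(f"{upstream} >> {downstream}")
```

In an Airflow DAG file, each pair would correspond to one `upstream >> downstream` dependency between operators.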
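Resolved with Firefox
  • get_url_list is currently failing with the WebDriverException shown below 🥲.
    I think this is better handled separately, so I'm requesting review first!
    --> Chrome can't be used because the driver and Chrome versions don't match; switching to Firefox resolves it!!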
    *** Found local files:
    ***   * /opt/airflow/logs/dag_id=job_trend_daily/run_id=scheduled__2024-03-08T00:00:00+00:00/task_id=wanted.get_url_list/attempt=3.log
    [2024-03-09, 03:31:16 UTC] {taskinstance.py:1979} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: job_trend_daily.wanted.get_url_list scheduled__2024-03-08T00:00:00+00:00 [queued]>
    [2024-03-09, 03:31:16 UTC] {taskinstance.py:1979} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: job_trend_daily.wanted.get_url_list scheduled__2024-03-08T00:00:00+00:00 [queued]>
    [2024-03-09, 03:31:16 UTC] {taskinstance.py:2193} INFO - Starting attempt 3 of 3
    [2024-03-09, 03:31:16 UTC] {taskinstance.py:2214} INFO - Executing <Task(PythonOperator): wanted.get_url_list> on 2024-03-08 00:00:00+00:00
    [2024-03-09, 03:31:16 UTC] {standard_task_runner.py:60} INFO - Started process 423 to run task
    [2024-03-09, 03:31:16 UTC] {standard_task_runner.py:87} INFO - Running: ['***', 'tasks', 'run', 'job_trend_daily', 'wanted.get_url_list', 'scheduled__2024-03-08T00:00:00+00:00', '--job-id', '19', '--raw', '--subdir', 'DAGS_FOLDER/deploy_daily.py', '--cfg-path', '/tmp/tmp9p2seab7']
    [2024-03-09, 03:31:16 UTC] {standard_task_runner.py:88} INFO - Job 19: Subtask wanted.get_url_list
    [2024-03-09, 03:31:16 UTC] {task_command.py:423} INFO - Running <TaskInstance: job_trend_daily.wanted.get_url_list scheduled__2024-03-08T00:00:00+00:00 [running]> on host 43e3f6697550
    [2024-03-09, 03:31:17 UTC] {taskinstance.py:2510} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='admin' AIRFLOW_CTX_DAG_ID='job_trend_daily' AIRFLOW_CTX_TASK_ID='wanted.get_url_list' AIRFLOW_CTX_EXECUTION_DATE='2024-03-08T00:00:00+00:00' AIRFLOW_CTX_TRY_NUMBER='3' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-03-08T00:00:00+00:00'
    [2024-03-09, 03:31:17 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 444, in _execute_task
        result = _execute_callable(context=context, **execute_callable_kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
        return execute_callable(context=context, **execute_callable_kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 200, in execute
        return_value = self.execute_callable()
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 217, in execute_callable
        return self.python_callable(*self.op_args, **self.op_kwargs)
      File "/opt/airflow/dags/crawling.py", line 670, in get_url_list
        driver = self.driver()
      File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 45, in __init__
        super().__init__(
      File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 50, in __init__
        self.service.start()
      File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 102, in start
        self.assert_process_still_running()
      File "/home/airflow/.local/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 115, in assert_process_still_running
        raise WebDriverException(f"Service {self._path} unexpectedly exited. Status code was: {return_code}")
    selenium.common.exceptions.WebDriverException: Message: Service /home/airflow/.cache/selenium/chromedriver/linux64/122.0.6261.111/chromedriver unexpectedly exited. Status code was: 127
    [2024-03-09, 03:31:17 UTC] {taskinstance.py:1149} INFO - Marking task as FAILED. dag_id=job_trend_daily, task_id=wanted.get_url_list, execution_date=20240308T000000, start_date=20240309T033116, end_date=20240309T033117
    [2024-03-09, 03:31:17 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 19 for task wanted.get_url_list (Message: Service /home/airflow/.cache/selenium/chromedriver/linux64/122.0.6261.111/chromedriver unexpectedly exited. Status code was: 127
    ; 423)
    [2024-03-09, 03:31:17 UTC] {local_task_job_runner.py:234} INFO - Task exited with return code 1
    [2024-03-09, 03:31:17 UTC] {taskinstance.py:3309} INFO - 0 downstream tasks scheduled from follow-on schedule check
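A minimal sketch of how the failure above could be inspected, and of what a Firefox-based driver might look like. None of this is the code in this PR: `parse_service_exit`, `EXIT_HINTS`, and `make_firefox_driver` are hypothetical names, the exit-code hints are the usual shell conventions rather than a diagnosis of this run, and headless mode is an assumption based on the task running inside a container.

```python
import re
from typing import Optional, Tuple

# Matches the message Selenium raises when the driver service dies on startup.
_SERVICE_EXIT = re.compile(
    r"Service (?P<path>\S+) unexpectedly exited\. Status code was: (?P<code>-?\d+)"
)

# Conventional shell exit statuses; generic hints only.
EXIT_HINTS = {
    126: "file was found but could not be executed",
    127: "command or a shared library it needs was not found",
}


def parse_service_exit(message: str) -> Optional[Tuple[str, int]]:
    """Extract (driver_path, status_code) from a WebDriverException message."""
    match = _SERVICE_EXIT.search(message)
    if match is None:
        return None
    return match.group("path"), int(match.group("code"))


def make_firefox_driver():
    """Create a headless Firefox driver (assumes selenium 4 and Firefox are installed)."""
    # Imported lazily so the parsing helper above works without selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # assumption: no display in the container
    return webdriver.Firefox(options=options)


if __name__ == "__main__":
    msg = (
        "Message: Service /home/airflow/.cache/selenium/chromedriver/linux64/"
        "122.0.6261.111/chromedriver unexpectedly exited. Status code was: 127"
    )
    parsed = parse_service_exit(msg)
    if parsed is not None:
        path, code = parsed
        print(path, code, EXIT_HINTS.get(code, "unknown"))
```

Status 127 by itself does not confirm a version mismatch. It only says the driver process could not start, so checking the driver binary and its system libraries inside the image may be worthwhile if Chrome is revisited later.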
    

@heehehe heehehe self-assigned this Mar 8, 2024
@heehehe heehehe mentioned this pull request Mar 8, 2024
@heehehe heehehe marked this pull request as ready for review March 9, 2024 05:40
Development

Successfully merging this pull request may close these issues.

Set up Airflow
1 participant