
chatgpt_synthesis.jsonl file #219

Open
austinmw opened this issue Dec 7, 2023 · 4 comments

Comments

@austinmw

austinmw commented Dec 7, 2023

Hi, do you have the example chatgpt_synthesis.jsonl file available?

@ybalbert001
Contributor

The original experiment data came from a customer; I only mocked up a single example in the notebook to show its schema, so I'm not able to share the real data. Sorry about that.

@austinmw
Author

austinmw commented Dec 29, 2023

Hi @ybalbert001 , thanks for your reply. Is the data-generation code available? For example, should I do something like this?

import pandas as pd
import json
from tqdm.auto import tqdm
from langchain.llms import Bedrock

# Replace with your actual file path and Bedrock model ID
csv_file_path = 'qa_pairs.csv'
bedrock_model_id = 'anthropic.claude-v1'

def generate_enhanced_data(csv_path, model_id):
    # Read CSV file
    df = pd.read_csv(csv_path)

    # Instantiate the Bedrock LLM
    llm = Bedrock(model_id=model_id, model_kwargs={'max_tokens_to_sample': 2000})

    # Container for the enhanced data
    enhanced_data = []

    # Loop through the DataFrame
    for index, row in tqdm(df.iterrows(), total=df.shape[0], desc="Generating Data"):
        original_question = row['Question']
        original_answer = row['Answer']
        source_context = row['Source Text']

        # Generate a new question
        new_question_prompt = f"\n\nHuman: Based on this context <context>{source_context}</context>, generate a related question.\n\nAssistant: Here is a sensible question based on the context provided: "
        new_question_result = llm.generate([new_question_prompt])
        new_question = new_question_result.generations[0][0].text.strip()
        print(new_question)

        # Generate a new answer
        new_answer_prompt = f"\n\nHuman: Answer this question: {new_question}. Use this context to answer: <context>{source_context}</context>.\n\nAssistant: Here is an answer based on the context provided: "
        new_answer_result = llm.generate([new_answer_prompt])
        new_answer = new_answer_result.generations[0][0].text.strip()

        # Append to the list
        enhanced_data.append({
            'origin_question': original_question,
            'origin_answer': original_answer,
            'generate_question': new_question,
            'generate_answer': new_answer
        })


    # Save the enhanced data to a JSONL file
    with open('bedrock_synthesis.jsonl', 'w') as outfile:
        for entry in enhanced_data:
            outfile.write(json.dumps(entry) + '\n')

# Call the function
generate_enhanced_data(csv_file_path, bedrock_model_id)
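For what it's worth, the resulting JSONL can be sanity-checked by reading it back with pandas. This is just a minimal sketch: the record below uses hypothetical values, and the field names simply mirror the keys in my snippet above (the actual schema in the notebook may differ):

```python
import json
import pandas as pd

# Write a single mock record with the same keys as the snippet above (hypothetical values)
record = {
    'origin_question': 'What is X?',
    'origin_answer': 'X is ...',
    'generate_question': 'How does X relate to Y?',
    'generate_answer': 'X relates to Y by ...',
}
with open('bedrock_synthesis.jsonl', 'w') as f:
    f.write(json.dumps(record) + '\n')

# Read the JSONL back; each line becomes one DataFrame row
df = pd.read_json('bedrock_synthesis.jsonl', lines=True)
print(sorted(df.columns))
```

This makes it easy to spot missing or misnamed fields before using the file for training.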

@631068264

631068264 commented Apr 13, 2024

@ybalbert001 I want to know how you generated the generate_question and generate_answer fields in the example bge_zh_research.ipynb.

@ybalbert001
Contributor

(quoting @austinmw's comment above)

correct
