Optimize AI chat response speed with streaming, model upgrade, and async operations by Copilot · Pull Request #2 · HexagonLab-s-SilverPlus/python

Copilot · 2025-12-02T07:22:44Z

Chat response latency was bottlenecked by sequential API calls, slow GPT-4-turbo model, synchronous TTS generation, and blocking DB saves.

Changes

Model Upgrade

gpt-4-turbo / gpt-4 → gpt-4o-mini across all endpoints (3-5x faster)

Streaming Response

New /chat/stream endpoint using SSE for real-time token delivery

# SSE format
data: {"type": "workspace", "workspaceId": "..."}
data: {"type": "content", "content": "안녕"}
data: {"type": "content", "content": "하세요"}
data: {"type": "done"}

TTS Separation

New /chat/tts endpoint for on-demand audio generation
Original /chat now accepts includeTts param (default: false)

Background Message Saving

User/AI messages saved via daemon threads, response returns immediately

threading.Thread(
    target=save_message_background,
    args=(SPRING_BOOT_API_URL, chat_data, headers, workspace_id),
    daemon=True
).start()

Files Modified

chat.py — streaming, TTS separation, async saves, model change
document_service.py — model change

Original prompt

문제 상황

현재 AI 채팅 응답 속도가 매우 느립니다. chat.py 분석 결과 다음과 같은 병목점들이 발견되었습니다.

개선이 필요한 항목들

1. 순차적 API 호출 문제 (가장 큰 병목)

현재 모든 작업이 순차적으로 실행되고 있습니다:

1단계: 감정 분석 analyze_sentiment(user_message)

2단계: GPT API 호출

3단계: TTS 생성

4단계: Spring Boot로 사용자 메시지 저장

5단계: Spring Boot로 AI 메시지 저장

2. GPT-4-turbo 모델 사용

gpt-4-turbo는 가장 느린 모델입니다

gpt-4o-mini로 변경하면 3-5배 빠릅니다

3. TTS 생성이 동기적으로 실행
def generate_tts(ai_reply):
    tts = gTTS(ai_reply, lang="ko")  # Google TTS - 네트워크 요청 필요
4. 워크스페이스 생성 시 추가 GPT 호출
def create_workspace(current_user, user_message, ai_reply):
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # 또 다른 느린 호출
        messages=[...]
    )
구현해야 할 개선 사항

1. 모델 변경
# Before
model="gpt-4-turbo"

# After
model="gpt-4o-mini"
모든 gpt-4-turbo 사용을 gpt-4o-mini로 변경

2. 스트리밍 응답 적용
# 스트리밍으로 변경
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    stream=True  # 스트리밍 활성화
)

# Flask에서 스트리밍 응답
def generate():
    for chunk in response:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

return Response(generate(), mimetype='text/event-stream')
3. 비동기 처리 적용
import asyncio
import aiohttp

async def chat_handler():
    # 병렬로 실행 가능한 작업들
    emotion_task = asyncio.create_task(analyze_sentiment_async(user_message))
    
    # GPT 응답 먼저 받고
    ai_reply = await get_gpt_response(user_message)
    
    # TTS와 DB 저장은 병렬로
    await asyncio.gather(
        generate_tts_async(ai_reply),
        save_to_spring_boot(user_chat_data),
        save_to_spring_boot(assistant_chat_data)
    )
4. TTS 분리 - 별도 엔드포인트로 분리하여 React에서 텍스트 먼저 표시 후 TTS는 나중에 로드할 수 있도록 함

5. 메시지 저장 비동기화
import threading

def save_messages_background():
    requests.post(f"{SPRING_BOOT_API_URL}/api/chat/save", ...)

# 응답 먼저 반환
response = jsonify({"reply": ai_reply})

# 저장은 백그라운드에서
thread = threading.Thread(target=save_messages_background)
thread.start()

return response
수정 대상 파일

chat.py - 메인 채팅 로직

필요시 document_service.py 등 다른 GPT 호출 부분도 동일하게 개선

기대 효과

모델 변경: 3-5배 속도 개선

스트리밍: 체감 5배 이상 개선 (첫 글자가 바로 표시됨)

비동기 처리: 1-2초 추가 단축

TTS 분리: 1-2초 단축

This pull request was created as a result of the following prompt from Copilot chat.

문제 상황

현재 AI 채팅 응답 속도가 매우 느립니다. chat.py 분석 결과 다음과 같은 병목점들이 발견되었습니다.

개선이 필요한 항목들

1. 순차적 API 호출 문제 (가장 큰 병목)

현재 모든 작업이 순차적으로 실행되고 있습니다:

1단계: 감정 분석 analyze_sentiment(user_message)

2단계: GPT API 호출

3단계: TTS 생성

4단계: Spring Boot로 사용자 메시지 저장

5단계: Spring Boot로 AI 메시지 저장

2. GPT-4-turbo 모델 사용

gpt-4-turbo는 가장 느린 모델입니다

gpt-4o-mini로 변경하면 3-5배 빠릅니다

3. TTS 생성이 동기적으로 실행
def generate_tts(ai_reply):
    tts = gTTS(ai_reply, lang="ko")  # Google TTS - 네트워크 요청 필요
4. 워크스페이스 생성 시 추가 GPT 호출
def create_workspace(current_user, user_message, ai_reply):
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # 또 다른 느린 호출
        messages=[...]
    )
구현해야 할 개선 사항

1. 모델 변경
# Before
model="gpt-4-turbo"

# After
model="gpt-4o-mini"
모든 gpt-4-turbo 사용을 gpt-4o-mini로 변경

2. 스트리밍 응답 적용
# 스트리밍으로 변경
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    stream=True  # 스트리밍 활성화
)

# Flask에서 스트리밍 응답
def generate():
    for chunk in response:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

return Response(generate(), mimetype='text/event-stream')
3. 비동기 처리 적용
import asyncio
import aiohttp

async def chat_handler():
    # 병렬로 실행 가능한 작업들
    emotion_task = asyncio.create_task(analyze_sentiment_async(user_message))
    
    # GPT 응답 먼저 받고
    ai_reply = await get_gpt_response(user_message)
    
    # TTS와 DB 저장은 병렬로
    await asyncio.gather(
        generate_tts_async(ai_reply),
        save_to_spring_boot(user_chat_data),
        save_to_spring_boot(assistant_chat_data)
    )
4. TTS 분리 - 별도 엔드포인트로 분리하여 React에서 텍스트 먼저 표시 후 TTS는 나중에 로드할 수 있도록 함

5. 메시지 저장 비동기화
import threading

def save_messages_background():
    requests.post(f"{SPRING_BOOT_API_URL}/api/chat/save", ...)

# 응답 먼저 반환
response = jsonify({"reply": ai_reply})

# 저장은 백그라운드에서
thread = threading.Thread(target=save_messages_background)
thread.start()

return response
수정 대상 파일

chat.py - 메인 채팅 로직

필요시 document_service.py 등 다른 GPT 호출 부분도 동일하게 개선

기대 효과

모델 변경: 3-5배 속도 개선

스트리밍: 체감 5배 이상 개선 (첫 글자가 바로 표시됨)

비동기 처리: 1-2초 추가 단축

TTS 분리: 1-2초 단축

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

…n, background saving Co-authored-by: silver0-stack <76518850+silver0-stack@users.noreply.github.com>

…e to threads Co-authored-by: silver0-stack <76518850+silver0-stack@users.noreply.github.com>

Initial plan

a781420

Copilot AI assigned Copilot and silver0-stack Dec 2, 2025

Copilot started work on behalf of silver0-stack December 2, 2025 07:23 View session

Copilot AI and others added 2 commits December 2, 2025 07:26

Optimize chat response speed: model changes, streaming, TTS separatio…

5bb17f2

…n, background saving Co-authored-by: silver0-stack <76518850+silver0-stack@users.noreply.github.com>

Address code review feedback: move json import to top, add daemon=Tru…

2f9302a

…e to threads Co-authored-by: silver0-stack <76518850+silver0-stack@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Optimize AI chat response speed by parallelizing tasks~~ Optimize AI chat response speed with streaming, model upgrade, and async operations Dec 2, 2025

Copilot AI requested a review from silver0-stack December 2, 2025 07:31

Copilot finished work on behalf of silver0-stack December 2, 2025 07:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize AI chat response speed with streaming, model upgrade, and async operations#2

Optimize AI chat response speed with streaming, model upgrade, and async operations#2
Copilot wants to merge 3 commits intomainfrom
copilot/optimize-chat-response-speed

Copilot AI commented Dec 2, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Model Upgrade

Streaming Response

TTS Separation

Background Message Saving

Files Modified

문제 상황

개선이 필요한 항목들

1. 순차적 API 호출 문제 (가장 큰 병목)

2. GPT-4-turbo 모델 사용

3. TTS 생성이 동기적으로 실행

4. 워크스페이스 생성 시 추가 GPT 호출

구현해야 할 개선 사항

1. 모델 변경

2. 스트리밍 응답 적용

3. 비동기 처리 적용

4. TTS 분리 - 별도 엔드포인트로 분리하여 React에서 텍스트 먼저 표시 후 TTS는 나중에 로드할 수 있도록 함

5. 메시지 저장 비동기화

수정 대상 파일

기대 효과

문제 상황

개선이 필요한 항목들

1. 순차적 API 호출 문제 (가장 큰 병목)

2. GPT-4-turbo 모델 사용

3. TTS 생성이 동기적으로 실행

4. 워크스페이스 생성 시 추가 GPT 호출

구현해야 할 개선 사항

1. 모델 변경

2. 스트리밍 응답 적용

3. 비동기 처리 적용

4. TTS 분리 - 별도 엔드포인트로 분리하여 React에서 텍스트 먼저 표시 후 TTS는 나중에 로드할 수 있도록 함

5. 메시지 저장 비동기화

수정 대상 파일

기대 효과

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Dec 2, 2025 •

edited

Loading