---
title: Automated Problem Solver (Final Assignment)
emoji: 🤖
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true # optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
hf_oauth_expiration_minutes: 480
---

# 🤖 Automated Problem Solver (Final Assignment)

Hello fellow agent builders! This repository contains the final assignment for an automated problem-solving system. It utilizes a multi-agent architecture built with smolagents, leveraging various specialized tools and large language models (LLMs) accessed via OpenRouter to tackle a diverse range of questions.

The system is designed to:

  1. Understand & Clarify: Analyze the input question and associated files.
  2. Delegate: Route the task to the most suitable specialized agent (Web Search, YouTube Interaction, Multimedia Analysis, Code Interpretation).
  3. Utilize Tools: Employ custom tools for specific actions like YouTube video downloading, Wikipedia searching, speech-to-text transcription, and video audio extraction.
  4. Reason & Synthesize: Process information gathered by agents and tools to formulate a final answer.

## ✨ Core Concepts & Architecture

This project employs a hierarchical multi-agent system:

  • Chief Problem Solver Agent (Manager): The main orchestrator (chief_problem_solver_agent). It receives the initial problem, potentially clarifies it using a dedicated agent, and delegates the task to the appropriate specialized worker agent. It uses meta-llama/llama-4-maverick:free by default.
  • Specialized Agents:
    • Clarification Agent: Refines the user's question if needed. Uses a strong reasoning model (qwen/qwen3-235b-a22b by default).
    • YouTube Interaction Agent: Handles questions involving YouTube videos, utilizing relevant tools. Uses meta-llama/llama-4-maverick:free by default.
    • Web Search Manager Agent: Manages web searches using Serper and delegates specific page retrieval/analysis to its sub-agent. Uses meta-llama/llama-4-scout:free (high context) by default.
      • Website Retrieval Agent: Fetches and processes content from specific web pages. Uses a strong reasoning model (qwen/qwen3-235b-a22b by default).
    • Multimedia Analysis Agent: Processes images and audio files (using STT tools internally). Uses a multimodal model capable of vision (meta-llama/llama-4-scout:free by default).
    • Code Interpreter Agent: Executes and analyzes provided code snippets. Uses a coding-specialized model (open-r1/olympiccoder-32b:free by default).
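
For orientation, below is a minimal sketch of how such a hierarchy can be wired up with smolagents and OpenRouter. The agent names and model IDs mirror the defaults above, but the tool lists and exact constructor calls are illustrative, not the repository's actual code.

```python
import os
from smolagents import CodeAgent, OpenAIServerModel

def openrouter_model(model_id: str) -> OpenAIServerModel:
    """Every agent talks to the same OpenRouter endpoint; only the model ID differs."""
    return OpenAIServerModel(
        model_id=model_id,
        api_base=os.getenv("LLM_BASE_URL", "https://openrouter.ai/api/v1"),
        api_key=os.environ["LLM_API_KEY"],
    )

# Worker agents (tools omitted here; the real project attaches its custom tools).
web_search_manager_agent = CodeAgent(
    tools=[],
    model=openrouter_model("meta-llama/llama-4-scout:free"),
    name="web_search_manager_agent",
    description="Runs web searches and delegates page retrieval.",
)
code_interpreter_agent = CodeAgent(
    tools=[],
    model=openrouter_model("open-r1/olympiccoder-32b:free"),
    name="code_interpreter_agent",
    description="Executes and analyzes code snippets.",
)

# Manager agent: orchestrates and delegates to the workers above.
chief_problem_solver_agent = CodeAgent(
    tools=[],
    model=openrouter_model("meta-llama/llama-4-maverick:free"),
    managed_agents=[web_search_manager_agent, code_interpreter_agent],
    name="chief_problem_solver_agent",
    description="Top-level orchestrator for the final assignment tasks.",
)

chief_problem_solver_agent.run("What is the capital of France?")
```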

## Why OpenRouter?

Using OpenRouter provides significant advantages:

  1. Model Flexibility: Easily swap different LLMs for different agents to optimize for cost, performance, or specific capabilities (reasoning, coding, vision).
  2. Access to Diverse Models: Test and use a wide variety of models, including powerful free-tier options like qwerky-72b:free, olympiccoder-32b:free, or various Llama models.
  3. Simplified API: Access multiple LLM providers through a single API endpoint and key.

You'll need an OpenRouter API key to run this project.
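
Because OpenRouter exposes an OpenAI-compatible endpoint, switching models is just a matter of changing one string. Here is a minimal sketch with the `openai` client (for illustration only; the project itself wires models up through smolagents):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["LLM_API_KEY"],  # your OpenRouter key
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick:free",  # swap this ID to try another model
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```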

๐Ÿ› ๏ธ Custom Tools

The system relies on several custom tools to interact with external resources:

### YouTubeVideoDownloaderTool

Downloads YouTube videos.
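
As an illustration of how such a tool can be declared (a sketch, not the repository's actual implementation, and it omits the quality setting exercised below), a smolagents tool wrapping yt-dlp might look like:

```python
from smolagents import Tool
from yt_dlp import YoutubeDL

class YouTubeVideoDownloaderTool(Tool):
    name = "youtube_video_downloader"
    description = "Downloads a YouTube video and returns the local file path."
    inputs = {"video_url": {"type": "string", "description": "Full YouTube URL."}}
    output_type = "string"

    def forward(self, video_url: str) -> str:
        # Download the best available stream into the working directory.
        opts = {"format": "best", "outtmpl": "%(id)s.%(ext)s"}
        with YoutubeDL(opts) as ydl:
            info = ydl.extract_info(video_url, download=True)
            return ydl.prepare_filename(info)
```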

  • Test best quality (default):
    python cli.py --test-tool YouTubeVideoDownloaderTool --test-input "https://www.youtube.com/watch?v=aqz-KE-bpKQ"
    
  • Test standard quality:
    python cli.py --test-tool YouTubeVideoDownloaderTool --test-input "https://www.youtube.com/watch?v=aqz-KE-bpKQ" --test-quality standard
    
  • Test low quality:
    python cli.py --test-tool YouTubeVideoDownloaderTool --test-input "https://www.youtube.com/watch?v=aqz-KE-bpKQ" --test-quality low
    

### CustomWikipediaSearchTool

Searches current or historical Wikipedia articles. Requires a User-Agent.
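
Historical lookups work because MediaWiki exposes page revisions by timestamp. A rough sketch of that mechanism with `requests` (illustrative only; the actual tool's parameters and output differ):

```python
import requests

def wikitext_as_of(title: str, timestamp: str, user_agent: str) -> str:
    """Return the wikitext of `title` from the latest revision at or before `timestamp`."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "titles": title,
            "rvlimit": 1,
            "rvdir": "older",       # newest revision at or before rvstart
            "rvstart": timestamp,   # e.g. "2022-12-31T23:59:59Z"
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
            "formatversion": 2,
        },
        headers={"User-Agent": user_agent},
        timeout=30,
    )
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

print(wikitext_as_of("Web browser", "2022-12-31T23:59:59Z",
                     "MyTestAgent/1.0 (myemail@example.com)")[:500])
```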

  • Test Current Summary (Wikitext - default):
    python cli.py --test-tool CustomWikipediaSearchTool \
                  --test-input "Python (programming language)" \
                  --user-agent "MyTestAgent/1.0 (myemail@example.com)" \
                  --content-type summary
    
  • Test Current Full Text (HTML):
    python cli.py --test-tool CustomWikipediaSearchTool \
                  --test-input "Artificial Intelligence" \
                  --user-agent "MyTestAgent/1.0 (myemail@example.com)" \
                  --content-type text \
                  --extract-format HTML
    
  • Test Historical Version (Dec 31, 2022, Wikitext):
    python cli.py --test-tool CustomWikipediaSearchTool \
                  --test-input "Web browser" \
                  --user-agent "MyTestAgent/1.0 (myemail@example.com)" \
                  --revision-date "2022-12-31"
    
  • Test Historical Version (June 1, 2021, HTML):
    python cli.py --test-tool CustomWikipediaSearchTool \
                  --test-input "Quantum computing" \
                  --user-agent "MyTestAgent/1.0 (myemail@example.com)" \
                  --revision-date "2021-06-01" \
                  --extract-format HTML
    

### CustomSpeechToTextTool

Transcribes audio files using Hugging Face Transformers (Whisper).
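
Under the hood this maps onto the standard Transformers ASR pipeline; a minimal equivalent (assuming `transformers` and `torch` are installed) looks like:

```python
from transformers import pipeline

# Mirrors the tool's default checkpoint (openai/whisper-base.en).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base.en")

result = asr("/path/to/your/audio.wav")
print(result["text"])
```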

  • Example (Default Checkpoint openai/whisper-base.en):
    python cli.py --test-tool CustomSpeechToTextTool --test-input /path/to/your/audio.wav
    
  • Example (Tiny English Model):
    python cli.py --test-tool CustomSpeechToTextTool --test-input /path/to/your/audio.mp3 --checkpoint openai/whisper-tiny.en
    
  • Example (Audio URL): (Requires AgentAudio to support URL loading)
    python cli.py --test-tool CustomSpeechToTextTool --test-input https://example.com/audio.ogg
    

### VideoAudioExtractorTool

Extracts audio tracks from video files.
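
Audio extraction of this kind is typically a thin wrapper around ffmpeg. An equivalent call, sketched in Python and assuming `ffmpeg` is on the PATH (not the tool's actual code):

```python
import subprocess
from pathlib import Path

def extract_audio(video_path: str, output_format: str = "mp3", bitrate: str = "192k") -> Path:
    """Drop the video stream and re-encode the audio track next to the input file."""
    out = Path(video_path).with_suffix(f".{output_format}")
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-b:a", bitrate, str(out)],
        check=True,
    )
    return out

print(extract_audio("my_test_video.mp4", output_format="wav"))
```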

  • Basic Test (MP3 to same directory):
    python cli.py --test-tool VideoAudioExtractorTool --test-input my_test_video.mp4
    
  • Specify Output Directory, Format (WAV):
    python cli.py --test-tool VideoAudioExtractorTool --test-input path/to/another_video.mov --output-dir ./extracted_audio --output-format wav
    
  • Specify AAC Format and Bitrate:
    python cli.py --test-tool VideoAudioExtractorTool --test-input my_video.mp4 --output-format aac --audio-quality 192k
    

## 🚀 Getting Started (Local Setup)

  1. Prerequisites: Python 3 with pip, Git, and Git LFS installed.

  2. Clone the Repository:

    • Initialize Git LFS: git lfs install
    • Clone the space:
      # Use an access token with write permissions as the password when prompted
      # Generate one: https://huggingface.co/settings/tokens
      git clone https://huggingface.co/spaces/DataDiva88/AutomatedProblemSolver_Final_Assignment
      
    • (Optional) To clone without downloading large LFS files immediately:
      GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/spaces/DataDiva88/AutomatedProblemSolver_Final_Assignment
      
      You might need to run git lfs pull later to fetch the actual file contents if needed.
  3. Install Dependencies:

    cd AutomatedProblemSolver_Final_Assignment
    pip install -r requirements.txt
    

    โš ๏ธ Note: This might download large model files (e.g., for Transformers/Whisper), which can take time and disk space.

  4. Configure Environment Variables: Create a .env file in the root directory or set the following environment variables:

    # --- Hugging Face (Optional, needed for private spaces/LFS upload) ---
    # HF_TOKEN=hf_YOUR_HUGGINGFACE_TOKEN
    # SPACE_ID=DataDiva88/AutomatedProblemSolver_Final_Assignment
    
    # --- Application Settings ---
    DEBUG=true
    GRADIO_DEBUG=true # For Gradio interface debugging
    LOG_LEVEL=debug   # Set log level (debug, info, warning, error)
    
    # --- API Keys (REQUIRED) ---
    # Get from https://openrouter.ai/
    LLM_API_KEY=sk-or-v1-YOUR_OPENROUTER_API_KEY
    LLM_BASE_URL=https://openrouter.ai/api/v1
    
    # Get from https://serper.dev/
    SERPER_API_KEY=YOUR_SERPER_DEV_API_KEY
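
    A minimal sketch of how these values might be consumed at startup (assuming `python-dotenv`; the variable names match the block above):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # picks up the .env file in the project root

LLM_API_KEY = os.environ["LLM_API_KEY"]        # required
LLM_BASE_URL = os.getenv("LLM_BASE_URL", "https://openrouter.ai/api/v1")
SERPER_API_KEY = os.environ["SERPER_API_KEY"]  # required for web search
DEBUG = os.getenv("DEBUG", "false").lower() == "true"
```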
    

โ–ถ๏ธ How to Use

There are a few ways to interact with the project:

  1. Gradio Web Interface: Launch the app locally (typically python app.py, the configured app_file) and open the local URL Gradio prints, or use the hosted Hugging Face Space.

  2. Command Line Interface (CLI) for Custom Questions & Model Experimentation:

    Use cli.py to ask your own questions and easily experiment with different Large Language Models (LLMs) for various agent roles, thanks to the integration with OpenRouter.

    • Basic Question (Uses Default Models):

      # Runs with the default LLMs specified in the code
      python cli.py --question "What is the capital of France?"
      
    • Question with a File (Uses Default Models):

      python cli.py --question "Summarize this audio file." --file-name path/to/your/audio.mp3
      
    • Overriding the Manager Agent's Model: Want the main orchestrator to use a different LLM? Use the --manager-agent-llm-id flag.

      # Use Qwen 2 72B Instruct for the main manager agent
      python cli.py --question "Plan the steps to analyze the attached chess diagram." \
                    --file-name "diagram.png" \
                    --manager-agent-llm-id qwen/qwen2-72b-instruct:free
      
    • Overriding a Specialized Agent's Model (e.g., Coding Agent): Need a different model specifically for code interpretation? Use the corresponding flag.

      # Use DeepSeek Coder for the Code Interpreter agent, keeping others default
      python cli.py --question "Explain the attached Python script's output." \
                    --file-name "script.py" \
                    --coding-llm-id tngtech/deepseek-coder:free
      
    • Overriding Multiple Models: You can combine flags to customize several agents in a single run.

      # Use Llama 4 Maverick for the Manager and Qwen 3 235B for Reasoning tasks
      python cli.py --question "Analyze the arguments in the provided text." \
                    --file-name "arguments.txt" \
                    --manager-agent-llm-id meta-llama/llama-4-maverick:free \
                    --reasoning-agent-llm-id qwen/qwen3-235b-a22b
      

    How it Works:

    • The cli.py script accepts arguments like --<agent_role>-llm-id (e.g., --manager-agent-llm-id, --worker-agent-llm-id, --reasoning-agent-llm-id, --multimodal-llm-id, --coding-llm-id, etc.).
    • These arguments directly override the default models defined in the DefaultAgentLLMs class within the AutoPS core code (AutoPS/core.py or similar).
    • Specify the model using its OpenRouter identifier (e.g., meta-llama/llama-4-maverick:free). You can find available models on the OpenRouter Models page.
    • This makes it incredibly simple to test how different models perform for specific roles (manager, coding, reasoning, multimodal) without changing the core agent code.

  3. Run Specific Assignment Tasks (tasks.py): The tasks.py script allows you to run the predefined assignment questions.

    • Run ALL predefined tasks:
      python tasks.py
      
    • Run a SINGLE task by its ID:
      # Example: Run the first task
      python tasks.py 8e867cd7-cff9-4e6c-867a-ff5ddc2550be
      
      # Example: Run the task involving the chess image
      python tasks.py cca530fc-4052-43b2-b130-b30968d8aa44
      

## 📊 Telemetry & Debugging

This project uses OpenInference and Phoenix for observability and tracing agent runs.

  1. Start the Phoenix UI:
    python -m phoenix.server.main serve
    
  2. Access the UI: Open your browser to http://localhost:6006/projects
  3. Now, when you run tasks via cli.py or tasks.py, the agent interactions, tool usage, and LLM calls will be traced and viewable in the Phoenix UI.
  4. Set the LOG_LEVEL=debug environment variable for more verbose console output.
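
If traces do not show up, the usual OpenInference wiring for smolagents looks like the following sketch (assuming the `arize-phoenix` and `openinference-instrumentation-smolagents` packages are installed; the project may already register this internally):

```python
from phoenix.otel import register
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

# Send OpenTelemetry traces to the local Phoenix server started above.
tracer_provider = register()
SmolagentsInstrumentor().instrument(tracer_provider=tracer_provider)

# Any agent runs after this point appear at http://localhost:6006/projects.
```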

๐Ÿ“ Development Notes & Future Work

Based on initial development and testing, here are some areas for improvement:

  • Agent Naming: Rename clarification_agent to something more descriptive if its role evolves.
  • Model Experimentation: Continue trying different models for various agents via OpenRouter (e.g., test featherless/qwerky-72b:free, open-r1/olympiccoder-32b:free more extensively).
  • Prompt Engineering: Refine the prompts (TASK_PROMPT_TEMPLATE, RESOURCE_CHECK_TEMPLATE, and internal agent prompts) for better clarity, task decomposition, and result quality.
  • Planning Capabilities: Add explicit planning steps to agents like the code_interpreter_agent and multimedia_analysis_agent to break down complex tasks more robustly.
  • Manager Capabilities: Consider giving the chief_problem_solver_agent access to all tools/capabilities (similar to a reasoning agent) for more flexibility in handling complex, multi-step problems directly if needed.
  • PDF Support: Improve PDF handling for the agents, possibly with a dedicated extraction tool.

## Hugging Face Space Configuration

This project is configured to run as a Hugging Face Space using the following settings (see the YAML metadata at the top of this README.md):

  • SDK: Gradio (sdk: gradio)
  • SDK Version: 5.25.2 (sdk_version: 5.25.2)
  • Application File: app.py (app_file: app.py)
  • OAuth: Enabled for potential HF features (hf_oauth: true)
  • Config Reference: https://huggingface.co/docs/hub/spaces-config-reference

Happy agent building! Let me know if you have questions.

