Streamline code analysis with AI chatbots using source code gatherer

Learn how to use a simple Python script to gather source code from your projects for efficient analysis with AI chatbots like Claude or ChatGPT.

Discover how to enhance your code analysis workflow using AI chatbots with a simple Python script that gathers source code from specified project folders. This tool helps you prepare your code for AI analysis while maintaining full control over the context provided to the AI.

Introduction

While integrated development environments (IDEs) offer sophisticated code analysis tools, AI chatbots like Claude and ChatGPT have emerged as useful alternatives for code understanding, documentation, and problem-solving. However, feeding your codebase to these tools can be cumbersome, especially when it means pasting files in one at a time. This article introduces a simple Python script that streamlines the process by gathering the text-based source files under a specified directory into a single text file.

Key Features

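The Source Code Gatherer script provides a straightforward way to collect the text files in a project directory for AI analysis. It keeps files whose extension is on a configurable list and, for everything else, falls back to the MIME type guessed from the file name.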


The script offers these main benefits:
  1. Selective Context Control:

    • Choose specific folders to analyze
    • Exclude binary files automatically
    • Control the scope of code being analyzed
  2. Universal Compatibility:

    • Works with any AI chatbot
    • No special integration required
    • Simple text output format

The Script

Options:

  • Download the file directly: gather_source_code.py
  • Copy the script below and save it as gather_source_code.py
import os
import mimetypes
import pathlib
from typing import List, Set

def is_text_file(filepath: str, text_extensions: Set[str]) -> bool:
   """
   Determine if a file is a text file based on its extension and mime type.
   """
   # Check if extension is in our allowed list
   ext = pathlib.Path(filepath).suffix.lower()
   if ext in text_extensions:
      return True

   # Use mime type as fallback
   mime_type, _ = mimetypes.guess_type(filepath)
   return mime_type is not None and mime_type.startswith('text/')

def collect_files(directory: str, text_extensions: Set[str]) -> List[str]:
   """
   Recursively collect all text files in the given directory.
   """
   text_files = []
   for root, _, files in os.walk(directory):
      for file in files:
         filepath = os.path.join(root, file)
         if is_text_file(filepath, text_extensions):
            text_files.append(filepath)

   # Sort for consistent output
   return sorted(text_files)

def create_combined_file(files: List[str], output_file: str) -> None:
   """
   Create a single file containing the content of all input files with proper formatting.
   """
   with open(output_file, 'w', encoding='utf-8') as outfile:
      for filepath in files:
         try:
            # Read the whole file first so a decode error cannot leave
            # a header or an unclosed fence behind in the output
            with open(filepath, 'r', encoding='utf-8') as infile:
               content = infile.read()
         except UnicodeDecodeError:
            print(f"Warning: Could not read {filepath} as text. Skipping.")
            continue
         except Exception as e:
            print(f"Error processing {filepath}: {e}")
            continue

         # Write file header and opening fence
         outfile.write(f"// Filepath: {filepath}\n\n\n")
         outfile.write("```\n")

         # Write file content
         outfile.write(content)

         # Write closing fence
         outfile.write("\n```\n\n\n")

def main():
   # Define the extensions you want to include
   text_extensions = {
      '.h', '.cpp', '.cs', '.py', '.json', '.xml', '.txt', '.md',
      '.ini', '.config', '.yaml', '.yml', '.uplugin', '.build',
      '.html', '.css', '.js', '.java', '.swift', '.m', '.mm',
      '.sh', '.bat', '.cmd', '.ps1', '.gradle', '.properties'
   }

   # Get directory from command line argument or use current directory
   import sys
   directory = sys.argv[1] if len(sys.argv) > 1 else '.'
   output_file = 'combined_source_code.txt'

   # Collect and process files
   print(f"Scanning directory: {directory}")
   files = collect_files(directory, text_extensions)
   print(f"Found {len(files)} text files")

   # Create combined file
   create_combined_file(files, output_file)
   print(f"Created combined file: {output_file}")

if __name__ == '__main__':
   main()
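
The extension and MIME checks in is_text_file never look inside a file, so a binary file with a text-like extension is only caught later, usually by the UnicodeDecodeError handler. One possible supplement, which is not part of the script above, is to sniff the first bytes of each file. In this sketch the helper name looks_binary and the 1024-byte sample size are assumptions chosen for illustration.

```python
def looks_binary(filepath: str, sample_size: int = 1024) -> bool:
    """Heuristic check: report a file as binary if its first bytes contain a NUL byte.

    Hypothetical helper, not part of the original script. UTF-16 text files
    also contain NUL bytes, so they would be reported as binary.
    """
    with open(filepath, 'rb') as f:
        sample = f.read(sample_size)
    return b'\x00' in sample
```

It could be called inside collect_files, next to is_text_file, to skip files that fail the check before they are read as text.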

Usage Guide

Basic Usage

  1. Save the script to your local machine
  2. Open a terminal or command prompt
  3. Run the script with a directory path:
    python gather_source_code.py "C:\MyProject"
    

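The script creates a file named combined_source_code.txt in the current working directory, containing the text files found under the specified folder. If you omit the path, it scans the current directory. Because the output is itself a .txt file, delete it before re-running the script from inside the scanned folder; otherwise the output file is picked up by the scan as well.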

Output Format

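The generated file starts each source file with a // Filepath: header and wraps its content in triple-backtick fences (omitted here for brevity), which gives clear separators between files:

// Filepath: src/main.py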

[Content of main.py]

// Filepath: src/utils/helper.py

[Content of helper.py]
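
To make the exact layout concrete, the following sketch reproduces what create_combined_file writes for a single file. The function name format_entry is an assumption introduced for illustration; the script itself writes these pieces directly to the output file.

```python
def format_entry(filepath: str, content: str) -> str:
    """Return one file's section in the same layout the script writes.

    Hypothetical helper for illustration only.
    """
    fence = "`" * 3  # the triple-backtick fence around each file's content
    return f"// Filepath: {filepath}\n\n\n{fence}\n{content}\n{fence}\n\n\n"


print(format_entry("src/main.py", "print('hello')"))
```

Each section therefore consists of the header line, two blank lines, the fenced content, and two more blank lines before the next header.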

Best Practices

Optimizing Your Workflow

  1. Choose the Right Scope:

    • Select specific subdirectories for focused analysis
    • Avoid including unnecessary files like build outputs or dependencies
  2. Managing Large Codebases:

    • Break down large projects into smaller chunks
    • Consider AI platform token limits
    • Focus on related code files for better context
  3. Effective AI Queries:

    • Provide clear questions about the gathered code
    • Reference specific files or functions in your questions
    • Consider including relevant documentation files
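
Breaking a large project into smaller pieces can be sketched as a greedy grouping by size. This is not part of the original script: the function name chunk_files, the (path, size) input, and the character budget are all assumptions. Character counts are only a rough stand-in for tokens, since each AI platform counts tokens differently.

```python
from typing import List, Tuple


def chunk_files(entries: List[Tuple[str, int]], max_chars: int) -> List[List[str]]:
    """Greedily group files, in order, so each group stays at or under max_chars.

    Hypothetical helper. A single file larger than max_chars still gets
    a group of its own rather than being split.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for filepath, size in entries:
        # Start a new group when adding this file would exceed the budget
        if current and current_size + size > max_chars:
            chunks.append(current)
            current, current_size = [], 0
        current.append(filepath)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Illustrative sizes in characters, not measured from a real project
entries = [("a.py", 400), ("b.py", 500), ("c.py", 300), ("d.py", 1200)]
print(chunk_files(entries, max_chars=1000))
# → [['a.py', 'b.py'], ['c.py'], ['d.py']]
```

Because the script sorts file paths, files from the same folder tend to land in the same group, which keeps related code together.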

Benefits Over IDE Integration

While many IDEs offer direct AI integration, this script provides several unique advantages:

  1. Platform Independence:

    • Works with any AI chatbot
    • No vendor lock-in
    • Simple to modify and customize
  2. Context Control:

    • Precise control over what code is included
    • Easy to exclude irrelevant files
    • Maintain focus on specific code areas
  3. Simplicity:

    • No complex setup required
    • Works with any project structure
    • Easy to understand and modify
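
As an example of how little it takes to customize the script, the sketch below skips whole directories such as dependency or build folders during the scan. It is one possible modification, not the original behavior: the function name collect_files_excluding and the example directory names are assumptions.

```python
import os
from typing import List, Set


def collect_files_excluding(directory: str, excluded_dirs: Set[str]) -> List[str]:
    """Recursively collect file paths, skipping any directory named in excluded_dirs.

    Hypothetical variant of collect_files; it does not filter by extension.
    """
    collected = []
    for root, dirs, files in os.walk(directory):
        # Prune in place so os.walk never descends into excluded directories
        dirs[:] = [d for d in dirs if d not in excluded_dirs]
        for name in files:
            collected.append(os.path.join(root, name))
    return sorted(collected)


if __name__ == "__main__":
    # Example directory names are illustrative; adjust them to your project
    for path in collect_files_excluding(".", {".git", "node_modules", "build"}):
        print(path)
```

Combining this pruning step with the is_text_file check from the script would give both extension filtering and directory exclusion in one pass.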

Conclusion

The Source Code Gatherer script provides a simple yet effective way to prepare your code for AI analysis. By streamlining the process of collecting source code, it allows developers to focus on getting meaningful insights from AI chatbots rather than wrestling with how to provide code context to them.