The need to organize and process documentation efficiently (for my Chromdb) was very crucial, as I wanted to feed these files to my LLM. Markdown files (.md) are widely used for documentation due to their simplicity and versatility. However, managing these files across complex directory structures can become cumbersome. In my scenario, I had a repository of around 7 GB of files, and documentation files were all at random places in this directory. In this blog post, we will explore a Python script designed to simplify this task by copying all Markdown files from a source directory (including its subdirectories) to a single destination directory, thereby flattening the structure. This script also smartly handles potential filename conflicts to ensure no data is lost.
Understanding the Problem
You know how it goes when you are deep into a big project. Documentation ends up everywhere, spread out through a maze of folders. Ever thought about how much easier life would be if all those Markdown files were just in one spot? Imagine whipping up a sleek script that is zipping through batch processing without the scavenger hunt. But here's the challenge: we have got to do it without playing a dangerous game of file-name roulette, risking overwrites and losing crucial info. Sound like a problem you've faced? I did, when I was making an LLM using ChromaDB.
There are two ways of doing this...
Of course, there is the traditional step-by-step guide along with code explanation and code samples that I provide in this blog, and you can follow along OR OR OR I give you an outline on how you can make this script by yourself by just reading the instructions. Sounds fun? Let's start with the outline for those who want to code their way through all by themselves:
D-I-Y:
Step 1: Importing Necessary Modules
The script begins by importing the required Python modules:
os
: Used for interacting with the operating system, enabling us to walk through the directory tree and work with file paths.shutil
: Offers a collection of high-level file operations, crucial for copying files and preserving metadata.
Step 2: Defining the copy_md_files_flat
Function
The core of the script is encapsulated in the copy_md_files_flat
function, which accepts two parameters:
source_dir
: The path to the directory from which the Markdown files will be copied.dest_dir
: The path to the destination directory where all Markdown files will be collected.
The function then iterates over each file in the source directory and its subdirectories, looking specifically for files that end with the .md
extension.
Step 3: Copying Files and Handling Conflicts
For each Markdown file found, the script constructs the full source path and the intended destination path. Before proceeding with the copy operation, it checks if the destination path already exists. If a conflict is detected (i.e., a file with the same name exists in the destination directory), the script employs a clever mechanism to resolve it:
It splits the filename into its base and extension components.
It then enters a loop, appending a counter to the base name until a unique filename is achieved.
This new, unique filename is used as the destination path.
Using shutil.copy2
, the file is copied from the source to the destination, with the method ensuring that file metadata is also preserved. A message is printed to the console indicating the successful copy, including the source and destination paths for verification.
Do-It-Along:
This script is designed to copy all Markdown (.md
) files from a source directory, including its subdirectories, to a single destination directory while handling potential filename conflicts by renaming duplicated filenames. Here's a detailed breakdown of each part of the code:
Import Statements
import os
import shutil
import os
: This imports theos
module, which provides a way of using operating system-dependent functionality like reading or writing to the file system.import shutil
: This imports theshutil
module, which offers high-level file operations, such as copying files.
Function Definition
def copy_md_files_flat(source_dir, dest_dir):
def
: This keyword is used to define a function in Python.copy_md_files_flat
: This is the name of the function. It's descriptive of its purpose: to copy Markdown files to a flat structure in the destination directory.source_dir, dest_dir
: These are parameters for the function.source_dir
is the directory from where the.md
files will be copied anddest_dir
is the destination directory where the.md
files will be placed.
Walking the Directory Tree
for root, dirs, files in os.walk(source_dir):
os.walk(source_dir)
: This function generates the file names in a directory tree by walking either top-down or bottom-up. For each directory in the tree rooted at the directorysource_dir
, it yields a 3-tuple(dirpath, dirnames, filenames)
.root
: The current directory path it's iterating over.dirs
: A list of subdirectories in the currentroot
.files
: A list of files in the currentroot
.
Iterating Over Files
for file in files:
if file.endswith(".md"):
This loop iterates through each file in the list of
files
.if file.endswith(".md")
: This line checks if the current file's name ends with the.md
extension, indicating it's a Markdown file.
Constructing File Paths
source_file_path = os.path.join(root, file)
dest_file_path = os.path.join(dest_dir, file)
os.path.join(root, file)
: This combines theroot
directory path and thefile
name into a full path to the source file.os.path.join(dest_dir, file)
: This constructs the full path to where the file should be copied in the destination directory.
Handling Filename Conflicts
if os.path.exists(dest_file_path):
base, extension = os.path.splitext(file)
counter = 1
while os.path.exists(dest_file_path):
new_filename = f"{base}_{counter}{extension}"
dest_file_path = os.path.join(dest_dir, new_filename)
counter += 1
if os.path.exists(dest_file_path)
: Checks if a file with the same name already exists in the destination directory.os.path.splitext(file)
: Splits the file name into a base name and extension. This is used to generate a new file name if there's a conflict.while os.path.exists(dest_file_path)
: If there's a conflict, this loop keeps running until a unique filename is found by appending a counter to the base name.
Copying the File
shutil.copy2(source_file_path, dest_file_path)
shutil.copy2
: This function is used to copy the file atsource_file_path
todest_file_path
. It's similar toshutil.copy()
, butcopy2
also attempts to preserve file metadata.
Printing the Action
print(f"Copied: {source_file_path} to {dest_file_path}")
- This prints a message to the console indicating that a file has been copied, showing both the source and destination paths.
Example Usage
source = r"/Users/mustafasaifee/Local Disk D/terraformDocs/terraform-provider-harness"
destination = r"/Users/mustafasaifee/Local Disk D/terraformDocs/md_only"
copy_md_files_flat(source, destination)
- These lines set the
source
anddestination
directory paths and then call thecopy_md_files_flat
function with these paths. It demonstrates how to use the function in a real scenario. Note that the paths provided are examples and should be changed to reflect the actual paths you want to work with.
Here's the complete code for you to use:
import os
import shutil
def copy_md_files_flat(source_dir, dest_dir):
for root, dirs, files in os.walk(source_dir):
for file in files:
if file.endswith(".md"):
source_file_path = os.path.join(root, file)
dest_file_path = os.path.join(dest_dir, file)
# Handle potential file name conflicts
if os.path.exists(dest_file_path):
base, extension = os.path.splitext(file)
counter = 1
while os.path.exists(dest_file_path):
new_filename = f"{base}_{counter}{extension}"
dest_file_path = os.path.join(dest_dir, new_filename)
counter += 1
shutil.copy2(source_file_path, dest_file_path)
print(f"Copied: {source_file_path} to {dest_file_path}")
# Example usage:
source = r"/Users/mustafasaifee/Local Disk D/terraformDocs/terraform-provider-harness" # Change this to your source directory
destination = r"/Users/mustafasaifee/Local Disk D/terraformDocs/md_only" # Change this to your destination directory
copy_md_files_flat(source, destination)
Conclusion:
The script we built smartly navigates the potential minefield of filename conflicts, ensuring that no file is accidentally overwritten. This level of care makes it an invaluable tool for documentation management. Whether you're setting the stage for an elegant documentation portal or simply streamlining your project's file structure, this script acts as your behind-the-scenes powerhouse, freeing you up to focus on crafting top-notch content. It's professional, it's efficient, and above all, it lets you concentrate on the creative aspects of your work.