Working with files is a necessary skill for Python developers. The glob module (the name is short for “global”) is part of Python’s standard library and lets you find all the paths matching a specified pattern, either in a single directory or across a directory tree. It makes it easy to search for and retrieve files without walking through directories manually. In this tutorial, we will explore how to use the glob module in Python.
Getting Started with Glob in Python
Glob is a standard Python module, so you don’t need to install anything to start using it. To get started, you need to import the glob module into your Python script:
import glob
Glob Patterns
Glob patterns are special patterns used to match filenames in a directory tree. They are similar to regular expressions but are much simpler to use. The following are some of the most commonly used globbing patterns:
* – matches any string of characters, including the empty string.
? – matches any single character.
[set] – matches any character in the specified set of characters. You can also specify ranges using the dash (-) character.
[!set] – matches any character that is not in the specified set of characters.
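To see the four patterns in action, here is a minimal sketch that creates a few empty files in a temporary folder and globs them. The file names and the match_names() helper are made up for illustration; only glob.glob() and glob.escape() come from the glob module.

```python
import glob
import os
import tempfile

def match_names(pattern, names):
    """Create empty files called `names` in a temp folder and
    return the sorted base names that match `pattern`."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            open(os.path.join(tmp, name), "w").close()
        # glob.escape() protects any special characters in the folder path itself
        matches = glob.glob(os.path.join(glob.escape(tmp), pattern))
        return sorted(os.path.basename(m) for m in matches)

# Hypothetical file names used only for this demo
names = ["a1.txt", "a2.txt", "b1.txt", "ab.txt", "notes.md"]

print(match_names("*.txt", names))          # every name ending in .txt
print(match_names("a?.txt", names))         # 'a' plus exactly one more character
print(match_names("[ab][0-9].txt", names))  # 'a' or 'b', then a single digit
print(match_names("[!a]*.txt", names))      # .txt names that do not start with 'a'
```

Note that these patterns match whole file names, so "a?.txt" matches ab.txt but not a10.txt.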
Basic Glob usage
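Now that you know about globbing patterns, let’s look at how to use them with the glob module. The module’s main function is glob.glob(), which takes a glob pattern as its argument and returns a list of paths that match the pattern. The module also provides glob.iglob(), which returns an iterator instead of a list, and glob.escape(), which escapes special characters in a path.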
import glob
files = glob.glob('/path/to/files/*.txt')
print(files)
In this example, the glob.glob() function returns a list of all the files in the /path/to/files/ directory that have a .txt extension.
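Two details are worth knowing when you use glob.glob(): it returns an empty list (rather than raising an error) when nothing matches, and the related glob.iglob() function yields matches one at a time instead of building the whole list. The sketch below illustrates both with a temporary folder; the file names and the count_matches() helper are assumptions made for the demo.

```python
import glob
import os
import tempfile

def count_matches(pattern):
    """Count matching paths lazily, without building the full list."""
    return sum(1 for _ in glob.iglob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files created only for this demo
    for name in ("one.txt", "two.txt", "data.csv"):
        open(os.path.join(tmp, name), "w").close()

    base = glob.escape(tmp)
    txt_count = count_matches(os.path.join(base, "*.txt"))
    no_matches = glob.glob(os.path.join(base, "*.json"))

print(txt_count)   # number of .txt files found
print(no_matches)  # an empty list, because no .json files exist
```

Checking for an empty result before processing is usually enough to handle the "no files found" case.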
Recursive Globbing
Sometimes, you may want to search for files not only in a single directory but also in its subdirectories. The glob module makes it easy to search for files recursively using the ** globbing pattern in glob.glob(). When recursive=True is passed, the ** pattern matches any number of directories, including none, so it can be used to search for files in a directory and all its subdirectories.
import glob
files = glob.glob('/path/to/files/**/*.txt', recursive=True)
print(files)
In this example, the glob.glob() function searches for all the files in the /path/to/files/ directory and all its subdirectories that have a .txt extension.
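The recursive=True argument matters: without it, ** behaves like a plain * and matches exactly one directory level. The following sketch builds a small, made-up directory tree in a temporary folder and compares the two calls.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: top.txt, sub/mid.txt, sub/deep/low.txt
    os.makedirs(os.path.join(tmp, "sub", "deep"))
    for rel in ("top.txt",
                os.path.join("sub", "mid.txt"),
                os.path.join("sub", "deep", "low.txt")):
        open(os.path.join(tmp, rel), "w").close()

    pattern = os.path.join(glob.escape(tmp), "**", "*.txt")

    # With recursive=True, ** spans zero or more directory levels
    recursive = sorted(os.path.relpath(p, tmp)
                       for p in glob.glob(pattern, recursive=True))

    # Without it, ** acts like * and matches a single directory level
    flat = sorted(os.path.relpath(p, tmp) for p in glob.glob(pattern))

print(recursive)  # all three .txt files
print(flat)       # only the file one level down
```

On very large trees a recursive search can take a while, so it may be worth narrowing the starting directory as much as possible.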
Access multiple csv files with glob and pandas
Now imagine there is a folder with several CSV files. Let’s access them using glob and combine them into a single Pandas DataFrame:
import pandas as pd
import glob
# Set path to the folder containing CSV files
path = '/path/to/csv/files/*.csv'
# Use glob to get a list of all CSV files in the folder
files = glob.glob(path)
# Initialize an empty DataFrame to store the combined data
df = pd.DataFrame()
# Loop through the files and concatenate them into a single DataFrame
for file in files:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df])
# Print the combined DataFrame
print(df)
In this example, we first use the glob.glob() function to get a list of all CSV files in the folder. We then initialize an empty Pandas DataFrame to store the combined data. Finally, we loop through the list of files, read each one into a temporary DataFrame, and concatenate it with the main DataFrame using the concat() function. At the end of the loop, we print the combined DataFrame to verify that all the data has been successfully loaded. If you have many files, collecting the DataFrames in a list and calling pd.concat() once is usually faster than concatenating inside the loop.
Note that if the CSV files have different column names or datatypes, you may need to specify additional arguments when reading them into Pandas using pd.read_csv(). For example, you may need to use the dtype or header arguments to ensure that all the data is correctly parsed.
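Before combining files, it can help to check whether they actually share the same column names. Here is a rough sketch of one way to do that using only glob and the standard csv module; the read_header() and group_by_header() helpers and the sample files are hypothetical, and the sketch assumes each file's first row is its header.

```python
import csv
import glob
import os
import tempfile

def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])

def group_by_header(pattern):
    """Map each distinct header (as a tuple) to the files that use it."""
    groups = {}
    for path in sorted(glob.glob(pattern)):
        header = tuple(read_header(path))
        groups.setdefault(header, []).append(os.path.basename(path))
    return groups

# Hypothetical sample data written to a temporary folder for the demo
samples = {
    "jan.csv": "id,amount\n1,10\n",
    "feb.csv": "id,amount\n2,20\n",
    "mar.csv": "id,total\n3,30\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, text in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    groups = group_by_header(os.path.join(glob.escape(tmp), "*.csv"))

for header, files in groups.items():
    print(header, files)
```

If more than one group comes back, the files do not all share the same columns, and you may want to rename or select columns before concatenating.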
Access multiple text files with glob
Reading text files in a folder using Glob in Python is very similar to reading CSV files, as shown in the previous example. Here’s an example of how you can use Glob to access all text files in a folder and read their contents:
import glob
# Set path to the folder containing text files
path = '/path/to/text/files/*.txt'
# Use glob to get a list of all text files in the folder
files = glob.glob(path)
# Loop through the files and read their contents
for file in files:
    with open(file, 'r') as f:
        contents = f.read()
        print(contents)
In this example, we first use the glob.glob() function to get a list of all text files in the folder. We then loop through the list of files and use the open() function to read the contents of each file. The with statement is used to automatically close the file when we’re done reading from it.
Note that the open() function is used with the mode 'r' to open the file in read-only mode. You can also use the readlines() method instead of read() to read the file contents into a list of lines. Additionally, you can specify the encoding of the file by passing the encoding parameter to the open() function, for example encoding='utf-8'. If you omit it, Python uses a platform-dependent default encoding, which is not always UTF-8.
The glob.glob() function returns file paths in arbitrary order (whatever order the operating system lists them in), which may not be the order you want. You can sort the list of file paths returned by glob.glob() using the sorted() function if you need to process them in a specific order.
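Putting these notes together, the sketch below reads every matching text file with an explicit encoding and in sorted order. The read_all() helper and the sample files are assumptions made for illustration, and UTF-8 is assumed as the file encoding.

```python
import glob
import os
import tempfile

def read_all(pattern, encoding="utf-8"):
    """Return {file name: text} for every match, visiting files in sorted order."""
    contents = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding=encoding) as f:
            contents[os.path.basename(path)] = f.read()
    return contents

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files, deliberately created out of alphabetical order
    for name, text in (("b.txt", "second\n"), ("a.txt", "café\n")):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)

    texts = read_all(os.path.join(glob.escape(tmp), "*.txt"))

print(list(texts))  # file names in the order they were read
```

Passing the encoding explicitly keeps the behavior the same across operating systems, which matters as soon as a file contains non-ASCII characters.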
Accessing and sorting files using glob
You can sort the files returned by glob.glob() in various ways depending on your requirements. Here are some examples:
1. Sort files by name: You can sort the files alphabetically by name using the sorted() function. glob.glob() itself does not guarantee any particular order. For example:
import glob
# Get a list of all text files in the folder sorted by name
files = sorted(glob.glob('/path/to/text/files/*.txt'))
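2. Sort files by creation time: You can sort the files based on their creation time using the os.path.getctime() function. Note that getctime() returns the creation time on Windows, but on Unix-like systems it returns the time of the last metadata change. For example: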
import glob
import os
# Get a list of all text files in the folder sorted by creation time
files = sorted(glob.glob('/path/to/text/files/*.txt'), key=os.path.getctime)
3. Sort files by modification time: You can sort the files based on their modification time using the os.path.getmtime() function. For example:
import glob
import os
# Get a list of all text files in the folder sorted by modification time
files = sorted(glob.glob('/path/to/text/files/*.txt'), key=os.path.getmtime)
4. Sort files by size: You can sort the files based on their size using the os.path.getsize() function. For example:
import glob
import os
# Get a list of all text files in the folder sorted by size
files = sorted(glob.glob('/path/to/text/files/*.txt'), key=os.path.getsize)
Note that in all the above examples, we first use glob.glob() to get a list of all text files in the folder, and then we use the sorted() function to sort the list. When a key parameter is given, it is set to a function that returns the value that we want to use for sorting. For example, os.path.getsize() returns the size of a file in bytes.
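The same key functions also work with reverse=True and with max() or min(). As a small sketch, the code below creates three made-up files with known sizes and explicitly set modification times, then lists them largest first and oldest first and picks the most recently modified one. The file names, sizes, and timestamps are arbitrary values chosen for the demo.

```python
import glob
import os
import tempfile

# Hypothetical files: name -> (size in bytes, modification time offset in seconds)
specs = {"small.txt": (1, 10), "medium.txt": (10, 30), "large.txt": (100, 20)}
base_time = 1_600_000_000  # arbitrary fixed timestamp so the demo is repeatable

with tempfile.TemporaryDirectory() as tmp:
    for name, (size, offset) in specs.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write("x" * size)
        # Set the access and modification times explicitly
        os.utime(path, (base_time + offset, base_time + offset))

    paths = glob.glob(os.path.join(glob.escape(tmp), "*.txt"))

    # Largest file first
    largest_first = [os.path.basename(p)
                     for p in sorted(paths, key=os.path.getsize, reverse=True)]
    # Oldest modification time first
    oldest_first = [os.path.basename(p)
                    for p in sorted(paths, key=os.path.getmtime)]
    # Most recently modified file
    newest = os.path.basename(max(paths, key=os.path.getmtime))

print(largest_first)
print(oldest_first)
print(newest)
```

Using max() with a key is a convenient way to find, say, the latest log file without sorting the whole list.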
Conclusion
The glob module in Python is a powerful tool that can save you a lot of time when searching for files in a directory tree. With its easy-to-use globbing patterns and recursive searching capabilities, it’s a great tool for any Python developer to have in their arsenal.