Importing Data in Python

Data rarely arrives in the format you would choose. Learn how to import it into Python from text, CSV, Excel, databases, APIs, and the web, with clean, reliable patterns.

Before you can analyze anything, you have to get the data into Python, and the data rarely arrives in the format you would choose. It comes as text files, CSVs, Excel workbooks, the proprietary formats of SAS and Stata, scientific containers like HDF5 and MATLAB, SQL databases, and increasingly from the web through APIs and scraped HTML. This article is a practical tour of all of them, with the patterns that keep your code clean and your file handles from leaking.

Text Files and the Context Manager

Reading a text file is two steps: open it to get a handle, then read from it. The naive way pairs open() with close(), but if anything between those two calls fails, the file stays open and leaks.

file = open("moby_dick.txt", mode="r")
print(file.read())
file.close()

The better pattern is the context manager, which closes the file automatically when the block exits, even if an error fires inside it.

with open("moby_dick.txt") as file:
print(file.readline()) # first line
print(file.readline()) # next line
print(file.read()) # the rest

Two reading methods matter here. file.read() slurps all remaining content as one big string, while file.readline() reads a single line and advances an internal cursor, so successive calls walk through the file line by line. Because they share that cursor, mixing them means each read picks up exactly where the last one stopped. The mode argument controls access: "r" reads text (the default) and "rb" reads binary, which you need for formats like pickle.

Flat Files with NumPy

When you want a flat file as a pure NumPy array, with no column names and no DataFrame overhead, np.loadtxt is the tool.

import numpy as np
readings = np.loadtxt("sensor_log.txt",
delimiter="\t", # tab-separated
skiprows=1, # skip the header row
usecols=[0, 2], # keep only columns 0 and 2
dtype=float
)

You tell it the delimiter (a comma for CSV, a tab for TSV), how many header rows to skip, which columns to keep, and the data type. The result is a plain NumPy array: fast and memory-efficient, but stripped of names. Reach for this only when you specifically want an array for math and do not need pandas niceties. For most everyday work, the pandas reader is the better default.

Flat Files with Pandas

pd.read_csv is the workhorse of data loading. It auto-detects the delimiter, headers, and types, and hands back a fully featured DataFrame.

import pandas as pd
passengers = pd.read_csv("voyage_log.txt",
sep="\t", # tab-delimited
nrows=5, # only the first 5 rows
header=None, # the file has no header row
comment="#", # ignore lines starting with #
na_values="Nothing" # treat "Nothing" as NaN
)
passengers.head()

The optional arguments handle non-standard files. Use sep="\t" for tab-separated data, nrows to peek at a huge file without loading the whole thing, header=None when the file lacks a header so pandas numbers the columns itself, comment to skip comment lines common in scientific exports, and na_values to convert specific strings into proper missing values. This one function covers the large majority of loading tasks you will ever do.

When you need to hand the data to a library that wants a plain array, such as many scikit-learn functions, drop the labels with np.array(df). The equivalent modern call is df.to_numpy(), and the older df.values still appears everywhere.

Pickle: Python’s Native Object Store

Pickle saves and restores arbitrary Python objects: dicts, lists, trained models, custom class instances, anything. The file is binary, so you open it with "rb".

import pickle
with open("model_state.pkl", "rb") as file:
payload = pickle.load(file)
print(type(payload))

The recovered object comes back exactly as it was, with its full type intact. One critical caution applies: never load a pickle file from a source you do not trust, because pickle.load can execute arbitrary code during deserialization. Opening a malicious pickle is effectively running someone else’s Python on your machine, so stick to pickles you created or got from a clearly trusted source.

Excel Workbooks

An Excel file is a workbook of multiple sheets, so loading it is naturally a two-step process: open the workbook, then parse the sheets you want.

import pandas as pd
book = pd.ExcelFile("conflict_deaths.xlsx")
print(book.sheet_names) # inspect the sheets
deaths_2004 = book.parse("2004") # by sheet name
first_sheet = book.parse(0) # by index
summary = book.parse(0,
skiprows=[0],
names=["Country", "Deaths"],
usecols=[0]
)

pd.ExcelFile opens the workbook without reading any data, which lets you inspect sheet_names first. Then .parse() loads a specific sheet by name or by zero-based index, accepting the same row-skipping, renaming, and column-selecting options as the CSV reader. For one-off reads, the shortcut pd.read_excel("file.xlsx", sheet_name="2004") does both steps in a single line. Use the two-step form when you need to see the sheet names first or pull several sheets from the same file.

SAS and Stata Files

SAS and Stata are statistical packages with proprietary formats you will meet in social science, economics, healthcare, and pharmaceutical work. SAS needs an extra library, while Stata is built into pandas.

from sas7bdat import SAS7BDAT
with SAS7BDAT("sales_history.sas7bdat") as file:
sales = file.to_data_frame()
import pandas as pd
disease_area = pd.read_stata("disease_area.dta")

Open the SAS file as a context manager and call .to_data_frame() to convert it. The Stata case is a one-liner with pd.read_stata. Both return ordinary DataFrames once loaded.

HDF5 and MATLAB: Scientific Formats

HDF5 is the standard for very large scientific datasets, and its defining feature is hierarchy: an HDF5 file is like a filesystem inside a single file, with groups acting as folders and datasets as files, nested arbitrarily.

import h5py
import numpy as np
observatory = h5py.File("detector_run.hdf5", "r")
for key in observatory.keys():
print(key) # top-level groups
signal = np.array(observatory["strain"]["Strain"])

You open the file with h5py.File, explore its structure by iterating .keys() and indexing into groups like nested dictionaries, and materialize a dataset into memory with np.array only when you reach one. The payoff is lazy reads: a file can be hundreds of gigabytes and you load only the slice you actually need.

MATLAB’s .mat files store named variables, essentially a saved workspace, and scipy.io.loadmat returns them as a dictionary.

import scipy.io
workspace = scipy.io.loadmat("experiment.mat")
print(workspace.keys()) # variable names
readings = workspace["CYratioCyt"]
window = readings[25, 5:] # standard NumPy slicing

Each key is a MATLAB variable name and each value is usually a NumPy array, so once loaded it is just dictionary access plus normal array operations.

SQL Databases

SQLAlchemy is the standard way Python talks to databases. You create an engine from a connection string that names the dialect, location, and credentials, and the same code works across database types by changing only that string.

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("sqlite:///media_store.sqlite")

The sqlite:///path form points at a local SQLite file, the easiest setup for learning. To list tables, newer SQLAlchemy uses inspect(engine).get_table_names(), while older versions had engine.table_names().

There are three ways to run a query, in ascending order of convenience. The manual route opens a connection, executes the SQL, fetches rows, wraps them in a DataFrame, copies the column names across, and closes the connection, which is fragile because forgetting the close leaks the connection.

con = engine.connect()
result = con.execute("SELECT * FROM Album")
albums = pd.DataFrame(result.fetchall())
albums.columns = result.keys()
con.close()

Wrapping it in a context manager fixes the leak by closing automatically, the same pattern as with open(...) for files.

with engine.connect() as con:
result = con.execute("SELECT LastName, Title FROM Employee")
staff = pd.DataFrame(result.fetchmany(size=3))
staff.columns = result.keys()

But the cleanest option by far is to let pandas do everything in one call, handling the connection, fetch, and column naming for you.

albums = pd.read_sql_query("SELECT * FROM Album", engine)

Use this for almost every ad-hoc query, dropping to manual connections only when you need transaction control or several queries on a shared connection.

The SQL itself follows a predictable shape: choose columns with SELECT, name the table with FROM, filter rows with WHERE, sort with ORDER BY, and combine tables with INNER JOIN on a matching key.

SELECT Title, Name
FROM Album
INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Album.ArtistId >= 6
ORDER BY Title

Those four clauses plus joins cover the vast majority of analytical queries.

Pulling Files Straight From the Web

Pandas accepts a URL anywhere it accepts a file path, which gives you two ways to load a remote file. If you want a permanent local copy, download it first.

from urllib.request import urlretrieve
import pandas as pd
source = "https://example.com/wine_quality_red.csv"
urlretrieve(source, "wine_quality_red.csv")
wine = pd.read_csv("wine_quality_red.csv", sep=";")

urlretrieve saves the file to your working directory, after which you read it like any local file. Use this when you want an offline artifact that survives the URL changing. If you just need the data now, skip the download and hand the URL straight to the reader.

wine = pd.read_csv("https://example.com/wine_quality_red.csv", sep=";")

This is best for one-off analysis. The trade-off is dependence on the network and the source staying available, and that a later rerun might fetch different data. Excel works the same way, with one trick: passing sheet_name=None returns a dictionary mapping each sheet name to its DataFrame.

sheets = pd.read_excel("https://example.com/latitudes.xls", sheet_name=None)
print(sheets.keys())
print(sheets["1700"].head())

HTTP Requests: urllib and requests

Python’s built-in urllib makes HTTP requests in three verbose steps: build a request, send it, read the response, and close it manually.

from urllib.request import urlopen, Request
page = "https://example.com/dataset/notes"
request = Request(page)
response = urlopen(request)
html = response.read()
response.close()

It works and needs no install, but in practice almost everyone uses the third-party requests library, which collapses all of that into a single call and closes the connection for you.

import requests
r = requests.get("https://example.com/dataset/notes")
text = r.text # body as a decoded string

The response object exposes clean attributes: .text for the body, .json() for parsed JSON, and .status_code for the HTTP status. Use requests unless you have a specific reason not to.

Scraping HTML with BeautifulSoup

Once you have a page’s HTML as a string, extracting specific pieces with raw string operations is miserable. BeautifulSoup parses it into a navigable tree you can query.

import requests
from bs4 import BeautifulSoup
r = requests.get("https://example.com/profile")
soup = BeautifulSoup(r.text)
print(soup.title) # the <title> tag
print(soup.get_text()) # all visible text, tags stripped
for link in soup.find_all("a"):
print(link.get("href")) # the URL from each link

The pattern is two steps: requests.get fetches the HTML, BeautifulSoup parses it. From there, prettify() reformats the markup for reading, .title grabs the title tag, .get_text() returns just the visible text, and .find_all("a") returns every anchor tag. Each tag is itself an object whose attributes you read with .get("href"). That handful of methods covers most basic scraping.

Working with JSON and APIs

JSON is the lingua franca of the web, and it maps almost directly onto Python types: a JSON object becomes a dict, an array becomes a list, and nesting nests naturally. From a file, use json.load on a file handle (or json.loads on a string already in memory).

import json
with open("a_movie.json") as json_file:
movie = json.load(json_file)
for key in movie.keys():
print(key, movie[key])

From an API, requests makes it trivial: hit the URL and call .json() on the response to parse the body straight into a dict.

import requests
api = "http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network"
r = requests.get(api)
film = r.json()

Real APIs usually return deeply nested JSON, and you reach a value by chaining key lookups, like navigating folders.

wiki = "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza"
data = requests.get(wiki).json()
extract = data["query"]["pages"]["24768"]["extract"]

The hardest part is usually figuring out the structure in the first place, so print the raw JSON or use a viewer to map the layout, then chain the keys to reach what you want. The URL itself carries query parameters after the ?, joined by &, with = assigning each value and + standing in for a space.

A Complete Pipeline: From Live Tweets to a Chart

To see these pieces combine, here is the full path from streaming social data to a finished plot. First, collect tweets with Tweepy, which needs four credentials from a developer account and filters a live stream to your keywords.

import tweepy
stream = tweepy.Stream(consumer_key, consumer_secret,
access_token, access_token_secret)
stream.filter(track=["budget", "deficit", "inflation", "tariffs"])

The stream saves each tweet as one JSON object per line, a format called JSON Lines, so you load it by iterating the file and parsing each line individually.

import json
posts = []
with open("posts.txt", "r") as posts_file:
for line in posts_file:
posts.append(json.loads(line))
print(posts[0].keys()) # the fields each tweet carries

Pandas turns that list of dicts into a DataFrame, keeping only the columns you need.

import pandas as pd
df = pd.DataFrame(posts, columns=["text", "lang"])

A small text-mining step then counts keyword mentions. The helper lowercases both inputs so matching is case-insensitive, and the counting loop exploits the fact that in Python True behaves as 1 and False as 0, so adding the boolean result increments the counter only on a match.

import re
def word_in_text(word, text):
return bool(re.search(word.lower(), text.lower()))
[budget, deficit, inflation, tariffs] = [0, 0, 0, 0]
for index, row in df.iterrows():
budget += word_in_text("budget", row["text"])
deficit += word_in_text("deficit", row["text"])
inflation += word_in_text("inflation", row["text"])
tariffs += word_in_text("tariffs", row["text"])

Finally, two parallel lists of labels and counts feed a bar chart.

import matplotlib.pyplot as plt
import seaborn as sns
labels = ["budget", "deficit", "inflation", "tariffs"]
counts = [budget, deficit, inflation, tariffs]
ax = sns.barplot(x=labels, y=counts)
ax.set(ylabel="count")
plt.show()

From a live firehose to a finished chart in around thirty lines, which is the Python data ecosystem at its best.

The Decision Tree

Choosing a loader comes down to the source. For plain text and CSV, reach for pd.read_csv, dropping to np.loadtxt only when you specifically want a pure array. Excel files go through pd.ExcelFile and .parse, or the pd.read_excel shortcut. Statistical formats split by tool: SAS through the sas7bdat library, Stata through pd.read_stata. Scientific data uses h5py for HDF5 and scipy.io.loadmat for MATLAB. Databases use SQLAlchemy with pd.read_sql_query. Arbitrary Python objects use pickle, with the trust caveat. And for the web, download with urlretrieve or read the URL directly, make requests with requests, scrape with BeautifulSoup, and consume APIs by calling .json() on the response.

Conclusion

Getting data into Python is mostly a matter of matching the source to the right reader and then leaning on a few consistent habits. Prefer context managers so files and connections close themselves, prefer pandas readers for tabular data because one function handles most cases, and prefer requests and .json() for the web because they are clean and close connections for you. Treat pickle with suspicion when its source is unknown, use lazy reads for scientific files too big for memory, and let pd.read_sql_query hide the database plumbing. Learn that small toolkit and almost any file or feed becomes a DataFrame you can actually work with.

View Comments (1)

Leave a Reply

Prev Next

Subscribe to My Newsletter

Subscribe to my email newsletter to get the latest posts delivered right to your email. Pure inspiration, zero spam.

Discover more from Datalad - Data Science and ML

Subscribe now to keep reading and get access to the full archive.

Continue reading