Loading and Inspecting Data

First steps with any dataset

import pandas as pd
 
df = pd.read_csv("dataset.csv")
 
# Always run these first
print(df.shape)           # how big is it?
print(df.head())          # what does it look like?
df.info()                 # column types, non-null counts (prints itself; returns None)
print(df.describe())      # statistics for numeric columns
print(df.isnull().sum())  # missing values per column

Common formats

Format   Load with          Notes
CSV      pd.read_csv()      Universal, but slow for large files
Parquet  pd.read_parquet()  Fast, compressed, preserves types; use for large data
JSON     pd.read_json()     Nested data, API responses
Excel    pd.read_excel()    Needs openpyxl
SQL      pd.read_sql()      Direct from database
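
Because CSV stores everything as text, pandas has to guess column types on load, which is part of why a type-preserving format like Parquet is attractive. The sketch below shows one way that guessing can go wrong and how to override it. The inline data and the column names (order_id, placed_at, zip_code) are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = (
    "order_id,placed_at,zip_code\n"
    "1,2024-01-05,02134\n"
    "2,2024-01-06,10001\n"
)

# Default load: zip_code is inferred as an integer, so the leading zero
# is lost, and placed_at is left as plain text.
naive = pd.read_csv(io.StringIO(csv_text))

# Explicit types: keep zip_code as a string and parse placed_at as a date.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"zip_code": "string"},
    parse_dates=["placed_at"],
)

print(naive.dtypes)
print(typed.dtypes)
```

Comparing the two dtype listings is a quick way to spot columns that need an explicit dtype or parse_dates entry before any analysis starts.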

Questions to ask about any dataset

  • How many samples? How many features?
  • What are the column types? (numeric, categorical, text, datetime)
  • How much is missing? Is it random or systematic?
  • What is the target variable? Is it balanced?
  • Are there duplicates?
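
Most of these questions can be answered with a few lines of pandas. The following is a minimal sketch on a toy DataFrame; the data is invented, and the name "label" for the target column is an assumption, so substitute your own.

```python
import pandas as pd

# Hypothetical toy data; "label" is assumed to be the target column.
df = pd.DataFrame({
    "age": [25, 31, None, 31, 47],
    "city": ["Oslo", "Lima", "Lima", "Lima", None],
    "label": [0, 1, 0, 1, 0],
})

# How many samples? How many columns (features plus target)?
n_samples, n_columns = df.shape

# What are the column types?
print(df.dtypes)

# How much is missing? Fraction of missing values per column.
missing_share = df.isnull().mean()

# Is it random or systematic? One rough check: compare the missing
# rate of a column across the target classes.
age_missing_by_label = df["age"].isnull().groupby(df["label"]).mean()

# Is the target balanced? Share of rows per class.
class_balance = df["label"].value_counts(normalize=True)

# Are there duplicates? Count rows identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

print(n_samples, n_columns, n_duplicates)
print(missing_share)
print(age_missing_by_label)
print(class_balance)
```

A missing rate that differs sharply between classes hints that the gaps are systematic rather than random, though on real data this should be confirmed with more than a single groupby.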