Machine learning has become an integral part of modern technology, and Python is one of the most popular languages for developing machine learning models. The success of Python in machine learning is largely due to its rich ecosystem of powerful libraries that simplify data preprocessing, model building, and evaluation. In this article, we will explore some of the most essential Python libraries for machine learning.
### 1. NumPy
**Purpose:** Numerical computing
NumPy (Numerical Python) is the backbone of machine learning computations. It provides powerful support for handling large arrays and matrices efficiently. Many other ML libraries, including TensorFlow and scikit-learn, rely on NumPy for numerical operations.
**Installation:**
```bash
pip install numpy
```
#### Array Creation
NumPy offers multiple ways to create arrays, which are the fundamental data structure in NumPy:
```python
import numpy as np
# Basic array from a list
arr = np.array([1, 2, 3, 4])
print(arr) # Output: [1 2 3 4]
# Utility functions for array creation
zeros = np.zeros((3, 4)) # Create 3x4 array of zeros
ones = np.ones((2, 2)) # Create 2x2 array of ones
evens = np.arange(0, 10, 2) # Create array [0, 2, 4, 6, 8]
lin = np.linspace(0, 1, 5) # Create 5 evenly spaced points between 0 and 1
```
#### Array Reshaping and Manipulation
Changing array dimensions and structure is essential for preparing data for ML algorithms:
```python
# Reshaping arrays
arr = np.arange(12)
reshaped = arr.reshape(3, 4) # Reshape to 3x4 matrix
print(reshaped)
# Output:
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
# Transposing arrays
transposed = reshaped.T # Transpose rows and columns
print(transposed)
# Output:
# [[ 0 4 8]
# [ 1 5 9]
# [ 2 6 10]
# [ 3 7 11]]
# Flattening arrays
flattened = reshaped.flatten() # Convert back to 1D array
print(flattened) # Output: [ 0 1 2 3 4 5 6 7 8 9 10 11]
```
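One reshaping idiom worth knowing: passing `-1` lets NumPy infer a dimension from the array's total size. This is handy in ML code, for example when turning a 1D feature array into the 2D column that scikit-learn estimators expect. A minimal sketch:
```python
# -1 tells NumPy to infer that dimension from the array's size
vec = np.arange(6)
col = vec.reshape(-1, 1) # Column vector, shape (6, 1)
row = vec.reshape(1, -1) # Row vector, shape (1, 6)
print(col.shape, row.shape) # Output: (6, 1) (1, 6)
```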
#### Indexing and Slicing
NumPy provides powerful ways to access and modify specific elements or subsets of arrays:
```python
# Array indexing and slicing
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Single element access
print(arr_2d[0, 1]) # Access element at row 0, column 1: 2
# Slicing
print(arr_2d[:2, 1:]) # First 2 rows, columns 1 onwards
# Output:
# [[2 3]
# [5 6]]
# Boolean indexing
mask = arr_2d > 5
print(mask) # Boolean mask where elements > 5
# Output:
# [[False False False]
# [False False True]
# [ True True True]]
print(arr_2d[mask]) # Select elements where mask is True
# Output: [6 7 8 9]
```
#### Broadcasting
Broadcasting lets NumPy operate on arrays of different shapes by virtually stretching dimensions of size 1 (or missing dimensions) to match, without copying data:
```python
# Broadcasting (operating on arrays of different shapes)
a = np.array([1, 2, 3]) # Shape: (3,)
b = np.array([[1], [2], [3]]) # Shape: (3, 1)
print(a + b) # NumPy broadcasts to make shapes compatible
# Output:
# [[2 3 4]
# [3 4 5]
# [4 5 6]]
# Broadcasting with scalars
print(a * 2) # Multiply each element by 2: [2 4 6]
```
#### Mathematical Operations
NumPy provides vectorized mathematical functions that operate element-wise on arrays:
```python
arr = np.array([1, 2, 3, 4])
# Basic arithmetic
print(arr + 2) # Addition: [3 4 5 6]
print(arr * 3) # Multiplication: [3 6 9 12]
print(arr ** 2) # Exponentiation: [1 4 9 16]
# Mathematical functions
print(np.sqrt(arr)) # Element-wise square root: [1. 1.41421356 1.73205081 2.]
print(np.exp(arr)) # Element-wise exponential
print(np.log(arr)) # Element-wise natural logarithm
print(np.sin(arr)) # Element-wise sine function
```
#### Statistical Operations
NumPy includes functions for statistical analysis, essential for understanding data distributions:
```python
arr = np.array([[1, 2], [3, 4]])
# Basic statistics
print(np.mean(arr)) # Mean of all elements: 2.5
print(np.mean(arr, axis=0)) # Mean of each column: [2. 3.]
print(np.mean(arr, axis=1)) # Mean of each row: [1.5 3.5]
print(np.std(arr)) # Standard deviation
print(np.min(arr)) # Minimum value: 1
print(np.max(arr)) # Maximum value: 4
print(np.median(arr)) # Median value: 2.5
print(np.percentile(arr, 75)) # 75th percentile: 3.25
```
#### Linear Algebra
NumPy provides functions for linear algebra operations, crucial for many ML algorithms:
```python
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# Matrix operations
print(np.dot(a, b)) # Matrix multiplication
# Output:
# [[19 22]
# [43 50]]
print(a @ b) # Alternative matrix multiplication syntax
# Advanced linear algebra
print(np.linalg.inv(a)) # Matrix inverse
print(np.linalg.det(a)) # Matrix determinant: -2.0
eigenvalues, eigenvectors = np.linalg.eig(a) # Eigenvalues and eigenvectors
print(eigenvalues)
print(np.linalg.norm(a)) # Matrix norm
```
#### Random Number Generation
NumPy's random module is useful for simulations, data augmentation, and initializing ML models:
```python
# Random number generation
random_arr = np.random.rand(3, 3) # 3x3 array of random values between 0 and 1
normal_arr = np.random.normal(0, 1, (3, 3)) # 3x3 array from normal distribution
random_ints = np.random.randint(0, 10, (3, 3)) # 3x3 array of random integers 0-9
# Setting a seed for reproducibility
np.random.seed(42)
print(np.random.rand(3)) # This will always generate the same random numbers
```
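Note that the global `np.random.*` functions above are the legacy interface; NumPy now recommends creating an explicit `Generator` via `np.random.default_rng`. A minimal sketch of the equivalent calls:
```python
# Modern Generator API (NumPy >= 1.17), preferred over the legacy
# global np.random.* functions shown above
rng = np.random.default_rng(42) # Seeded generator for reproducibility
print(rng.random(3)) # 3 uniform values in [0, 1)
print(rng.normal(0, 1, (3, 3))) # 3x3 array from a standard normal distribution
print(rng.integers(0, 10, (3, 3))) # 3x3 array of random integers 0-9
```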
#### Learning Resources for NumPy
##### Video Tutorials
1. **Corey Schafer's NumPy Tutorial**
- [NumPy Tutorial](https://www.youtube.com/watch?v=GB9ByFAIAH4) - Comprehensive introduction to NumPy arrays and operations with clear explanations and practical examples.
2. **freeCodeCamp's NumPy Course**
- [NumPy Full Course](https://www.youtube.com/watch?v=QUT1VHiLmmI) - In-depth 4-hour course covering all aspects of NumPy from basics to advanced operations.
##### Online Courses
1. **DataCamp: Introduction to NumPy**
- [Course Link](https://www.datacamp.com/courses/introduction-to-numpy) - Interactive course with exercises focused on NumPy fundamentals.
2. **Udemy: NumPy for Data Science**
- [Course Link](https://www.udemy.com/course/numpy-for-data-science/) - Practical course focusing on NumPy for data manipulation and analysis.
##### Books
1. **Guide to NumPy** by Travis E. Oliphant
- Comprehensive reference written by the creator of NumPy, covering all aspects of the library.
2. **Python for Data Analysis** by Wes McKinney
- Chapters 4 and 5 provide excellent coverage of NumPy with practical examples.
### 2. Pandas
**Purpose:** Data manipulation and analysis
Pandas is essential for data handling and preprocessing in machine learning. It provides data structures like DataFrames, making it easier to work with structured datasets.
**Installation:**
```bash
pip install pandas
```
#### DataFrame Creation
DataFrames are the primary data structure in Pandas, similar to tables in a database or spreadsheets:
```python
import pandas as pd
import numpy as np
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Paris', 'London', 'Tokyo']}
df = pd.DataFrame(data)
print(df)
# Output:
# Name Age City
# 0 Alice 25 New York
# 1 Bob 30 Paris
# 2 Charlie 35 London
# 3 David 40 Tokyo
# Creating a DataFrame from a NumPy array
array_data = np.random.rand(4, 3)
df_array = pd.DataFrame(array_data, columns=['A', 'B', 'C'])
print(df_array)
# Creating a DataFrame from CSV file
# df_csv = pd.read_csv('data.csv')
```
#### Data Inspection
Examining your data is a crucial first step in any data analysis or ML project:
```python
# Basic DataFrame information
print(df.shape) # Dimensions of the DataFrame: (4, 3)
print(df.dtypes) # Data types of each column
df.info() # Summary of DataFrame including non-null values (prints directly, so no print() needed)
print(df.describe()) # Statistical summary of numerical columns
# Viewing data
print(df.head(2)) # First 2 rows
print(df.tail(2)) # Last 2 rows
print(df.sample(2)) # Random 2 rows
```
#### Data Selection and Indexing
Pandas offers multiple ways to select and access data:
```python
# Column selection
print(df['Name']) # Select a single column (returns Series)
print(df[['Name', 'Age']]) # Select multiple columns (returns DataFrame)
# Row selection by position
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Rows 1 and 2
print(df.iloc[0, 1]) # Value at row 0, column 1 (Alice's age: 25)
print(df.iloc[:2, 1:3]) # First 2 rows, columns 1 and 2
# Row selection by label
df.set_index('Name', inplace=True) # Set 'Name' as index
print(df.loc['Alice']) # Row with index 'Alice'
print(df.loc[['Alice', 'Bob']]) # Multiple rows by index
print(df.loc['Alice', 'Age']) # Value at row 'Alice', column 'Age'
# Boolean indexing
df.reset_index(inplace=True) # Reset index back to default
print(df[df['Age'] > 30]) # Rows where Age > 30
print(df[(df['Age'] > 25) & (df['City'] == 'London')]) # Combined conditions
```
#### Data Cleaning
Real-world data is often messy. Pandas provides tools to clean and prepare data:
```python
# Create a DataFrame with missing values
data_missing = {'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]}
df_missing = pd.DataFrame(data_missing)
print(df_missing)
# Handling missing values
print(df_missing.isnull()) # Boolean mask of missing values
print(df_missing.isnull().sum()) # Count missing values per column
df_filled = df_missing.fillna(0) # Fill missing values with 0
df_dropped = df_missing.dropna() # Drop rows with any missing values
df_dropped_cols = df_missing.dropna(axis=1) # Drop columns with any missing values
# Removing duplicates
df_dup = pd.DataFrame({'A': [1, 1, 2, 3], 'B': [4, 4, 5, 6]})
df_unique = df_dup.drop_duplicates()
print(df_unique)
```
#### Data Transformation
Transforming data is often necessary to prepare it for machine learning:
```python
# Adding and modifying columns
df['Salary'] = [50000, 60000, 75000, 90000] # Add new column
df['Age_Squared'] = df['Age'] ** 2 # Derived column
df['Age'] = df['Age'] + 1 # Modify existing column
# Applying functions
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 35 else 'Senior')
print(df)
# Mapping values
city_map = {'New York': 'US', 'Paris': 'France', 'London': 'UK', 'Tokyo': 'Japan'}
df['Country'] = df['City'].map(city_map)
print(df)
# One-hot encoding categorical variables
df_encoded = pd.get_dummies(df, columns=['City'])
print(df_encoded.head())
```
#### Grouping and Aggregation
Grouping operations are essential for summarizing data:
```python
# Create a DataFrame for grouping examples
data_sales = {
'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
'Sales': [100, 200, 150, 250, 300, 350],
'Quantity': [10, 15, 12, 18, 20, 25]
}
df_sales = pd.DataFrame(data_sales)
print(df_sales)
# Grouping by a single column
grouped = df_sales.groupby('Product')
print(grouped.mean(numeric_only=True)) # Mean of numeric columns by product (Region is non-numeric)
print(grouped['Sales'].sum()) # Sum of sales by product
# Grouping by multiple columns
grouped_multi = df_sales.groupby(['Product', 'Region'])
print(grouped_multi.sum()) # Sum by product and region
# Aggregating with multiple functions
agg_result = df_sales.groupby('Product').agg({
'Sales': ['sum', 'mean', 'max'],
'Quantity': ['sum', 'mean']
})
print(agg_result)
```
#### Merging and Joining
Combining data from different sources is a common operation:
```python
# Create two DataFrames to merge
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [1, 2, 3, 5],
'Salary': [50000, 60000, 75000, 90000]
})
# Inner join (only matching keys)
merged_inner = pd.merge(df1, df2, on='ID', how='inner')
print(merged_inner)
# Left join (all keys from left DataFrame)
merged_left = pd.merge(df1, df2, on='ID', how='left')
print(merged_left)
# Right join (all keys from right DataFrame)
merged_right = pd.merge(df1, df2, on='ID', how='right')
print(merged_right)
# Outer join (all keys from both DataFrames)
merged_outer = pd.merge(df1, df2, on='ID', how='outer')
print(merged_outer)
# Concatenating DataFrames
df3 = pd.DataFrame({
'ID': [6, 7],
'Name': ['Eve', 'Frank']
})
concatenated = pd.concat([df1, df3])
print(concatenated)
```
#### Time Series Operations
Pandas has excellent support for time series data:
```python
# Create a time series DataFrame
dates = pd.date_range('20230101', periods=6)
df_time = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df_time)
# Time-based indexing
print(df_time['2023-01-02':'2023-01-04']) # Slice by date range
# Resampling
print(df_time.resample('2D').mean()) # Resample to every 2 days
print(df_time.resample('M').sum()) # Monthly sum ('M' is deprecated in pandas >= 2.2; use 'ME')
# Shifting data
print(df_time.shift(1)) # Shift values by 1 period
print(df_time['A'].diff()) # Difference between consecutive values
# Rolling statistics
print(df_time.rolling(window=3).mean()) # 3-period rolling average
```
#### Data Export
Saving processed data for later use or sharing:
```python
# Export to various formats
# df.to_csv('output.csv', index=False)
# df.to_excel('output.xlsx', sheet_name='Sheet1')
# df.to_json('output.json')
# df.to_sql('table_name', connection_object)
# Convert to other formats
numpy_array = df.to_numpy() # Convert to NumPy array
dict_data = df.to_dict() # Convert to dictionary
```
#### Learning Resources for Pandas
##### Video Tutorials
1. **Corey Schafer's Pandas Series**
- [Pandas Tutorials](https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_) - Step-by-step tutorials covering all major Pandas functionality with real-world examples.
2. **Data School's Pandas Tutorials**
- [Pandas Best Practices](https://www.youtube.com/watch?v=dPwLlJkSHLo) - Practical tips and techniques for efficient data manipulation with Pandas.
3. **sentdex's Pandas Basics**
- [Pandas Basics](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfefDfXb9Yf0la1fPDKluPF) - Beginner-friendly introduction to Pandas with practical applications.
##### Online Courses
1. **Kaggle: Pandas**
- [Course Link](https://www.kaggle.com/learn/pandas) - Free hands-on course with exercises using real-world datasets.
2. **DataCamp: Data Manipulation with pandas**
- [Course Link](https://www.datacamp.com/courses/data-manipulation-with-pandas) - Interactive course covering data manipulation techniques.
##### Books
1. **Python for Data Analysis** by Wes McKinney
- Written by the creator of Pandas, this book provides comprehensive coverage with practical examples.
2. **Pandas Cookbook** by Theodore Petrou
- Recipe-based approach to solving common data analysis problems with Pandas.
### 3. Matplotlib & Seaborn
**Purpose:** Data visualization
Matplotlib and Seaborn help in visualizing datasets, which is crucial for understanding patterns and trends before applying machine learning models. Effective visualization is essential for exploratory data analysis, feature engineering, and communicating results.
**Installation:**
```bash
pip install matplotlib seaborn
```
#### Basic Plotting with Matplotlib
Matplotlib is the foundation of Python visualization and provides precise control over plots:
```python
import matplotlib.pyplot as plt
import numpy as np
# Simple line plot
x = np.linspace(0, 10, 100) # Create 100 points from 0 to 10
y = np.sin(x) # Calculate sine values
plt.figure(figsize=(8, 4)) # Set figure size (width, height in inches)
plt.plot(x, y, 'b-', label='Sine') # 'b-' means blue line
plt.title('Sine Function')
plt.xlabel('x values')
plt.ylabel('sin(x)')
plt.grid(True) # Add grid
plt.legend() # Show legend
plt.show() # Display the plot
```
#### Multiple Plots and Customization
Creating multiple plots and customizing their appearance:
```python
# Multiple lines on the same plot
x = np.linspace(0, 2*np.pi, 100)
plt.figure(figsize=(10, 6))
plt.plot(x, np.sin(x), 'r-', label='Sine')
plt.plot(x, np.cos(x), 'g--', label='Cosine') # 'g--' means green dashed line
plt.title('Sine and Cosine Functions', fontsize=16)
plt.xlabel('Angle (radians)', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True)
plt.show()
# Subplots (multiple plots in one figure)
plt.figure(figsize=(12, 5))
# First subplot
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st subplot
plt.plot(x, np.sin(x), 'b-')
plt.title('Sine Function')
plt.grid(True)
# Second subplot
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd subplot
plt.plot(x, np.cos(x), 'r-')
plt.title('Cosine Function')
plt.grid(True)
plt.tight_layout() # Adjust spacing between subplots
plt.show()
```
#### Different Plot Types in Matplotlib
Matplotlib supports many types of visualizations:
```python
# Bar chart
categories = ['A', 'B', 'C', 'D', 'E']
values = [25, 40, 30, 55, 15]
plt.figure(figsize=(8, 5))
plt.bar(categories, values, color='skyblue')
plt.title('Bar Chart Example')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
# Scatter plot
x = np.random.rand(50)
y = np.random.rand(50)
colors = np.random.rand(50)
sizes = 1000 * np.random.rand(50)
plt.figure(figsize=(8, 6))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.6) # alpha controls transparency
plt.title('Scatter Plot Example')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.colorbar() # Add color scale
plt.show()
# Histogram
data = np.random.randn(1000) # 1000 points from standard normal distribution
plt.figure(figsize=(8, 5))
plt.hist(data, bins=30, color='purple', alpha=0.7)
plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()
# Pie chart
labels = ['A', 'B', 'C', 'D']
sizes = [15, 30, 45, 10]
explode = (0, 0.1, 0, 0) # Explode the 2nd slice (B)
plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Pie Chart Example')
plt.show()
```
#### Statistical Visualization with Seaborn
Seaborn builds on Matplotlib and provides a higher-level interface for statistical graphics:
```python
import seaborn as sns
import pandas as pd
# Set the aesthetic style
sns.set_theme(style="whitegrid")
# Sample dataset
tips = sns.load_dataset("tips") # Built-in dataset
print(tips.head()) # Preview the data
# Distribution plot
plt.figure(figsize=(8, 6))
sns.histplot(tips['total_bill'], kde=True) # Histogram with density curve
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill ($)')
plt.ylabel('Count')
plt.show()
# Categorical plot - Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Total Bill by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill ($)')
plt.show()
# Categorical plot - Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', hue='sex', data=tips, split=True)
plt.title('Total Bill by Day and Gender')
plt.xlabel('Day')
plt.ylabel('Total Bill ($)')
plt.show()
```
#### Relationship Plots in Seaborn
Visualizing relationships between variables:
```python
# Scatter plot with regression line
plt.figure(figsize=(8, 6))
sns.regplot(x='total_bill', y='tip', data=tips)
plt.title('Tip vs. Total Bill with Regression Line')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.show()
# Pair plot - Scatter plots for all pairs of variables
# (pairplot creates its own figure, so no plt.figure() call is needed)
g = sns.pairplot(tips, hue='sex')
g.fig.suptitle('Pair Plot of Tips Dataset', y=1.02)
plt.show()
# Correlation heatmap
# Compute the correlation matrix over the numeric columns only
# (tips also contains categorical columns, which corr() cannot handle)
corr = tips.select_dtypes(include='number').corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
```
#### Advanced Visualizations
More complex visualizations for specific analysis needs:
```python
# Joint plot - Combines a scatter plot with marginal histograms
# (like pairplot, jointplot creates its own figure)
g = sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')
g.fig.suptitle('Joint Distribution of Total Bill and Tip', y=1.02)
plt.show()
# Facet grid - Multiple plots based on categories
g = sns.FacetGrid(tips, col='time', row='sex', height=4)
g.map_dataframe(sns.scatterplot, x='total_bill', y='tip')
g.add_legend()
plt.show()
# Count plot - Bar plot for categorical data
plt.figure(figsize=(8, 6))
sns.countplot(x='day', hue='sex', data=tips)
plt.title('Count of Meals by Day and Gender')
plt.xlabel('Day')
plt.ylabel('Count')
plt.show()
```
#### Saving Visualizations
Exporting your visualizations for reports or presentations:
```python
# Create a plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='total_bill', y='tip', hue='day', data=tips)
plt.title('Tip vs. Total Bill by Day')
# Save the plot to a file
# plt.savefig('scatter_plot.png', dpi=300, bbox_inches='tight')
# plt.savefig('scatter_plot.pdf', bbox_inches='tight') # PDF format
# plt.savefig('scatter_plot.svg', bbox_inches='tight') # SVG format
plt.show()
```
#### Learning Resources for Matplotlib & Seaborn
##### Video Tutorials
1. **Corey Schafer's Matplotlib Tutorials**
- [Matplotlib Tutorial Series](https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_) - Comprehensive series covering all aspects of Matplotlib with clear explanations.
2. **Sentdex's Data Visualization**
- [Data Visualization with Matplotlib and Python](https://www.youtube.com/playlist?list=PLQVvvaa0QuDfefDfXb9Yf0la1fPDKluPF) - Practical examples of creating various types of visualizations.
3. **Python Programmer's Seaborn Tutorial**
- [Seaborn Tutorial](https://www.youtube.com/watch?v=6GUZXDef2U0) - Comprehensive guide to statistical data visualization with Seaborn.
##### Online Courses
1. **DataCamp: Introduction to Data Visualization with Matplotlib**
- [Course Link](https://www.datacamp.com/courses/introduction-to-data-visualization-with-matplotlib) - Interactive course covering the fundamentals of Matplotlib.
2. **DataCamp: Introduction to Data Visualization with Seaborn**
- [Course Link](https://www.datacamp.com/courses/introduction-to-data-visualization-with-seaborn) - Hands-on course for statistical visualization with Seaborn.
##### Books
1. **Python Data Science Handbook** by Jake VanderPlas
- Chapters 4 and 5 provide excellent coverage of Matplotlib and Seaborn with practical examples.
2. **Interactive Data Visualization with Python** by Abha Belorkar and Sharath Chandra Guntuku
- Comprehensive guide to creating interactive visualizations with Matplotlib, Seaborn, and other libraries.
### 4. Scikit-learn
**Purpose:** Machine learning models and preprocessing
Scikit-learn is the most widely used library for classical machine learning algorithms like regression, classification, clustering, and feature selection. It also includes tools for model evaluation and hyperparameter tuning.
**Installation:**
```bash
pip install scikit-learn
```
#### Data Preparation
Scikit-learn provides tools for preparing your data for machine learning:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
# Loading a sample dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use the same scaler for test data
# Check the effect of scaling
print("Before scaling - First sample:", X_train[0])
print("After scaling - First sample:", X_train_scaled[0])
```
#### Classification Models
Scikit-learn offers various classification algorithms:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)
log_reg_pred = log_reg.predict(X_test_scaled)
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_reg_pred))
# Decision Tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train) # Decision trees don't require scaling
tree_pred = tree.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, tree_pred))
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
# Support Vector Machine
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train_scaled, y_train)
svm_pred = svm.predict(X_test_scaled)
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
# K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
knn_pred = knn.predict(X_test_scaled)
print("KNN Accuracy:", accuracy_score(y_test, knn_pred))
# Detailed evaluation of the best model
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, rf_pred, target_names=iris.target_names))
# Confusion matrix
cm = confusion_matrix(y_test, rf_pred)
print("Confusion Matrix:")
print(cm)
```
#### Regression Models
For predicting continuous values:
```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Load a regression dataset
# (the classic Boston housing dataset was removed in scikit-learn 1.2,
# so we use the California housing dataset instead)
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_pred = lr.predict(X_test_scaled)
print("Linear Regression MSE:", mean_squared_error(y_test, lr_pred))
print("Linear Regression R²:", r2_score(y_test, lr_pred))
# Ridge Regression (L2 regularization)
ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train_scaled, y_train)
ridge_pred = ridge.predict(X_test_scaled)
print("Ridge Regression MSE:", mean_squared_error(y_test, ridge_pred))
print("Ridge Regression R²:", r2_score(y_test, ridge_pred))
# Lasso Regression (L1 regularization)
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_train_scaled, y_train)
lasso_pred = lasso.predict(X_test_scaled)
print("Lasso Regression MSE:", mean_squared_error(y_test, lasso_pred))
print("Lasso Regression R²:", r2_score(y_test, lasso_pred))
# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
rf_reg_pred = rf_reg.predict(X_test)
print("Random Forest MSE:", mean_squared_error(y_test, rf_reg_pred))
print("Random Forest R²:", r2_score(y_test, rf_reg_pred))
```
#### Clustering
Unsupervised learning for finding patterns:
```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Switch back to the Iris data (the regression example above overwrote X and y
# with the housing data) and keep only two features for easy visualization
X, y = load_iris(return_X_y=True)
X_cluster = X[:, :2]
# K-Means clustering (n_init set explicitly; its default changed in recent scikit-learn)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_cluster)
print("K-Means Silhouette Score:", silhouette_score(X_cluster, kmeans_labels))
# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_cluster)
# Only calculate silhouette if there's more than one cluster and no noise points (-1)
if len(set(dbscan_labels)) > 1 and -1 not in dbscan_labels:
print("DBSCAN Silhouette Score:", silhouette_score(X_cluster, dbscan_labels))
# Visualize clusters
plt.figure(figsize=(12, 5))
# K-Means plot
plt.subplot(1, 2, 1)
plt.scatter(X_cluster[:, 0], X_cluster[:, 1], c=kmeans_labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
s=300, c='red', marker='X')
plt.title('K-Means Clustering')
# DBSCAN plot
plt.subplot(1, 2, 2)
plt.scatter(X_cluster[:, 0], X_cluster[:, 1], c=dbscan_labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.tight_layout()
plt.show()
```
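Here `n_clusters=3` conveniently matches the three Iris species, but in practice the number of clusters is usually unknown. A common heuristic is the elbow method: plot the K-Means inertia (within-cluster sum of squared distances) for a range of `k` and look for the bend where improvements level off. A minimal sketch:
```python
# Elbow method: inertia_ is the within-cluster sum of squared distances
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_cluster)
    inertias.append(km.inertia_)
plt.plot(k_values, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing k')
plt.show()
```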
#### Dimensionality Reduction
Reducing features while preserving information:
```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# PCA (Principal Component Analysis)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total explained variance:", sum(pca.explained_variance_ratio_))
# t-SNE (t-Distributed Stochastic Neighbor Embedding)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Visualize the results
plt.figure(figsize=(12, 5))
# PCA plot
plt.subplot(1, 2, 1)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.title('PCA Dimensionality Reduction')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
# t-SNE plot
plt.subplot(1, 2, 2)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.title('t-SNE Dimensionality Reduction')
plt.tight_layout()
plt.show()
```
#### Model Evaluation and Cross-Validation
Techniques to assess model performance:
```python
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
# Cross-validation
model = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5) # 5-fold cross-validation
print("Cross-validation scores:", cv_scores)
print("Mean CV score:", cv_scores.mean())
print("Standard deviation:", cv_scores.std())
# Hyperparameter tuning with GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
return_train_score=True
)
grid_search.fit(X, y)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
# Get the best model
best_model = grid_search.best_estimator_
# For binary classification, we can plot ROC curve
# Let's create a binary classification problem from Iris
binary_y = (y == 0).astype(int) # 1 if class is 0, 0 otherwise
X_train, X_test, y_train, y_test = train_test_split(
X, binary_y, test_size=0.2, random_state=42)
# Train a model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Get predicted probabilities
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
```
#### Feature Selection
Identifying the most important features:
```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE
# Select K best features based on ANOVA F-value
selector = SelectKBest(f_classif, k=2)
X_new = selector.fit_transform(X, y)
print("Selected feature indices:", np.where(selector.get_support())[0])
print("F-scores:", selector.scores_)
# Recursive Feature Elimination
model = RandomForestClassifier(random_state=42)
rfe = RFE(estimator=model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
print("RFE selected features:", np.where(rfe.support_)[0])
print("RFE feature ranking:", rfe.ranking_)
# Feature importance from tree-based models
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
print("Feature ranking:")
for f in range(X.shape[1]):
print(f"{f+1}. Feature {indices[f]} ({importances[indices[f]]:.4f})")
```
#### Pipeline
Combining preprocessing and modeling steps:
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Create a sample dataset with mixed types and missing feature values
# (the target, income, must be complete: estimators cannot fit on NaN targets)
data = {
    'age': [25, 30, np.nan, 40, 50],
    'income': [50000, 60000, 70000, 90000, 120000],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'education': ['HS', 'Bachelor', 'Master', 'PhD', 'Bachelor']
}
df = pd.DataFrame(data)
X = df.drop('income', axis=1)
y = df['income']
# Define preprocessing for numerical and categorical features
numerical_features = ['age']
categorical_features = ['gender', 'education']
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
])
# Create the full pipeline with preprocessing and model
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model', RandomForestRegressor(random_state=42))
])
# Fit the pipeline
full_pipeline.fit(X, y)
# Make predictions
predictions = full_pipeline.predict(X)
print("Predictions:", predictions)
```
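A major benefit of wrapping preprocessing and the model in a single pipeline is leakage-free evaluation: when the pipeline is cross-validated, the imputers, scaler, and encoder are re-fit on each training fold only. A minimal sketch reusing `full_pipeline` (the five-row toy dataset makes the scores meaningless, so treat this purely as a usage pattern):
```python
from sklearn.model_selection import cross_val_score
# Preprocessing is re-fit inside every fold, so statistics such as the
# imputation mean never leak from validation rows into training
scores = cross_val_score(full_pipeline, X, y, cv=2, scoring='r2')
print("Cross-validated R² scores:", scores)
```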
#### Learning Resources for Scikit-learn
##### Video Tutorials
1. **sentdex's Machine Learning with Python**
- [Machine Learning with Python](https://www.youtube.com/playlist?list=PLQVvvaa0QuDd0flgGphKCej-9jp-QdzZ3) - Practical tutorials on implementing various ML algorithms with scikit-learn.
2. **Data School's scikit-learn Tutorials**
- [Machine Learning in Python with scikit-learn](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A) - Step-by-step tutorials covering the entire ML workflow.
3. **Krish Naik's Machine Learning Playlist**
- [Machine Learning Tutorial Python](https://www.youtube.com/playlist?list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe) - Comprehensive series covering ML concepts and implementation with scikit-learn.
##### Online Courses
1. **Coursera: Applied Machine Learning in Python**
- [Course Link](https://www.coursera.org/learn/python-machine-learning) - University of Michigan course focusing on scikit-learn for applied ML.
2. **DataCamp: Supervised Learning with scikit-learn**
- [Course Link](https://www.datacamp.com/courses/supervised-learning-with-scikit-learn) - Interactive course covering classification, regression, and model evaluation.
3. **edX: Machine Learning with Python: from Linear Models to Deep Learning**
- [Course Link](https://www.edx.org/course/machine-learning-with-python-from-linear-models-to) - MIT course covering ML fundamentals with Python implementation.
##### Books
1. **Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow** by Aurélien Géron
- Comprehensive guide to ML with practical examples using scikit-learn (Part 1 of the book).
2. **Introduction to Machine Learning with Python** by Andreas C. Müller & Sarah Guido
- Focused specifically on scikit-learn with practical examples and explanations.
3. **Python Machine Learning** by Sebastian Raschka & Vahid Mirjalili
- Comprehensive coverage of ML concepts and implementation with scikit-learn.
### 5. TensorFlow & Keras
**Purpose:** Deep learning
TensorFlow is a powerful framework for deep learning, and Keras (bundled with TensorFlow as `tf.keras`) provides a high-level API that simplifies building neural networks.
**Installation:**
```bash
pip install tensorflow
```
**Example Usage:**
```python
import tensorflow as tf
# Creating a simple neural network
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)), # Declare the input shape (4 features here)
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
])
# Compiling the model
model.compile(optimizer='adam', loss='mse')
```
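The snippet above only defines and compiles the model; a hedged sketch of the remaining steps on synthetic data follows (the shapes, epochs, and batch size are illustrative assumptions, not part of the original example):
```python
import numpy as np
# Synthetic regression data: 100 samples with 4 features, matching the
# Input(shape=(4,)) declared above (all values here are illustrative)
X = np.random.rand(100, 4).astype('float32')
y = (X.sum(axis=1, keepdims=True) + 0.1 * np.random.rand(100, 1)).astype('float32')
model.fit(X, y, epochs=10, batch_size=16, verbose=0) # Train the model
print("Training MSE:", model.evaluate(X, y, verbose=0))
print("Predictions:", model.predict(X[:3], verbose=0)) # Predict on 3 samples
```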
### 6. PyTorch
**Purpose:** Deep learning and research
PyTorch is another deep learning library known for its flexibility and ease of use, making it a favorite among researchers.
**Installation:**
```bash
pip install torch
```
**Example Usage:**
```python
import torch
# Creating a simple tensor
x = torch.tensor([1.0, 2.0, 3.0])
print(x * 2) # Element-wise multiplication
```
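What sets PyTorch apart is its dynamic autograd engine: tensors created with `requires_grad=True` record the operations applied to them so gradients can be computed automatically. A minimal sketch:
```python
# Autograd: operations on x are recorded so gradients can be computed
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum() # y = 1 + 4 + 9 = 14
y.backward()       # Compute dy/dx via backpropagation
print(x.grad)      # dy/dx = 2x -> tensor([2., 4., 6.])
```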
### 7. XGBoost & LightGBM
**Purpose:** Boosted decision trees for high-performance machine learning
XGBoost and LightGBM are widely used for structured data tasks in competitions and real-world applications due to their speed and accuracy.
**Installation:**
```bash
pip install xgboost lightgbm
```
**Example Usage:**
```python
import xgboost as xgb
# Creating a simple XGBoost model
model = xgb.XGBRegressor(objective='reg:squarederror')
```
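Both libraries follow the scikit-learn estimator API (`fit`/`predict`), so they drop into the workflows shown earlier. A hedged sketch on synthetic data (the dataset and hyperparameters are illustrative):
```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Synthetic regression data (illustrative only)
rng = np.random.default_rng(42)
X = rng.random((500, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# XGBoost regressor
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)
xgb_model.fit(X_train, y_train)
print("XGBoost MSE:", mean_squared_error(y_test, xgb_model.predict(X_test)))
# LightGBM regressor (verbose=-1 silences per-iteration logging)
lgb_model = lgb.LGBMRegressor(n_estimators=100, verbose=-1)
lgb_model.fit(X_train, y_train)
print("LightGBM MSE:", mean_squared_error(y_test, lgb_model.predict(X_test)))
```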
These Python libraries form the foundation of machine learning development. Whether you are handling data, training models, or deploying solutions, mastering these libraries will significantly enhance your machine learning workflow. Each library serves a specific purpose, and together, they provide a comprehensive toolkit for both beginners and experts in the field of machine learning.