NexaXOE is a comprehensive framework for applying machine learning (ML) to scientific problems, focusing on efficient compute resource utilization, memory management, and structured code organization. This document details the compute stack, a material science project for lithium battery materials, and the Nexa_Inference application for scalable predictions.
## Navigation and Structure
The Nexa ecosystem is organized hierarchically for intuitive access and ease of use. Resources are structured from the meta-node (NexaXOE) to specialized nodes and project-specific areas, ensuring seamless exploration of scientific ML applications.
### Navigation Tips
1. **Start Here**: Use NexaXOE.md as the entry point to access core nodes (Physics, Biology, Material Science, Learning) via the links below.
2. **Explore Nodes**: Navigate to node files (e.g., [Physics_Node.md]) for lists of projects, datasets, and documentation specific to each domain.
3. **Dive into Projects**: Follow project links within nodes (e.g., [Quantum_State_Tomography.md]) to access detailed papers, code, and datasets.
### How the Structure Works
- **Meta-Node (NexaXOE)**: This file serves as the central hub, linking to all nodes and providing an overview of the ecosystem.
- **Nodes**: Specialized hubs (Physics, Biology, Material Science, Learning) organize projects and resources by domain, each with its own Markdown file (e.g., [Learning_Node.md]).
- **Projects**: Individual project files (e.g., [Battery_Material_Prediction.md]) contain detailed summaries, links to code/datasets (e.g., Kaggle, GitHub), and connections back to nodes.
### Organization and Ease of Use
Resources are organized to minimize friction: nodes act as navigation hubs, projects link to practical resources (e.g., Kaggle Material Science Dataset), and internal placeholders (e.g., [[Project_Name]]) indicate future content. Cross-node links and a consistent Markdown format ensure users can quickly find relevant materials.
## Core Nodes
- **[Physics Node]**
- Focus: Astrophysics, quantum mechanics, high-energy physics, and computational fluid dynamics.
- Link: [Physics_Node.md]
- **[Biology Node]**
- Focus: Protein structure prediction, bioinformatics, and biological system modeling.
- Link: [Biology_Node.md]
- **[Material Science Node]**
- Focus: Material property prediction, battery materials, and crystal structure analysis.
- Link: [Material_Science_Node.md]
- **[Learning Node]**
- Focus: Machine learning education, covering general ML, deep learning, attention mechanisms, and more.
- Link: [Learning_Node.md]
## Ecosystem Overview
- **Nexa Infrastructure**: Graph compilers, tensor allocators, and HPC toolchains for optimized compute.
- **Nexa R&D**: Experimental research with novel architectures and optimizers.
- **Nexa Data Studio**: Purpose-built datasets and preprocessing pipelines.
- **Nexa MOE**: Mixture-of-Experts models for scientific inference and hypothesis generation.
## Compute Stack Configuration
### CPU-GPU Split and Memory Management
- **Purpose**: Optimize resource usage by assigning the CPU to data preprocessing and tensor generation and the GPU to model weight computation.
- **Benefits**:
- Reduces VRAM bottlenecks and enhances CPU/GPU efficiency.
- Example: Data transfer time reduced from 5–10 minutes to 1–2 minutes.
- Achieved CPU overclocking to 5 GHz on Intel Core i5 vPro 8th Gen (base 1.9 GHz).
- **Techniques**:
- Multithreading and Just-In-Time (JIT) compiling for faster data processing.
- Efficient data pipelines for rapid iteration (a minimal sketch of such a pipeline follows below).
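A minimal sketch of this CPU-GPU split, assuming a standard PyTorch pipeline (the toy dataset and linear model below are illustrative, not part of the Nexa codebase): CPU worker processes build and pin batches in host memory while the GPU consumes them asynchronously.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative dataset: 10k samples with 6 features (stands in for the real pipeline).
X = torch.randn(10_000, 6)
y = torch.randn(10_000, 1)
dataset = TensorDataset(X, y)

# CPU side: worker processes prepare and pin batches in host memory.
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=4, pin_memory=True)

model = nn.Linear(6, 1).to(device)

for xb, yb in loader:
    # GPU side: non_blocking=True overlaps the host-to-device copy with compute.
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    pred = model(xb)  # weight computation stays on the GPU
```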
### Garbage Collection
- **Purpose**: Clear tensors to prevent memory build-up, improving pipeline flexibility and iteration speed.
- **Benefits**:
- Reduces crashes during hyperparameter tuning.
- Minimizes debugging time, enabling focus on high-level problem-solving.
- **Implementation**:
```python
import gc
import torch

gc.collect()              # release unreachable Python objects still holding tensor references
torch.cuda.empty_cache()  # free cached, unused GPU memory back to the driver
```
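For context, a minimal sketch of where these calls typically sit in a training loop (the loop and model here are illustrative, not the project's actual trainer):
```python
import gc
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device):
    model.train()
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = F.mse_loss(model(xb), yb)
        loss.backward()
        optimizer.step()
        # Drop references to large tensors before the next batch.
        del xb, yb, loss
    # Reclaim Python objects and cached GPU blocks between epochs.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```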
### CUDA Utilization
- **Description**: Leverages NVIDIA CUDA for GPU-accelerated computations.
- **Hardware**: Utilizes Kaggle T4 for model training, with plans for a custom cluster.
### Code Organization
- **Standard**: Place all imports at the script's top for clarity and structured pipeline flow.
- **Example Setup**:
```python
import os
import torch
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the offending call (debugging aid; slows execution)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```
## Material Science Project: Lithium Battery Materials
### Data Preprocessing
- **Purpose**: Load and preprocess lithium battery material data in chunks to manage memory efficiently.
- **Implementation**:
```python
import pandas as pd
import re
import ast
from sklearn.preprocessing import StandardScaler
import gc

def load_and_preprocess_data(csv_path, chunk_size=1000):
    """Load and preprocess data in chunks to manage memory."""
    features = ['formation_energy_per_atom', 'energy_per_atom', 'density', 'volume', 'n_elements', 'li_fraction']
    target = 'band_gap'
    filtered_chunks = []

    def parse_formula(formula):
        # Extract (element, count) pairs; a missing count means one atom.
        elements = re.findall(r'([A-Z][a-z]?)(\d*)', formula)
        total_atoms = sum(int(n) if n else 1 for _, n in elements)
        li_count = next((int(n) if n else 1 for e, n in elements if e == 'Li'), 0)
        return li_count / total_atoms if total_atoms > 0 else 0

    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        # Parse the stringified element lists safely instead of using eval().
        chunk['elements'] = chunk['elements'].apply(ast.literal_eval)
        chunk['contains_li'] = chunk['elements'].apply(lambda x: 'Li' in x)
        chunk['li_fraction'] = chunk['formula_pretty'].apply(parse_formula)
        filtered_chunk = chunk[
            (chunk['contains_li']) &
            (chunk['is_semiconductor']) &
            (chunk['band_gap'].between(1, 3)) &
            (chunk['formation_energy_per_atom'] < 0)
        ].copy()
        if not filtered_chunk.empty:
            filtered_chunks.append(filtered_chunk[features + [target, 'elements']])
        del chunk
        gc.collect()

    if not filtered_chunks:
        raise ValueError("Filtered dataset is empty.")
    filtered_df = pd.concat(filtered_chunks, ignore_index=True)
    X = filtered_df[features].fillna(filtered_df[features].mean()).values
    y = filtered_df[target].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return filtered_df, X_scaled, y, scaler, features
```
### Graph Neural Network (GNN) Model
- **Purpose**: Predict material properties using a GNN architecture.
- **Implementation**:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_prob=0.5):
        super(GNN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.conv1(x, edge_index))
        x = self.dropout(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.dropout(x)
        x = global_mean_pool(x, batch)
        return self.fc(x)
```
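As a quick sanity check, a toy forward pass through this `GNN` (the random graphs below are illustrative; they follow the fully connected convention used by `gnn_predict` later in this document):
```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader as GeometricDataLoader

# Two toy graphs with 1-dimensional node features, matching input_dim=1.
graphs = []
for num_nodes in (4, 6):
    x = torch.randn(num_nodes, 1)
    # Fully connected edges in both directions.
    edge_index = torch.tensor(
        [[i, j] for i in range(num_nodes) for j in range(num_nodes) if i != j],
        dtype=torch.long).t().contiguous()
    graphs.append(Data(x=x, edge_index=edge_index, y=torch.randn(1)))

loader = GeometricDataLoader(graphs, batch_size=2)
model = GNN(input_dim=1, hidden_dim=64, output_dim=1)
for batch in loader:
    out = model(batch)   # shape: [num_graphs_in_batch, 1]
    print(out.shape)
```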
### Training and Evaluation
- **Purpose**: Train and evaluate the GNN model, including uncertainty quantification.
- **Implementation**:
```python
import gc
import torch
import torch.nn.functional as F
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import multiprocessing

def evaluate_gnn(model, loader_gen, device):
    model.eval()
    total_loss = 0
    predictions, true_values = [], []
    batches = 0
    with torch.no_grad():
        for loader in loader_gen:
            for batch in loader:
                batch = batch.to(device)
                out = model(batch).squeeze()
                loss = F.mse_loss(out, batch.y)
                total_loss += loss.item()
                predictions.extend(out.cpu().numpy())
                true_values.extend(batch.y.cpu().numpy())
                batches += 1
                del batch, out, loss
    return total_loss / batches if batches > 0 else 0, predictions, true_values

def predict_with_uncertainty(model, loader_gen, device, num_samples=20):
    # Keep dropout active (Monte Carlo dropout) so repeated forward passes
    # yield a distribution of predictions per sample.
    model.train()
    predictions = []
    with ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
        futures = []
        for loader in loader_gen:
            for batch in loader:
                batch = batch.to(device)
                futures.append(executor.submit(
                    lambda b: [model(b).squeeze().detach().cpu().numpy() for _ in range(num_samples)],
                    batch))
        for future in futures:
            preds = np.stack(future.result(), axis=0)
            mean_preds = np.mean(preds, axis=0)
            std_preds = np.std(preds, axis=0)
            predictions.extend(zip(mean_preds, std_preds))
            del preds
            gc.collect()
    return predictions
```
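The main loop in the next section also calls a `train_gnn` helper that is not reproduced in this document. A minimal sketch consistent with `evaluate_gnn` above, offered as an assumption rather than the project's exact implementation:
```python
import torch.nn.functional as F

def train_gnn(model, loader_gen, optimizer, device):
    """One training epoch over a generator of graph loaders (sketch)."""
    model.train()
    total_loss, batches = 0.0, 0
    for loader in loader_gen:
        for batch in loader:
            batch = batch.to(device)
            optimizer.zero_grad()
            loss = F.mse_loss(model(batch).squeeze(), batch.y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            batches += 1
    return total_loss / batches if batches > 0 else 0
```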
### Main Training Loop
- **Purpose**: Orchestrate the training of GNN and Variational Autoencoder (VAE) models.
- **Implementation**:
```python
import gc
import torch
import torch.optim as optim
import torch.nn.functional as F
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
from torch_geometric.loader import DataLoader as GeometricDataLoader

if __name__ == "__main__":
    csv_path = "/kaggle/input/material-science/lithium_battery_materials.csv"
    filtered_df, X_scaled, y, scaler, features = load_and_preprocess_data(csv_path, chunk_size=1000)
    X_train, X_temp, y_train, y_temp = train_test_split(X_scaled, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

    input_dim = 1
    hidden_dim = 64
    output_dim = 1
    vae_input_dim = X_scaled.shape[1]
    latent_dim = 32
    batch_size = 32

    gnn = GNN(input_dim, hidden_dim, output_dim).to(device)
    vae = VAE(input_dim=vae_input_dim, latent_dim=latent_dim).to(device)

    def vae_loss(recon_x, x, mu, logvar):
        recon_loss = F.mse_loss(recon_x, x, reduction='sum')
        kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl_div

    train_dataset = TensorDataset(torch.tensor(X_train, dtype=torch.float32))
    val_dataset = TensorDataset(torch.tensor(X_val, dtype=torch.float32))
    train_loader_vae = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader_vae = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    train_loader_gen = create_graph_data(filtered_df.iloc[:len(X_train)], X_train, y_train, batch_size=batch_size)
    val_loader_gen = create_graph_data(filtered_df.iloc[len(X_train):len(X_train) + len(X_val)], X_val, y_val, batch_size=batch_size)
    test_loader_gen = create_graph_data(filtered_df.iloc[len(X_train) + len(X_val):], X_test, y_test, batch_size=batch_size)

    optimizer_gnn = optim.Adam(gnn.parameters(), lr=0.001)
    scheduler_gnn = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer_gnn, T_max=50)
    optimizer_vae = optim.Adam(vae.parameters(), lr=0.001)
    scheduler_vae = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer_vae, T_max=50)
    early_stopping_gnn = EarlyStopping(patience=10)
    early_stopping_vae = EarlyStopping(patience=10)

    gnn_train_losses, gnn_val_losses = [], []
    vae_train_losses, vae_val_losses = [], []
    best_gnn_val_loss = float('inf')
    best_vae_val_loss = float('inf')
    max_epochs = 50

    for epoch in range(max_epochs):
        gnn_train_loss = train_gnn(gnn, train_loader_gen, optimizer_gnn, device)
        gnn_val_loss, _, _ = evaluate_gnn(gnn, val_loader_gen, device)
        gnn_train_losses.append(gnn_train_loss)
        gnn_val_losses.append(gnn_val_loss)
        scheduler_gnn.step()

        vae_train_loss = train_vae(vae, train_loader_vae, optimizer_vae, device)
        vae_val_loss = evaluate_vae(vae, val_loader_vae, device)
        vae_train_losses.append(vae_train_loss)
        vae_val_losses.append(vae_val_loss)
        scheduler_vae.step()

        early_stopping_gnn(gnn_val_loss)
        early_stopping_vae(vae_val_loss)

        # Checkpoint only when validation loss improves, so the saved files really are the best models.
        if gnn_val_loss < best_gnn_val_loss:
            best_gnn_val_loss = gnn_val_loss
            torch.save(gnn.state_dict(), 'best_gnn_model.pt')
        if vae_val_loss < best_vae_val_loss:
            best_vae_val_loss = vae_val_loss
            torch.save(vae.state_dict(), 'best_vae_model.pt')

        if epoch % 10 == 0:
            print(f"Epoch {epoch}:")
            print(f"  GNN - Train Loss: {gnn_train_loss:.4f}, Val Loss: {gnn_val_loss:.4f}")
            print(f"  VAE - Train Loss: {vae_train_loss:.4f}, Val Loss: {vae_val_loss:.4f}")
        if early_stopping_gnn.early_stop and early_stopping_vae.early_stop:
            print(f"Early stopping at epoch {epoch} for both GNN and VAE")
            break
        gc.collect()

    gnn.load_state_dict(torch.load('best_gnn_model.pt', weights_only=True))
    vae.load_state_dict(torch.load('best_vae_model.pt', weights_only=True))

    # Sample the VAE latent space to generate candidate materials, then score them with the GNN.
    num_samples = 50
    z = torch.randn(num_samples, latent_dim).to(device)
    with torch.no_grad():
        generated = vae.decoder(z).cpu().numpy()
    generated = scaler.inverse_transform(generated)
    new_df = pd.DataFrame(generated, columns=features)
    new_df['predicted_band_gap'] = 0.0
    new_df['confidence_score'] = 0.0
    generated_loader = create_graph_data(new_df, generated, new_df['predicted_band_gap'].values, batch_size=batch_size)
    generated_predictions = predict_with_uncertainty(gnn, [generated_loader], device)
    gen_means, gen_confidences = zip(*generated_predictions)
```
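The loop above also depends on `EarlyStopping`, `train_vae`, `evaluate_vae`, `create_graph_data`, and a `VAE` class defined elsewhere in the project. A minimal `EarlyStopping` consistent with how it is called above, offered as an assumption rather than the project's exact implementation:
```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs (sketch)."""
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.early_stop = False

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
```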
### Visualization and Analysis
- **Purpose**: Visualize training progress and feature importance for model insights.
- **Implementation**:
```python
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import shap
import numpy as np
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader as GeometricDataLoader

# Project the generated samples into 2D for inspection.
pca = PCA(n_components=2)
latent_pca = pca.fit_transform(generated)
tsne = TSNE(n_components=2, random_state=42)
latent_tsne = tsne.fit_transform(generated)

def gnn_predict(features):
    """Wrap the GNN as a plain array-in/array-out function for SHAP."""
    if len(features.shape) == 1:
        features = features.reshape(1, -1)
    num_graphs = features.shape[0]
    data_list = []
    for i in range(num_graphs):
        x = torch.tensor(features[i], dtype=torch.float32).to(device).reshape(-1, 1)
        num_nodes = x.shape[0]
        # Fully connected graph over the feature nodes.
        edge_index = torch.tensor([[i, j] for i in range(num_nodes) for j in range(num_nodes) if i != j],
                                  dtype=torch.long).t().contiguous().to(device)
        data = Data(x=x, edge_index=edge_index)
        data_list.append(data)
    loader = GeometricDataLoader(data_list, batch_size=1, shuffle=False)
    preds = []
    with torch.no_grad():
        for batch in loader:
            batch = batch.to(device)
            out = gnn(batch).cpu().numpy()
            preds.append(out)
    return np.concatenate(preds, axis=0)

# Note: a summarized background (e.g. shap.kmeans) is usually much faster for large datasets.
background = X_scaled
explainer = shap.KernelExplainer(gnn_predict, background)
shap_values = explainer.shap_values(X_scaled[:5], nsamples=50)

fig1, axs1 = plt.subplots(1, 2, figsize=(20, 8))
axs1[0].plot(gnn_train_losses, label='Train Loss', color='blue')
axs1[0].plot(gnn_val_losses, label='Val Loss', color='orange')
axs1[0].set_title('GNN Training and Validation Loss', fontsize=14)
axs1[0].set_xlabel('Epoch', fontsize=12)
axs1[0].set_ylabel('Loss', fontsize=12)
axs1[0].legend(fontsize=10)
axs1[0].grid(True, linestyle='--', alpha=0.5)
axs1[1].plot(vae_train_losses, label='Train Loss', color='blue')
axs1[1].plot(vae_val_losses, label='Val Loss', color='orange')
axs1[1].set_title('VAE Training and Validation Loss', fontsize=14)
axs1[1].set_xlabel('Epoch', fontsize=12)
axs1[1].set_ylabel('Loss', fontsize=12)
axs1[1].legend(fontsize=10)
axs1[1].grid(True, linestyle='--', alpha=0.5)
fig1.tight_layout(pad=3.0)

fig3, axs3 = plt.subplots(1, 1, figsize=(10, 8))
axs3.scatter(gen_means, gen_confidences, c='green', alpha=0.6, s=60)
axs3.set_title('VAE-Generated Band Gap vs Confidence', fontsize=14)
axs3.set_xlabel('Predicted Band Gap (eV)', fontsize=12)
axs3.set_ylabel('Confidence Score (%)', fontsize=12)
axs3.set_xlim(0.5, 3.5)
axs3.set_ylim(0, 100)
axs3.grid(True, linestyle='--', alpha=0.5)
fig3.tight_layout(pad=3.0)

fig4, axs4 = plt.subplots(1, 2, figsize=(20, 8))
plt.sca(axs4[0])
shap.summary_plot(shap_values, features=features, plot_type="bar", show=False)
axs4[0].set_title('SHAP Feature Importance (Bar)', fontsize=14)
plt.sca(axs4[1])
shap.summary_plot(shap_values, X_scaled[:5], feature_names=features, show=False)
axs4[1].set_title('SHAP Feature Importance (Beeswarm)', fontsize=14)
fig4.tight_layout(pad=3.0)
plt.show()
```
### Final Report
- **Dataset Size**: 13,212 samples
- **GNN Performance**:
- Final Training Loss: 0.0000
- Final Validation Loss: 0.0000
- Mean Predicted Band Gap: 1.6253 eV
- Mean Prediction Uncertainty (Std): 0.1373 eV
- Mean Absolute Error: 0.3693 eV
- **VAE Performance**:
- Final Training Loss: 139.5652
- Final Validation Loss: 136.5951
- Mean Generated Band Gap: 1.5983 eV
- Mean Confidence Score: 99.99%
## Nexa_Inference Application
- **Purpose**: Provide a unified platform for scientific ML predictions in biology (protein structure), astrophysics (stellar properties), and material science (material properties).
- **Access**: Available via a REST API that returns JSON-formatted predictions with confidence scores (0–100%); a hypothetical request is sketched below.
- **Implementation**: Integrates the latest Nexa models for scalable inference.
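A hypothetical request against such an endpoint, for illustration only: the route, field names, and response schema below are assumptions, not the documented Nexa_Inference API.
```python
import requests

# Hypothetical endpoint and payload; replace with the published Nexa_Inference route and schema.
response = requests.post(
    "https://api.example.com/nexa_inference/material_science/band_gap",
    json={
        "formation_energy_per_atom": -1.92,
        "energy_per_atom": -4.35,
        "density": 2.41,
        "volume": 118.7,
        "n_elements": 3,
        "li_fraction": 0.25,
    },
    timeout=30,
)
result = response.json()
print(result)  # expected shape: a prediction plus a 0-100% confidence score
```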
## Lessons Learned in Scientific Machine Learning
- **Technical Immaturity**:
- **Challenge**: Tools often lack real-world validation and are fragmented.
- **Solution**: Offer cost-effective, low-effort models through an API that returns JSON predictions and supports dataset prototyping.
- **Computational Constraints**:
- **Challenge**: High dimensionality and feature complexity.
- **Solution**: Use ensemble approaches with specialized architectures (e.g., CNN+BiLSTM for secondary structure prediction, achieving 70–80% accuracy vs. 85% for state-of-the-art models).
- **Integration and Adoption**:
- **Strategy**: Implement staggered releases with beta testing by researchers, followed by feedback-driven expansion.
- **Domains**: Physics, Biology, Material Science.
## Resources
[[Physics Node]]
[[Biology Node]]
[[Material Science Node]]
[[Λ-codex]]
[[Nexa_protocol]]
[[Geo-Intelligence Node(Coming Soon)]]
[[Socio-Intelligence Node(Coming Soon)]]
[[OSINT(Coming Soon)]]
[[Nexa_Ideas]]
[[Nexa_dev]]
[[Nexus_Lattice_Learn]]
[[Tools]]