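Data analyst agent: get your data's insights in the blink of an eye ✨

Authored by: Aymeric Roucher

This is an advanced tutorial. You should be familiar with the material in the other cookbook first!

In this notebook we will build a data analyst agent: a code agent armed with data analysis libraries that can load and transform dataframes, extract insights from them, and even plot the results!

Let's say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of each passenger. But before digging into this myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting some figures to find insights.

Let's set up this system.

Run the code below to install the required dependencies: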

[ ]

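We first create the agent. We use a ReactCodeAgent (read the documentation to learn more about agent types), so we do not even need to give it any tools: it can run code directly.

We only need to make sure it can use data science libraries, by passing them to the additional_authorized_imports parameter: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]

In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries that are installed.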
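A quick way to check the point above before creating the agent is to ask Python whether each top-level package can be found. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of any agent API.

```python
from importlib.util import find_spec

# The list quoted in the text above.
AUTHORIZED_IMPORTS = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]


def missing_packages(imports):
    """Return the top-level packages from `imports` that are not installed."""
    missing = []
    for name in imports:
        # Only look up the top-level package ("matplotlib.pyplot" -> "matplotlib")
        # so that nothing gets imported as a side effect of the check.
        top_level = name.split(".")[0]
        if find_spec(top_level) is None and top_level not in missing:
            missing.append(top_level)
    return missing


print("Missing packages:", missing_packages(AUTHORIZED_IMPORTS))
```

An empty list means every library you plan to authorize is available to the interpreter.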

⚙ Our agent will be powered by meta-llama/Meta-Llama-3.1-70B-Instruct through the HfEngine class, which uses HF's Inference API: the Inference API makes it quick and easy to run any open-source model.

[ ]
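The original code cell did not survive here, so the following is only a sketch of the setup described above. It assumes a transformers release that still ships `transformers.agents` with the `HfEngine` and `ReactCodeAgent` classes named in the text (later releases may have renamed or moved them), and that you are already authenticated for the Inference API. The function name `build_agent` is hypothetical.

```python
# Values quoted in the text above.
MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"
AUTHORIZED_IMPORTS = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]


def build_agent(model_id=MODEL_ID, imports=AUTHORIZED_IMPORTS):
    """Create a code agent with no tools and a list of authorized imports."""
    # Imported lazily so this module loads even where transformers is absent.
    # Assumption: these classes exist under transformers.agents in your version.
    from transformers.agents import HfEngine, ReactCodeAgent

    llm_engine = HfEngine(model_id)
    return ReactCodeAgent(
        tools=[],  # a code agent can run code directly, so no tools are needed
        llm_engine=llm_engine,
        additional_authorized_imports=list(imports),
    )
```

Calling `agent = build_agent()` then gives an object whose `run` method is used in the rest of the notebook.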

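Data analysis 📊🤔

When running the agent, we provide it with additional notes taken from the competition, and pass them as a keyword argument (kwarg) to the run method: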
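The cell that made this call is missing, so here is a hedged sketch of how such a call could look. The `additional_notes` keyword matches the argument name that appears in the agent log further down, and the notes excerpt is copied from that log. The task wording, the `run_analysis` helper and the `RecordingAgent` stand-in are illustrative assumptions. The stand-in only exists so that the example runs without a model.

```python
# Excerpt of the competition notes, as they appear in the agent log below.
ADDITIONAL_NOTES = """
### Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
"""


def run_analysis(agent, notes):
    """Send an analysis task to the agent, passing the notes as a kwarg."""
    # Assumption: this prompt is a paraphrase, not the notebook's exact prompt.
    task = (
        "You are an expert data analyst.\n"
        'Please load "titanic/train.csv" and analyze its content.\n'
        "Summarize the correlations and trends you find, "
        "and save your plots to the './figures/' folder."
    )
    return agent.run(task, additional_notes=notes)


class RecordingAgent:
    """Hypothetical stand-in for a real agent: it only records what it receives."""

    def __init__(self):
        self.calls = []

    def run(self, task, **kwargs):
        self.calls.append((task, kwargs))
        return f"received kwargs: {sorted(kwargs)}"


stub = RecordingAgent()
result = run_analysis(stub, ADDITIONAL_NOTES)
print(result)
```

With a real agent in place of the stub, the extra keyword arguments are made available to the agent as initial arguments, which is what the line "You have been provided with these initial arguments" in the log reflects.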

[4]
[ ]
[6]

Here are the correlations and trends found in the data:

1. **Correlation between age and survival rate**: The correlation is -0.0772, which suggests that as age increases, the survival rate decreases. This implies that older passengers were less likely to survive the Titanic disaster.

2. **Relationship between Pclass and survival rate**: The survival rates for each Pclass are:
   - Pclass 1: 62.96%
   - Pclass 2: 47.28%
   - Pclass 3: 24.24%
   This shows that passengers in higher socio-economic classes (Pclass 1 and 2) had a significantly higher survival rate compared to those in the lower class (Pclass 3).

3. **Relationship between fare and survival rate**: The correlation is 0.2573, which suggests a moderate positive relationship between fare and survival rate. This implies that passengers who paid higher fares were more likely to survive the disaster.
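Figures like the ones above can be obtained with a few pandas calls. The following is a minimal sketch, not the code the agent actually ran (that code is not shown in this notebook). It uses only the five rows that the `train_df.head()` preview prints later in the log, so its values will not match the full-dataset numbers.

```python
import pandas as pd

# First five rows of titanic/train.csv, copied from the head() preview in the log.
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0],
        "Pclass": [3, 1, 3, 1, 3],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Fare": [7.2500, 71.2833, 7.9250, 53.1000, 8.0500],
    }
)

# Pearson correlation between a numeric column and the survival flag.
age_corr = sample["Age"].corr(sample["Survived"])
fare_corr = sample["Fare"].corr(sample["Survived"])

# Survival rate per passenger class: the mean of a 0/1 column is a rate.
survival_by_class = sample.groupby("Pclass")["Survived"].mean()

print(f"Age vs. survival correlation:  {age_corr:.4f}")
print(f"Fare vs. survival correlation: {fare_corr:.4f}")
print(survival_by_class)
```

On the real data you would replace `sample` with `pd.read_csv("titanic/train.csv")`. Note that `Series.corr` skips rows where either value is missing, which matters for the Age column.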

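Impressive, isn't it? You could also give your agent a visualizer tool so that it can reflect on the charts it plots!

Data scientist agent: run predictions 🛠️

👉 Now let's go one step further: we will let our model run predictions on the data.

To do so, we also need to let it use sklearn, by adding it to additional_authorized_imports.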
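As a sketch of this step: the import list gains "sklearn", and the previous analysis is appended to the notes, which is consistent with the `additional_notes` value visible in the log below. The task text is copied from that log. The helper names are hypothetical, and `ReactCodeAgent` is assumed to be importable from `transformers.agents` in your installed version.

```python
BASE_IMPORTS = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
PREDICTION_IMPORTS = BASE_IMPORTS + ["sklearn"]

# Task text as it appears at the top of the agent log below.
PREDICTION_TASK = (
    "You are an expert machine learning engineer.\n"
    'Please train a ML model on "titanic/train.csv" to predict the survival '
    'for rows of "titanic/test.csv".\n'
    "Output the results under './output.csv'.\n"
    "Take care to import functions and modules before using them!"
)


def build_prediction_agent(llm_engine):
    """Create a second code agent that is also allowed to import sklearn."""
    # Assumption: ReactCodeAgent lives under transformers.agents in your version.
    from transformers.agents import ReactCodeAgent

    return ReactCodeAgent(
        tools=[],
        llm_engine=llm_engine,
        additional_authorized_imports=PREDICTION_IMPORTS,
    )


def run_prediction(agent, additional_notes, analysis):
    """Run the prediction task, giving the agent the notes plus the earlier analysis."""
    return agent.run(
        PREDICTION_TASK,
        additional_notes=additional_notes + "\n" + analysis,
    )
```

Passing the earlier analysis along lets the second run start from the trends already found instead of rediscovering them.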

[8]
======== New task ========
You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!

You have been provided with these initial arguments: {'additional_notes': "\n### Variable Notes\npclass: A proxy for socio-economic status (SES)\n1st = Upper\n2nd = Middle\n3rd = Lower\nage: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5\nsibsp: The dataset defines family relations in this way...\nSibling = brother, sister, stepbrother, stepsister\nSpouse = husband, wife (mistresses and fiancés were ignored)\nparch: The dataset defines family relations in this way...\nParent = mother, father\nChild = daughter, son, stepdaughter, stepson\nSome children travelled only with a nanny, therefore parch=0 for them.\n\nHere is the final answer:\n\n**Interesting Questions**\n\n1. Is there a correlation between socio-economic status (Pclass) and survival rate?\n2. Is there a correlation between age and survival rate?\n3. Is there a correlation between family size and survival rate?\n\n**Answers to Questions**\n\n1. Correlation between Pclass and survival rate: -0.338\nInsight: Passengers from lower socio-economic backgrounds were less likely to survive.\n2. Correlation between age and survival rate: -0.077\nInsight: Older passengers were slightly less likely to survive.\n3. Correlation between family size and survival rate: 0.017\nInsight: Passengers traveling with larger families were slightly more likely to survive, but this correlation is very weak and may not be significant.\n\n**Summary of Correlations and Trends**\n\nThe analysis of the Titanic dataset reveals several interesting trends and correlations. Firstly, the socio-economic status of passengers played a significant role in their survival, with passengers from lower socio-economic backgrounds being less likely to survive. Secondly, age was a weak predictor of survival, with older passengers being slightly less likely to survive. 
Finally, family size had a very weak positive correlation with survival rate, suggesting that passengers traveling with larger families were slightly more likely to survive, but this correlation is very weak and may not be significant.\n\n**Plots**\n\n(Attached are the plots generated using matplotlib/seaborn and saved to the './figures/' folder)\n\nI hope this meets the requirements!"}.
==== Agent is executing the code below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')

print(train_df.head())
print(test_df.head())
====
Print outputs:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S  

==== Agent is executing the code below:
# Handle missing values
train_df['Age'].fillna(train_df['Age'].median(), inplace=True)
test_df['Age'].fillna(test_df['Age'].median(), inplace=True)

train_df['Cabin'].fillna('Unknown', inplace=True)
test_df['Cabin'].fillna('Unknown', inplace=True)

# Encode categorical variables
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

train_df['Embarked'] = train_df['Embarked'].fillna('S')
test_df['Embarked'] = test_df['Embarked'].fillna('S')

train_df['Embarked'] = train_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
test_df['Embarked'] = test_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

print(train_df.head())
print(test_df.head())
====
Print outputs:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name  Sex   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris    0  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0      1      0   
2                             Heikkinen, Miss. Laina    1  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)    1  35.0      1      0   
4                           Allen, Mr. William Henry    0  35.0      0      0   

             Ticket     Fare    Cabin  Embarked  
0         A/5 21171   7.2500  Unknown         0  
1          PC 17599  71.2833      C85         1  
2  STON/O2. 3101282   7.9250  Unknown         0  
3            113803  53.1000     C123         0  
4            373450   8.0500  Unknown         0  
   PassengerId  Pclass                                          Name  Sex  \
0          892       3                              Kelly, Mr. James    0   
1          893       3              Wilkes, Mrs. James (Ellen Needs)    1   
2          894       2                     Myles, Mr. Thomas Francis    0   
3          895       3                              Wirz, Mr. Albert    0   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)    1   

    Age  SibSp  Parch   Ticket     Fare    Cabin  Embarked  
0  34.5      0      0   330911   7.8292  Unknown         2  
1  47.0      1      0   363272   7.0000  Unknown         0  
2  62.0      0      0   240276   9.6875  Unknown         2  
3  27.0      0      0   315154   8.6625  Unknown         0  
4  22.0      1      1  3101298  12.2875  Unknown         0  

==== Agent is executing the code below:
# Split data into features (X) and target (y)
X = train_df.drop(['Survived', 'PassengerId', 'Name', 'Ticket'], axis=1)
y = train_df['Survived']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier model
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

print("Training accuracy:", accuracy_score(y_train, rfc.predict(X_train)))
print("Validation accuracy:", accuracy_score(y_val, rfc.predict(X_val)))
====
Code execution failed due to the following error:
could not convert string to float: 'C124'
Traceback (most recent call last):
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py", line 1054, in step
    result = self.python_evaluator(
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 893, in evaluate_python_code
    result = evaluate_ast(node, state, static_tools, custom_tools, authorized_imports)
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 780, in evaluate_ast
    return evaluate_ast(expression.value, state, static_tools, custom_tools)
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 744, in evaluate_ast
    return evaluate_call(expression, state, static_tools, custom_tools)
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 439, in evaluate_call
    output = func(*args, **kwargs)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 363, in fit
    X, y = self._validate_data(
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1263, in check_X_y
    X = check_array(
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/utils/validation.py", line 997, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/utils/_array_api.py", line 521, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/pandas/core/generic.py", line 2153, in __array__
    arr = np.asarray(values, dtype=dtype)
ValueError: could not convert string to float: 'C124'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py", line 756, in direct_run
    step_logs = self.step()
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py", line 1072, in step
    raise AgentExecutionError(error_msg)
transformers.agents.agents.AgentExecutionError: Code execution failed due to the following error:
could not convert string to float: 'C124'
==== Agent is executing the code below:
# One-hot encode the Cabin feature
cabin_dummies = pd.get_dummies(train_df['Cabin'])
test_cabin_dummies = pd.get_dummies(test_df['Cabin'])

X = pd.concat([train_df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1), cabin_dummies], axis=1)
test_X = pd.concat([test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1), test_cabin_dummies], axis=1)

y = train_df['Survived']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier model
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

print("Training accuracy:", accuracy_score(y_train, rfc.predict(X_train)))
print("Validation accuracy:", accuracy_score(y_val, rfc.predict(X_val)))
====
Print outputs:
Training accuracy: 0.9845505617977528
Validation accuracy: 0.7932960893854749

==== Agent is executing the code below:
# Make predictions on the test data
predictions = rfc.predict(test_X)

# Save the predictions to a submission file
submission_df = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': predictions
})

submission_df.to_csv('./output.csv', index=False)
final_answer("Submission file saved to./output.csv")
====
Code execution failed due to the following error:
The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- A11
- A18
- A21
- A29
- A9
- ...
Feature names seen at fit time, yet now missing:
- A10
- A14
- A16
- A19
- A20
- ...

Traceback (most recent call last):
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py", line 1054, in step
    result = self.python_evaluator(
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 893, in evaluate_python_code
    result = evaluate_ast(node, state, static_tools, custom_tools, authorized_imports)
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 739, in evaluate_ast
    return evaluate_assign(expression, state, static_tools, custom_tools)
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 322, in evaluate_assign
    result = evaluate_ast(assign.value, state, static_tools, custom_tools)
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 744, in evaluate_ast
    return evaluate_call(expression, state, static_tools, custom_tools)
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/python_interpreter.py", line 439, in evaluate_call
    output = func(*args, **kwargs)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 905, in predict
    proba = self.predict_proba(X)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 947, in predict_proba
    X = self._validate_X_predict(X)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 641, in _validate_X_predict
    X = self._validate_data(
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/base.py", line 608, in _validate_data
    self._check_feature_names(X, reset=reset)
  File "/Users/aymeric/venvs/disposable/lib/python3.10/site-packages/sklearn/base.py", line 535, in _check_feature_names
    raise ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- A11
- A18
- A21
- A29
- A9
- ...
Feature names seen at fit time, yet now missing:
- A10
- A14
- A16
- A19
- A20
- ...


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py", line 756, in direct_run
    step_logs = self.step()
  File "/Users/aymeric/Documents/Code/original_transformers/transformers/src/transformers/agents/agents.py", line 1072, in step
    raise AgentExecutionError(error_msg)
transformers.agents.agents.AgentExecutionError: Code execution failed due to the following error:
The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- A11
- A18
- A21
- A29
- A9
- ...
Feature names seen at fit time, yet now missing:
- A10
- A14
- A16
- A19
- A20
- ...

==== Agent is executing the code below:
# Get the common cabin values in both training and test data
common_cabins = set(train_df['Cabin'].unique()) & set(test_df['Cabin'].unique())

# Filter the cabin dummies to only include the common cabin values
cabin_dummies = pd.get_dummies(train_df['Cabin'].map(lambda x: x if x in common_cabins else 'Unknown'))
test_cabin_dummies = pd.get_dummies(test_df['Cabin'].map(lambda x: x if x in common_cabins else 'Unknown'))

X = pd.concat([train_df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1), cabin_dummies], axis=1)
test_X = pd.concat([test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1), test_cabin_dummies], axis=1)

y = train_df['Survived']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier model
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

print("Training accuracy:", accuracy_score(y_train, rfc.predict(X_train)))
print("Validation accuracy:", accuracy_score(y_val, rfc.predict(X_val)))

# Make predictions on the test data
predictions = rfc.predict(test_X)

# Save the predictions to a submission file
submission_df = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': predictions
})

submission_df.to_csv('./output.csv', index=False)
final_answer("Submission file saved to./output.csv")
====
Print outputs:
Training accuracy: 0.9803370786516854
Validation accuracy: 0.8100558659217877

>>> Final answer:
Submission file saved to./output.csv
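The feature-name error in the log above comes from calling `pd.get_dummies` separately on the train and test Cabin columns, which yields different sets of columns. The agent resolved it by keeping only the cabin values common to both files. A different way to get matching columns, shown here as a sketch on made-up toy values (this is not what the agent did), is to reindex the test dummies onto the training columns.

```python
import pandas as pd

# Toy cabin values for illustration only, not the real dataset.
train_cabin = pd.Series(["C85", "Unknown", "C123", "Unknown"], name="Cabin")
test_cabin = pd.Series(["Unknown", "A11", "C85"], name="Cabin")

train_dummies = pd.get_dummies(train_cabin)
test_dummies = pd.get_dummies(test_cabin)

# Columns present in only one of the two frames: this is what sklearn rejects.
mismatch = set(train_dummies.columns) ^ set(test_dummies.columns)
print("Mismatched columns:", sorted(mismatch))

# Keep exactly the training columns, in the same order:
# columns unseen at fit time are dropped, missing ones are filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

Either approach gives the model the same feature names at fit time and at predict time. With this one, cabins that appear only in the test file simply become all-zero rows.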

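The test predictions that the agent produced, once submitted to Kaggle, score 0.78229, which ranks #2824 out of 17,360 entrants, and is better than what I painfully achieved when I first tried this challenge years ago.

Your results may vary, but either way I find it very impressive that an agent can achieve this in a few seconds.

🚀 The above is just a naive attempt at a data analyst agent: it can certainly be improved in many ways to better fit your use case!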