Corrigé des exercices
UE8: introduction à l’analyse de données du master Santé de l’ISTR
Imports de bibliothèques et chargement des données
import pandas
df = pandas.read_excel("Donnees_M2_RD.xlsx")
On affiche df
pour avoir une idée des données ous les yeux
df
Subject | Name_A | Name_B | Dist_A | Dist_B | Mode | Space | Side | Response | RT | |
---|---|---|---|---|---|---|---|---|---|---|
0 | P_ADI_331 | 0 | 2 | 2 | 4 | Dic | E | D | 2 | 18865 |
1 | P_ADI_331 | 1 | 4 | 4 | 1 | Dic | E | D | 2 | 13157 |
2 | P_ADI_331 | 4 | 3 | 3 | 2 | Dic | E | D | 1 | 11628 |
3 | P_ADI_331 | 2 | 4 | 4 | 1 | Dic | E | D | 1 | 10068 |
4 | P_ADI_331 | 1 | 2 | 2 | 4 | Dic | E | D | 1 | 11801 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9589 | P_VAR_330 | 0 | 1 | 3 | 5 | Dio | I | D | 1 | 7626 |
9590 | P_VAR_330 | 3 | 2 | 5 | 1 | Dio | I | D | 2 | 6349 |
9591 | P_VAR_330 | 2 | 0 | 4 | 2 | Dio | I | D | 2 | 9031 |
9592 | P_VAR_330 | 0 | 2 | 2 | 1 | Dio | I | D | 2 | 16323 |
9593 | P_VAR_330 | 0 | 3 | 5 | 1 | Dio | I | D | 2 | 10139 |
9594 rows × 10 columns
Pandas
Projections et filtres simples
Donner les valeurs de la colonne “RT”
df["RT"]
0 18865
1 13157
2 11628
3 10068
4 11801
...
9589 7626
9590 6349
9591 9031
9592 16323
9593 10139
Name: RT, Length: 9594, dtype: int64
Donner les différentes valeurs possible pour la colonne “Subject”
df["Subject"].drop_duplicates()
0 P_ADI_331
400 P_ALM_345
800 P_AMY_346
1200 P_BAM_347
1600 P_BEH_340
2000 P_BLC_325
2399 P_BLR_321
2798 P_BOA_321
3197 P_BOC_342
3597 P_CAR_327
3995 P_CAV_333
4395 P_CON_336
4795 P_GAM_338
5195 P_GHM_334
5595 P_GRC_341
5995 P_GRF_322
6394 P_LAC_354
6794 P_LEG_335
7194 P_MOE_339
7594 P_ROS_336
7994 P_SOA_337
8394 P_TAI_343
8794 P_VAL_329
9194 P_VAR_330
Name: Subject, dtype: object
Filtrer la dataframe pour ne garder que les lignes dont le “RT” est plus petit que 10 000.
cond = (df["RT"] < 10000)
cond
0 False
1 False
2 False
3 False
4 False
...
9589 True
9590 True
9591 True
9592 False
9593 False
Name: RT, Length: 9594, dtype: bool
df[cond]
Subject | Name_A | Name_B | Dist_A | Dist_B | Mode | Space | Side | Response | RT | |
---|---|---|---|---|---|---|---|---|---|---|
20 | P_ADI_331 | 1 | 2 | 3 | 5 | Dic | E | D | 1 | 8564 |
24 | P_ADI_331 | 1 | 4 | 1 | 5 | Dic | E | D | 1 | 7805 |
27 | P_ADI_331 | 2 | 0 | 3 | 1 | Dic | E | D | 2 | 9533 |
29 | P_ADI_331 | 1 | 4 | 3 | 2 | Dic | E | D | 2 | 9887 |
33 | P_ADI_331 | 0 | 4 | 5 | 2 | Dic | E | D | 2 | 8152 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9587 | P_VAR_330 | 2 | 3 | 4 | 1 | Dio | I | D | 2 | 7823 |
9588 | P_VAR_330 | 0 | 3 | 3 | 1 | Dio | I | D | 2 | 6153 |
9589 | P_VAR_330 | 0 | 1 | 3 | 5 | Dio | I | D | 1 | 7626 |
9590 | P_VAR_330 | 3 | 2 | 5 | 1 | Dio | I | D | 2 | 6349 |
9591 | P_VAR_330 | 2 | 0 | 4 | 2 | Dio | I | D | 2 | 9031 |
4528 rows × 10 columns
Combiner les projections et les filtres
Indiquer la valeur de “RT” pour les essais en mode “Dic”.
df[df["Mode"] == "Dic"]["RT"]
0 18865
1 13157
2 11628
3 10068
4 11801
...
9539 11587
9540 7388
9541 5454
9542 11399
9543 5379
Name: RT, Length: 4795, dtype: int64
Créer une Series indiquant pour chaque essai si la réponse est juste ou non.
(((df["Dist_A"] > df["Dist_B"]) & (df["Response"] == 2))
| ((df["Dist_A"] < df["Dist_B"]) & (df["Response"] == 1)))
0 False
1 True
2 False
3 False
4 True
...
9589 True
9590 True
9591 True
9592 True
9593 True
Length: 9594, dtype: bool
Indiquer les “Suject” qui ont répondu en moins de 4 000 (“RT” inférieur à 4 000)
df[df["RT"] < 4000]["Subject"].drop_duplicates()
645 P_ALM_345
883 P_AMY_346
1306 P_BAM_347
2407 P_BLR_321
3263 P_BOC_342
4133 P_CAV_333
4458 P_CON_336
5306 P_GHM_334
5628 P_GRC_341
7188 P_LEG_335
7452 P_MOE_339
7941 P_ROS_336
9252 P_VAR_330
Name: Subject, dtype: object
Agrégations globales
Donner la moyenne de “RT”.
df["RT"].mean()
12050.088597039816
Donner le plus grand “RT” pour le “Subject” “P_ROS_336”
df[df["Subject"] == "P_ROS_336"]["RT"].max()
54405
Donner l’écart-type de “RT” pour les essais réussis (utiliser .std()
)
reussis = ((df["Dist_A"] > df["Dist_B"]) & (df["Response"] == 2)) | ((df["Dist_A"] < df["Dist_B"]) & (df["Response"] == 1))
df[reussis]["RT"].std()
6505.738105682206
Création de colonnes
Ajouter une colonne “nombre” valant 1
df["nombre"] = 1
df
Subject | Name_A | Name_B | Dist_A | Dist_B | Mode | Space | Side | Response | RT | nombre | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | P_ADI_331 | 0 | 2 | 2 | 4 | Dic | E | D | 2 | 18865 | 1 |
1 | P_ADI_331 | 1 | 4 | 4 | 1 | Dic | E | D | 2 | 13157 | 1 |
2 | P_ADI_331 | 4 | 3 | 3 | 2 | Dic | E | D | 1 | 11628 | 1 |
3 | P_ADI_331 | 2 | 4 | 4 | 1 | Dic | E | D | 1 | 10068 | 1 |
4 | P_ADI_331 | 1 | 2 | 2 | 4 | Dic | E | D | 1 | 11801 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9589 | P_VAR_330 | 0 | 1 | 3 | 5 | Dio | I | D | 1 | 7626 | 1 |
9590 | P_VAR_330 | 3 | 2 | 5 | 1 | Dio | I | D | 2 | 6349 | 1 |
9591 | P_VAR_330 | 2 | 0 | 4 | 2 | Dio | I | D | 2 | 9031 | 1 |
9592 | P_VAR_330 | 0 | 2 | 2 | 1 | Dio | I | D | 2 | 16323 | 1 |
9593 | P_VAR_330 | 0 | 3 | 5 | 1 | Dio | I | D | 2 | 10139 | 1 |
9594 rows × 11 columns
Ajouter une colonne “reussi” qui indique si l’essai est réussi.
df["reussi"] = reussis
df
Subject | Name_A | Name_B | Dist_A | Dist_B | Mode | Space | Side | Response | RT | nombre | reussi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P_ADI_331 | 0 | 2 | 2 | 4 | Dic | E | D | 2 | 18865 | 1 | False |
1 | P_ADI_331 | 1 | 4 | 4 | 1 | Dic | E | D | 2 | 13157 | 1 | True |
2 | P_ADI_331 | 4 | 3 | 3 | 2 | Dic | E | D | 1 | 11628 | 1 | False |
3 | P_ADI_331 | 2 | 4 | 4 | 1 | Dic | E | D | 1 | 10068 | 1 | False |
4 | P_ADI_331 | 1 | 2 | 2 | 4 | Dic | E | D | 1 | 11801 | 1 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9589 | P_VAR_330 | 0 | 1 | 3 | 5 | Dio | I | D | 1 | 7626 | 1 | True |
9590 | P_VAR_330 | 3 | 2 | 5 | 1 | Dio | I | D | 2 | 6349 | 1 | True |
9591 | P_VAR_330 | 2 | 0 | 4 | 2 | Dio | I | D | 2 | 9031 | 1 | True |
9592 | P_VAR_330 | 0 | 2 | 2 | 1 | Dio | I | D | 2 | 16323 | 1 | True |
9593 | P_VAR_330 | 0 | 3 | 5 | 1 | Dio | I | D | 2 | 10139 | 1 | True |
9594 rows × 12 columns
Ajouter une colonne “nombre_reussis” qui vaut 1 si l’essai est réussi et 0 sinon.
df_r = pandas.DataFrame(df[df["reussi"]])
df_r["nombre_reussis"] = 1
df_e = pandas.DataFrame(df[~ reussis])
df_e["nombre_reussis"] = 0
df2 = pandas.concat([df_r,df_e])
df2
Subject | Name_A | Name_B | Dist_A | Dist_B | Mode | Space | Side | Response | RT | nombre | reussi | nombre_reussis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | P_ADI_331 | 1 | 4 | 4 | 1 | Dic | E | D | 2 | 13157 | 1 | True | 1 |
4 | P_ADI_331 | 1 | 2 | 2 | 4 | Dic | E | D | 1 | 11801 | 1 | True | 1 |
6 | P_ADI_331 | 2 | 1 | 3 | 4 | Dic | E | D | 1 | 16347 | 1 | True | 1 |
7 | P_ADI_331 | 0 | 3 | 2 | 4 | Dic | E | D | 1 | 13237 | 1 | True | 1 |
8 | P_ADI_331 | 2 | 0 | 4 | 2 | Dic | E | D | 2 | 12589 | 1 | True | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9579 | P_VAR_330 | 4 | 3 | 3 | 4 | Dio | I | D | 2 | 12271 | 1 | False | 0 |
9581 | P_VAR_330 | 3 | 2 | 5 | 4 | Dio | I | D | 1 | 11327 | 1 | False | 0 |
9582 | P_VAR_330 | 1 | 0 | 3 | 5 | Dio | I | D | 2 | 15942 | 1 | False | 0 |
9583 | P_VAR_330 | 0 | 4 | 4 | 5 | Dio | I | D | 2 | 45627 | 1 | False | 0 |
9585 | P_VAR_330 | 0 | 3 | 2 | 3 | Dio | I | D | 2 | 16671 | 1 | False | 0 |
9594 rows × 13 columns
Méthode alternative avec .loc[condition,colonne]
df.loc[df["reussi"], "nombre_reussis"] = 1
df.loc[~ (df["reussi"]), "nombre_reussis"] = 0
df
Subject | Name_A | Name_B | Dist_A | Dist_B | Mode | Space | Side | Response | RT | nombre | reussi | nombre_reussis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P_ADI_331 | 0 | 2 | 2 | 4 | Dic | E | D | 2 | 18865 | 1 | False | 0.0 |
1 | P_ADI_331 | 1 | 4 | 4 | 1 | Dic | E | D | 2 | 13157 | 1 | True | 1.0 |
2 | P_ADI_331 | 4 | 3 | 3 | 2 | Dic | E | D | 1 | 11628 | 1 | False | 0.0 |
3 | P_ADI_331 | 2 | 4 | 4 | 1 | Dic | E | D | 1 | 10068 | 1 | False | 0.0 |
4 | P_ADI_331 | 1 | 2 | 2 | 4 | Dic | E | D | 1 | 11801 | 1 | True | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9589 | P_VAR_330 | 0 | 1 | 3 | 5 | Dio | I | D | 1 | 7626 | 1 | True | 1.0 |
9590 | P_VAR_330 | 3 | 2 | 5 | 1 | Dio | I | D | 2 | 6349 | 1 | True | 1.0 |
9591 | P_VAR_330 | 2 | 0 | 4 | 2 | Dio | I | D | 2 | 9031 | 1 | True | 1.0 |
9592 | P_VAR_330 | 0 | 2 | 2 | 1 | Dio | I | D | 2 | 16323 | 1 | True | 1.0 |
9593 | P_VAR_330 | 0 | 3 | 5 | 1 | Dio | I | D | 2 | 10139 | 1 | True | 1.0 |
9594 rows × 13 columns
Groupes simples
Donner la moyenne du temps de réaction (“RT”) pour chaque “Subject”.
df.groupby(by="Subject")["RT"].mean()
Subject
P_ADI_331 10042.732500
P_ALM_345 13103.350000
P_AMY_346 9561.292500
P_BAM_347 9096.540000
P_BEH_340 13187.195000
P_BLC_325 11366.338346
P_BLR_321 9657.929825
P_BOA_321 15787.130326
P_BOC_342 12779.680000
P_CAR_327 12457.844221
P_CAV_333 10551.257500
P_CON_336 8352.742500
P_GAM_338 11805.215000
P_GHM_334 11220.800000
P_GRC_341 11213.115000
P_GRF_322 15652.556391
P_LAC_354 15617.710000
P_LEG_335 10904.325000
P_MOE_339 8470.447500
P_ROS_336 16019.425000
P_SOA_337 10051.175000
P_TAI_343 16413.327500
P_VAL_329 13661.730000
P_VAR_330 12240.965000
Name: RT, dtype: float64
Donner le nombre d’essais pour chaque “Subject” en utilisant .count()
.
df.groupby(by="Subject")["RT"].count()
Subject
P_ADI_331 400
P_ALM_345 400
P_AMY_346 400
P_BAM_347 400
P_BEH_340 400
P_BLC_325 399
P_BLR_321 399
P_BOA_321 399
P_BOC_342 400
P_CAR_327 398
P_CAV_333 400
P_CON_336 400
P_GAM_338 400
P_GHM_334 400
P_GRC_341 400
P_GRF_322 399
P_LAC_354 400
P_LEG_335 400
P_MOE_339 400
P_ROS_336 400
P_SOA_337 400
P_TAI_343 400
P_VAL_329 400
P_VAR_330 400
Name: RT, dtype: int64
Donner le nombre d’essais pour chaque “Subject” en utilisant .sum()
et la colonne “nombre”.
df.groupby(by="Subject")["nombre"].sum()
Subject
P_ADI_331 400
P_ALM_345 400
P_AMY_346 400
P_BAM_347 400
P_BEH_340 400
P_BLC_325 399
P_BLR_321 399
P_BOA_321 399
P_BOC_342 400
P_CAR_327 398
P_CAV_333 400
P_CON_336 400
P_GAM_338 400
P_GHM_334 400
P_GRC_341 400
P_GRF_322 399
P_LAC_354 400
P_LEG_335 400
P_MOE_339 400
P_ROS_336 400
P_SOA_337 400
P_TAI_343 400
P_VAL_329 400
P_VAR_330 400
Name: nombre, dtype: int64
Donner le nombre d’essais réussis pour chaque “Subject” en utilisant un filtre puis .count()
.
df[df["reussi"]].groupby(by="Subject")["RT"].count()
Subject
P_ADI_331 328
P_ALM_345 324
P_AMY_346 323
P_BAM_347 312
P_BEH_340 337
P_BLC_325 307
P_BLR_321 303
P_BOA_321 336
P_BOC_342 299
P_CAR_327 270
P_CAV_333 345
P_CON_336 333
P_GAM_338 344
P_GHM_334 328
P_GRC_341 322
P_GRF_322 291
P_LAC_354 319
P_LEG_335 323
P_MOE_339 350
P_ROS_336 337
P_SOA_337 305
P_TAI_343 316
P_VAL_329 280
P_VAR_330 314
Name: RT, dtype: int64
Donner le nombre d’essais réussis pour chaque “Subject” en utilisant .sum()
et la colonne “nombre_reussis”.
df.groupby(by="Subject")["nombre_reussis"].sum()
Subject
P_ADI_331 328.0
P_ALM_345 324.0
P_AMY_346 323.0
P_BAM_347 312.0
P_BEH_340 337.0
P_BLC_325 307.0
P_BLR_321 303.0
P_BOA_321 336.0
P_BOC_342 299.0
P_CAR_327 270.0
P_CAV_333 345.0
P_CON_336 333.0
P_GAM_338 344.0
P_GHM_334 328.0
P_GRC_341 322.0
P_GRF_322 291.0
P_LAC_354 319.0
P_LEG_335 323.0
P_MOE_339 350.0
P_ROS_336 337.0
P_SOA_337 305.0
P_TAI_343 316.0
P_VAL_329 280.0
P_VAR_330 314.0
Name: nombre_reussis, dtype: float64