---
title: "capesR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{capesR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(capesR)
```

# Introduction

The **capesR** package was developed to facilitate access to and manipulation of data from the Catalog of Theses and Dissertations of the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES). This catalog includes information on theses and dissertations defended at higher education institutions in Brazil, with variables such as:

- **institution**: Higher Education Institution.
- **area**: Academic field of the work.
- **region** and **state**: Location where the research was conducted.
- **abstract**: Brief description of the topic and objectives.

This package automates the process of obtaining and organizing this data, making it easily accessible for analysis and reporting.

The original CAPES data is available at [dadosabertos.capes.gov.br](https://dadosabertos.capes.gov.br/group/catalogo-de-teses-e-dissertacoes-brasil). The data used in this package is hosted on the [Open Science Framework (OSF)](https://osf.io/4a5b7/).

# Installation

To install the package, use:

```r
devtools::install_github("hugoavmedeiros/capesR")
```

# Functions

## Download Data

The `download_capes_data` function downloads CAPES data files hosted on OSF. You specify the desired years, and the corresponding files are saved locally.

### Example 1

Download data using the temporary directory (the function's default):

```r
library(capesR)
library(dplyr)

# Download data for 1987 and 1990
capes_files <- download_capes_data(c(1987, 1990))

# View the list of downloaded files
capes_files %>% glimpse()
```

In this case, the data will not persist for future uses.

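Files written to `tempdir()` live only as long as the R session. The sketch below uses only base R and does not call capesR; it contrasts the session temporary directory with a persistent folder. The folder name `capes_data` under the home directory is an assumption chosen for illustration.

```r
# Per-session temporary directory: R removes it when the session ends
session_dir <- tempdir()

# Hypothetical persistent location (an assumption for illustration)
persistent_dir <- file.path(path.expand("~"), "capes_data")

# The temporary directory exists for the duration of the session
dir.exists(session_dir)

# The persistent path stays the same across sessions, unlike tempdir()
persistent_dir != session_dir
```
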
### Example 2 - Reusing Data

It is recommended to define a persistent directory to store the downloaded data instead of using the default temporary directory (`tempdir()`). This allows you to reuse the data in future sessions.

```r
# Define the directory to store the data
# (adjust the path to a location you can write to)
data_directory <- "/capes_data"

# Download data for 1987 and 1990 using a persistent directory
capes_files <- download_capes_data(
  c(1987, 1990),
  destination = data_directory
)
```

When using a persistent directory, the data is downloaded only once. In later uses, the function identifies which files already exist in the directory and returns their paths.

## Combine Data

Use the `read_capes_data` function to combine the downloaded files, taking either the list returned by `download_capes_data` or a list created manually.

### Example 1 - Combining Data Without Filters

```r
# Combine all selected data without using filters
combined_data <- read_capes_data(capes_files)

# View the combined data
combined_data %>% glimpse()
```

### Example 2 - Combining Data with Exact Filters

Filters are applied before the data is read, improving performance.

```r
# Create an object with filters
# Note: capes_files must include the years being filtered (here 2021 and 2022)
exact_filter <- list(
  base_year = c(2021, 2022),
  state = c("PE", "CE")
)

# Combine filtered data
filtered_data <- read_capes_data(capes_files, exact_filter)

# View the filtered data
filtered_data %>% glimpse()
```

### Example 3 - Combining Data with Text Filters

Exact filters are applied before reading the data for better performance, and the text filter is optimized to accelerate the search.

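As a rough illustration of this two-stage idea (not capesR's internal implementation), the base-R sketch below first narrows a small made-up data frame with exact matches and only then runs the text search on the remaining rows. The data frame `toy`, its values, and the case-insensitive matching are all assumptions made for the example.

```r
# Made-up records for illustration only (not CAPES data)
toy <- data.frame(
  base_year = c(2018, 2019, 2021, 2022),
  state     = c("PE", "SP", "CE", "PE"),
  title     = c("Education policy", "Soil chemistry",
                "Rural education", "Urban mobility"),
  stringsAsFactors = FALSE
)

# Step 1: exact filters shrink the data first
exact <- toy[toy$base_year %in% 2018:2022 & toy$state %in% c("PE", "CE"), ]

# Step 2: the text search runs only on the remaining rows
# (case-insensitive matching is an assumption of this sketch)
hits <- exact[grepl("education", exact$title, ignore.case = TRUE), ]

hits$title
```

With capesR, the corresponding filter list is passed to `read_capes_data`:
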
```r
# Create an object with filters
text_filter <- list(
  base_year = c(2018, 2019, 2020, 2021, 2022),
  state = c("PE", "CE"),
  title = "education"
)

# Combine filtered data
text_filtered_data <- read_capes_data(capes_files, text_filter)

# View the filtered data
text_filtered_data %>% glimpse()
```

## Search Text

To search for text in already combined data, use the `search_capes_text` function, specifying the term and the text field (e.g., title, abstract, author, or advisor).

### Example

```r
results <- search_capes_text(
  data = combined_data,
  term = "education",
  field = "title"
)
```

# Data

## Synthetic Data

The package also provides a set of synthetic data, `capes_synthetic_df`, containing aggregated information from the CAPES Catalog of Theses and Dissertations. This synthetic dataset facilitates quick analyses and prototyping without requiring full downloads and processing.

### Data Structure

The synthetic data includes the following columns:

- **base_year**: Reference year of the data.
- **institution**: Higher Education Institution.
- **area**: Area of Concentration.
- **program_name**: Name of the Graduate Program.
- **type**: Type of work (e.g., Master's, Doctorate).
- **region**: Region of Brazil.
- **state**: Federative Unit (state).
- **n**: Total number of works.

### Loading the Data

The synthetic data is available directly in the package and can be loaded with:

```r
data(capes_synthetic_df)

# View the first rows of the data
head(capes_synthetic_df)
```

### Example Usage

You can use the synthetic data for quick exploratory analyses or charts:

```r
# Load the data
data(capes_synthetic_df)

# Example: Count by year and type of work
library(dplyr)

capes_synthetic_df %>%
  group_by(base_year, type) %>%
  summarise(total = sum(n), .groups = "drop") %>%  # drop grouping after summing
  arrange(desc(total))
```

# Conclusion

This tutorial presented the basic steps to use the **capesR** package.
For more information about the available functions, consult the documentation with:

```r
?years_osf
?capes_synthetic_df
?download_capes_data
?search_capes_text
?read_capes_data
```
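
To see the package's exported functions in one place, base R can list the namespace exports. This is a small sketch that assumes capesR is installed; the guard skips the listing otherwise. Datasets such as `capes_synthetic_df` may not appear in this list.

```r
# List the objects exported by capesR, if the package is installed
if (requireNamespace("capesR", quietly = TRUE)) {
  exports <- sort(getNamespaceExports("capesR"))
  print(exports)
} else {
  exports <- character(0)
  message("capesR is not installed; nothing to list.")
}
```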